algorithms computer architecture: Topics by Science.gov

Sample records for algorithms computer architecture

Computer architecture for efficient algorithmic executions in real-time systems: New technology for avionics systems and advanced space vehicles

NASA Technical Reports Server (NTRS)

Carroll, Chester C.; Youngblood, John N.; Saha, Aindam

1987-01-01

Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Computer architecture for efficient algorithmic executions in real-time systems: new technology for avionics systems and advanced space vehicles

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carroll, C.C.; Youngblood, J.N.; Saha, A.

1987-12-01

Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processingmore » elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.« less
Data Compression for Maskless Lithography Systems: Architecture, Algorithms and Implementation

DTIC Science & Technology

2008-05-19

Data Compression for Maskless Lithography Systems: Architecture, Algorithms and Implementation Vito Dai Electrical Engineering and Computer Sciences...servers or to redistribute to lists, requires prior specific permission. Data Compression for Maskless Lithography Systems: Architecture, Algorithms and...for Maskless Lithography Systems: Architecture, Algorithms and Implementation Copyright 2008 by Vito Dai 1 Abstract Data Compression for Maskless
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Choudhary, Alok Nidhi

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Parallel, stochastic measurement of molecular surface area.

PubMed

Juba, Derek; Varshney, Amitabh

2008-08-01

Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
Hardware architecture design of image restoration based on time-frequency domain computation

NASA Astrophysics Data System (ADS)

Wen, Bo; Zhang, Jing; Jiao, Zipeng

2013-10-01

The image restoration algorithms based on time-frequency domain computation is high maturity and applied widely in engineering. To solve the high-speed implementation of these algorithms, the TFDC hardware architecture is proposed. Firstly, the main module is designed, by analyzing the common processing and numerical calculation. Then, to improve the commonality, the iteration control module is planed for iterative algorithms. In addition, to reduce the computational cost and memory requirements, the necessary optimizations are suggested for the time-consuming module, which include two-dimensional FFT/IFFT and the plural calculation. Eventually, the TFDC hardware architecture is adopted for hardware design of real-time image restoration system. The result proves that, the TFDC hardware architecture and its optimizations can be applied to image restoration algorithms based on TFDC, with good algorithm commonality, hardware realizability and high efficiency.
Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

NASA Technical Reports Server (NTRS)

Weeks, Cindy Lou

1986-01-01

Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
A Low Cost VLSI Architecture for Spike Sorting Based on Feature Extraction with Peak Search.

PubMed

Chang, Yuan-Jyun; Hwang, Wen-Jyi; Chen, Chih-Chang

2016-12-07

The goal of this paper is to present a novel VLSI architecture for spike sorting with high classification accuracy, low area costs and low power consumption. A novel feature extraction algorithm with low computational complexities is proposed for the design of the architecture. In the feature extraction algorithm, a spike is separated into two portions based on its peak value. The area of each portion is then used as a feature. The algorithm is simple to implement and less susceptible to noise interference. Based on the algorithm, a novel architecture capable of identifying peak values and computing spike areas concurrently is proposed. To further accelerate the computation, a spike can be divided into a number of segments for the local feature computation. The local features are subsequently merged with the global ones by a simple hardware circuit. The architecture can also be easily operated in conjunction with the circuits for commonly-used spike detection algorithms, such as the Non-linear Energy Operator (NEO). The architecture has been implemented by an Application-Specific Integrated Circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture is well suited for real-time multi-channel spike detection and feature extraction requiring low hardware area costs, low power consumption and high classification accuracy.
Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fijany, A.; Milman, M.; Redding, D.

1994-12-31

In this paper massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. The authors have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although, this represents an optimal computation, due the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm,more » designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other Fast Poisson Solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.« less
Solving the Cauchy-Riemann equations on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations; a set of coupled first order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, and SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
Parallel computer vision

DOE Office of Scientific and Technical Information (OSTI.GOV)

Uhr, L.

1987-01-01

This book is written by research scientists involved in the development of massively parallel, but hierarchically structured, algorithms, architectures, and programs for image processing, pattern recognition, and computer vision. The book gives an integrated picture of the programs and algorithms that are being developed, and also of the multi-computer hardware architectures for which these systems are designed.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1988-01-01

The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms

PubMed Central

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.

2011-01-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.

PubMed

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W

2012-06-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
A novel strategy for load balancing of distributed medical applications.

PubMed

Logeswaran, Rajasvaran; Chen, Li-Choo

2012-04-01

Current trends in medicine, specifically in the electronic handling of medical applications, ranging from digital imaging, paperless hospital administration and electronic medical records, telemedicine, to computer-aided diagnosis, creates a burden on the network. Distributed Service Architectures, such as Intelligent Network (IN), Telecommunication Information Networking Architecture (TINA) and Open Service Access (OSA), are able to meet this new challenge. Distribution enables computational tasks to be spread among multiple processors; hence, performance is an important issue. This paper proposes a novel approach in load balancing, the Random Sender Initiated Algorithm, for distribution of tasks among several nodes sharing the same computational object (CO) instances in Distributed Service Architectures. Simulations illustrate that the proposed algorithm produces better network performance than the benchmark load balancing algorithms-the Random Node Selection Algorithm and the Shortest Queue Algorithm, especially under medium and heavily loaded conditions.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1988-01-01

The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures

DTIC Science & Technology

2017-10-04

Report: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures The views, opinions and/or findings contained in this...Chapel Hill Title: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures Report Term: 0-Other Email: dm...algorithms for scientific and geometric computing by exploiting the power and performance efficiency of heterogeneous shared memory architectures . These
Architecutres, Models, Algorithms, and Software Tools for Configurable Computing

DTIC Science & Technology

2000-03-06

and J.G. Nash. The gated interconnection network for dynamic programming. Plenum, 1988 . [18] Ju wook Jang, Heonchul Park, and Viktor K. Prasanna. A ...Sep. 1997. [2] C. Ebeling, D. C. Cronquist , P. Franklin and C. Fisher, "RaPiD - A configurable computing architecture for compute-intensive...ABSTRACT (Maximum 200 words) The Models, Algorithms, and Architectures for Reconfigurable Computing (MAARC) project developed a sound framework for
Computer vision camera with embedded FPGA processing

NASA Astrophysics Data System (ADS)

Lecerf, Antoine; Ouellet, Denis; Arias-Estrada, Miguel

2000-03-01

Traditional computer vision is based on a camera-computer system in which the image understanding algorithms are embedded in the computer. To circumvent the computational load of vision algorithms, low-level processing and imaging hardware can be integrated in a single compact module where a dedicated architecture is implemented. This paper presents a Computer Vision Camera based on an open architecture implemented in an FPGA. The system is targeted to real-time computer vision tasks where low level processing and feature extraction tasks can be implemented in the FPGA device. The camera integrates a CMOS image sensor, an FPGA device, two memory banks, and an embedded PC for communication and control tasks. The FPGA device is a medium size one equivalent to 25,000 logic gates. The device is connected to two high speed memory banks, an IS interface, and an imager interface. The camera can be accessed for architecture programming, data transfer, and control through an Ethernet link from a remote computer. A hardware architecture can be defined in a Hardware Description Language (like VHDL), simulated and synthesized into digital structures that can be programmed into the FPGA and tested on the camera. The architecture of a classical multi-scale edge detection algorithm based on a Laplacian of Gaussian convolution has been developed to show the capabilities of the system.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1987-01-01

The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a spatial distributed computer environment is presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.

Sigint Application for Polymorphous Computing Architecture (PCA): Wideband DF

DTIC Science & Technology

2006-08-01

Polymorphous Computing Architecture (PCA) program as stated by Robert Graybill is to Develop the computing foundation for agile systems by establishing...ubiquitous MUSIC algorithm rely upon an underlying narrowband signal model [8]. In this case, narrowband means that the signal bandwidth is less than...a wideband DF algorithm is needed to compensate for this model inadequacy. Among the various wideband DF techniques available, the coherent signal
An S N Algorithm for Modern Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Baker, Randal Scott

2016-08-29

LANL discrete ordinates transport packages are required to perform large, computationally intensive time-dependent calculations on massively parallel architectures, where even a single such calculation may need many months to complete. While KBA methods scale out well to very large numbers of compute nodes, we are limited by practical constraints on the number of such nodes we can actually apply to any given calculation. Instead, we describe a modified KBA algorithm that allows realization of the reductions in solution time offered by both the current, and future, architectural changes within a compute node.
A supportive architecture for CFD-based design optimisation

NASA Astrophysics Data System (ADS)

Li, Ni; Su, Zeya; Bi, Zhuming; Tian, Chao; Ren, Zhiming; Gong, Guanghong

2014-03-01

Multi-disciplinary design optimisation (MDO) is one of critical methodologies to the implementation of enterprise systems (ES). MDO requiring the analysis of fluid dynamics raises a special challenge due to its extremely intensive computation. The rapid development of computational fluid dynamic (CFD) technique has caused a rise of its applications in various fields. Especially for the exterior designs of vehicles, CFD has become one of the three main design tools comparable to analytical approaches and wind tunnel experiments. CFD-based design optimisation is an effective way to achieve the desired performance under the given constraints. However, due to the complexity of CFD, integrating with CFD analysis in an intelligent optimisation algorithm is not straightforward. It is a challenge to solve a CFD-based design problem, which is usually with high dimensions, and multiple objectives and constraints. It is desirable to have an integrated architecture for CFD-based design optimisation. However, our review on existing works has found that very few researchers have studied on the assistive tools to facilitate CFD-based design optimisation. In the paper, a multi-layer architecture and a general procedure are proposed to integrate different CFD toolsets with intelligent optimisation algorithms, parallel computing technique and other techniques for efficient computation. In the proposed architecture, the integration is performed either at the code level or data level to fully utilise the capabilities of different assistive tools. Two intelligent algorithms are developed and embedded with parallel computing. These algorithms, together with the supportive architecture, lay a solid foundation for various applications of CFD-based design optimisation. To illustrate the effectiveness of the proposed architecture and algorithms, the case studies on aerodynamic shape design of a hypersonic cruising vehicle are provided, and the result has shown that the proposed architecture and developed algorithms have performed successfully and efficiently in dealing with the design optimisation with over 200 design variables.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

PubMed

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

PubMed Central

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
Parallel algorithms for mapping pipelined and parallel computations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1988-01-01

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization

NASA Technical Reports Server (NTRS)

Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
A task-based parallelism and vectorized approach to 3D Method of Characteristics (MOC) reactor simulation for high performance computing architectures

NASA Astrophysics Data System (ADS)

Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.

2016-05-01

In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
Implementation theory of distortion-invariant pattern recognition for optical and digital signal processing systems

NASA Astrophysics Data System (ADS)

Lhamon, Michael Earl

A pattern recognition system which uses complex correlation filter banks requires proportionally more computational effort than single-real valued filters. This introduces increased computation burden but also introduces a higher level of parallelism, that common computing platforms fail to identify. As a result, we consider algorithm mapping to both optical and digital processors. For digital implementation, we develop computationally efficient pattern recognition algorithms, referred to as, vector inner product operators that require less computational effort than traditional fast Fourier methods. These algorithms do not need correlation and they map readily onto parallel digital architectures, which imply new architectures for optical processors. These filters exploit circulant-symmetric matrix structures of the training set data representing a variety of distortions. By using the same mathematical basis as with the vector inner product operations, we are able to extend the capabilities of more traditional correlation filtering to what we refer to as "Super Images". These "Super Images" are used to morphologically transform a complicated input scene into a predetermined dot pattern. The orientation of the dot pattern is related to the rotational distortion of the object of interest. The optical implementation of "Super Images" yields feature reduction necessary for using other techniques, such as artificial neural networks. We propose a parallel digital signal processor architecture based on specific pattern recognition algorithms but general enough to be applicable to other similar problems. Such an architecture is classified as a data flow architecture. Instead of mapping an algorithm to an architecture, we propose mapping the DSP architecture to a class of pattern recognition algorithms. Today's optical processing systems have difficulties implementing full complex filter structures. Typically, optical systems (like the 4f correlators) are limited to phase-only implementation with lower detection performance than full complex electronic systems. Our study includes pseudo-random pixel encoding techniques for approximating full complex filtering. Optical filter bank implementation is possible and they have the advantage of time averaging the entire filter bank at real time rates. Time-averaged optical filtering is computational comparable to billions of digital operations-per-second. For this reason, we believe future trends in high speed pattern recognition will involve hybrid architectures of both optical and DSP elements.
Petri net model for analysis of concurrently processed complex algorithms

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1986-01-01

This paper presents a Petri-net model suitable for analyzing the concurrent processing of computationally complex algorithms. The decomposed operations are to be processed in a multiple processor, data driven architecture. Of particular interest is the application of the model to both the description of the data/control flow of a particular algorithm, and to the general specification of the data driven architecture. A candidate architecture is also presented.
Multi-threaded Sparse Matrix Sparse Matrix Multiplication for Many-Core and GPU Architectures.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deveci, Mehmet; Trott, Christian Robert; Rajamanickam, Sivasankaran

Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix- matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depend on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and datamore » structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.« less
Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deveci, Mehmet; Rajamanickam, Sivasankaran; Trott, Christian Robert

Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scienti c computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depend on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and datamore » structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.« less
Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters

PubMed Central

Torres-Huitzil, Cesar

2013-01-01

Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k 2 − 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
The science of computing - Parallel computation

NASA Technical Reports Server (NTRS)

Denning, P. J.

1985-01-01

Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concommitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.
Optical systolic solutions of linear algebraic equations

NASA Technical Reports Server (NTRS)

Neuman, C. P.; Casasent, D.

1984-01-01

The philosophy and data encoding possible in systolic array optical processor (SAOP) were reviewed. The multitude of linear algebraic operations achievable on this architecture is examined. These operations include such linear algebraic algorithms as: matrix-decomposition, direct and indirect solutions, implicit and explicit methods for partial differential equations, eigenvalue and eigenvector calculations, and singular value decomposition. This architecture can be utilized to realize general techniques for solving matrix linear and nonlinear algebraic equations, least mean square error solutions, FIR filters, and nested-loop algorithms for control engineering applications. The data flow and pipelining of operations, design of parallel algorithms and flexible architectures, application of these architectures to computationally intensive physical problems, error source modeling of optical processors, and matching of the computational needs of practical engineering problems to the capabilities of optical processors are emphasized.
Approximation algorithms for planning and control

NASA Technical Reports Server (NTRS)

Boddy, Mark; Dean, Thomas

1989-01-01

A control system operating in a complex environment will encounter a variety of different situations, with varying amounts of time available to respond to critical events. Ideally, such a control system will do the best possible with the time available. In other words, its responses should approximate those that would result from having unlimited time for computation, where the degree of the approximation depends on the amount of time it actually has. There exist approximation algorithms for a wide variety of problems. Unfortunately, the solution to any reasonably complex control problem will require solving several computationally intensive problems. Algorithms for successive approximation are a subclass of the class of anytime algorithms, algorithms that return answers for any amount of computation time, where the answers improve as more time is allotted. An architecture is described for allocating computation time to a set of anytime algorithms, based on expectations regarding the value of the answers they return. The architecture described is quite general, producing optimal schedules for a set of algorithms under widely varying conditions.
Multiprocessor architecture: Synthesis and evaluation

NASA Technical Reports Server (NTRS)

Standley, Hilda M.

1990-01-01

Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Innovative architectures for dense multi-microprocessor computers

NASA Technical Reports Server (NTRS)

Donaldson, Thomas; Doty, Karl; Engle, Steven W.; Larson, Robert E.; O'Reilly, John G.

1988-01-01

The results of a Phase I Small Business Innovative Research (SBIR) project performed for the NASA Langley Computational Structural Mechanics Group are described. The project resulted in the identification of a family of chordal-ring interconnection architectures with excellent potential to serve as the basis for new multimicroprocessor (MMP) computers. The paper presents examples of how computational algorithms from structural mechanics can be efficiently implemented on the chordal-ring architecture.
Parallel language constructs for tensor product computations on loosely coupled architectures

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Vanrosendale, John

1989-01-01

Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
Computer sciences

NASA Technical Reports Server (NTRS)

Smith, Paul H.

1988-01-01

The Computer Science Program provides advanced concepts, techniques, system architectures, algorithms, and software for both space and aeronautics information sciences and computer systems. The overall goal is to provide the technical foundation within NASA for the advancement of computing technology in aerospace applications. The research program is improving the state of knowledge of fundamental aerospace computing principles and advancing computing technology in space applications such as software engineering and information extraction from data collected by scientific instruments in space. The program includes the development of special algorithms and techniques to exploit the computing power provided by high performance parallel processors and special purpose architectures. Research is being conducted in the fundamentals of data base logic and improvement techniques for producing reliable computing systems.

Quantum Computing Architectural Design

NASA Astrophysics Data System (ADS)

West, Jacob; Simms, Geoffrey; Gyure, Mark

2006-03-01

Large scale quantum computers will invariably require scalable architectures in addition to high fidelity gate operations. Quantum computing architectural design (QCAD) addresses the problems of actually implementing fault-tolerant algorithms given physical and architectural constraints beyond those of basic gate-level fidelity. Here we introduce a unified framework for QCAD that enables the scientist to study the impact of varying error correction schemes, architectural parameters including layout and scheduling, and physical operations native to a given architecture. Our software package, aptly named QCAD, provides compilation, manipulation/transformation, multi-paradigm simulation, and visualization tools. We demonstrate various features of the QCAD software package through several examples.
Redundancy management for efficient fault recovery in NASA's distributed computing system

NASA Technical Reports Server (NTRS)

Malek, Miroslaw; Pandya, Mihir; Yau, Kitty

1991-01-01

The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance.
A Space-Time Signal Decomposition Algorithm for Downlink MIMO DS-CDMA Receivers

NASA Astrophysics Data System (ADS)

Wang, Yung-Yi; Fang, Wen-Hsien; Chen, Jiunn-Tsair

We propose a dimension reduction algorithm for the receiver of the downlink of direct-sequence code-division multiple access (DS-CDMA) systems in which both the transmitters and the receivers employ antenna arrays of multiple elements. To estimate the high order channel parameters, we develop a layered architecture using dimension-reduced parameter estimation algorithms to estimate the frequency-selective multipath channels. In the proposed architecture, to exploit the space-time geometric characteristics of multipath channels, spatial beamformers and constrained (or unconstrained) temporal filters are adopted for clustered-multipath grouping and path isolation. In conjunction with the multiple access interference (MAI) suppression techniques, the proposed architecture jointly estimates the direction of arrivals, propagation delays, and fading amplitudes of the downlink fading multipaths. With the outputs of the proposed architecture, the signals of interest can then be naturally detected by using path-wise maximum ratio combining. Compared to the traditional techniques, such as the Joint-Angle-and-Delay-Estimation (JADE) algorithm for DOA-delay joint estimation and the space-time minimum mean square error (ST-MMSE) algorithm for signal detection, computer simulations show that the proposed algorithm substantially mitigate the computational complexity at the expense of only slight performance degradation.
Patterns and Practices for Future Architectures

DTIC Science & Technology

2014-08-01

14. SUBJECT TERMS computing architecture, graph algorithms, high-performance computing, big data , GPU 15. NUMBER OF PAGES 44 16. PRICE CODE 17...at Vertex 1 6 Figure 4: Data Structures Created by Kernel 1 of Single CPU, List Implementation Using the Graph in the Example from Section 1.2 9...Figure 5: Kernel 2 of Graph500 BFS Reference Implementation: Single CPU, List 10 Figure 6: Data Structures for Sequential CSR Algorithm 12 Figure 7
Implementation of an ADI method on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Implementation of an ADI method on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

In this paper the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16-Kbit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally conclusions are presented.
VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures

DOE PAGES

Moreland, Kenneth; Sewell, Christopher; Usher, William; ...

2016-05-09

Here, one of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture.
VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures

DOE PAGES

Moreland, Kenneth; Sewell, Christopher; Usher, William; ...

2016-05-09

Execution on massively threaded processors is one of the most critical challenges for high-performance computing (HPC) scientific visualization. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Moreover, our current production scientific visualization software is not designed for these new types of architectures. In order to address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture.
Exploring Gigabyte Datasets in Real Time: Architectures, Interfaces and Time-Critical Design

NASA Technical Reports Server (NTRS)

Bryson, Steve; Gerald-Yamasaki, Michael (Technical Monitor)

1998-01-01

Architectures and Interfaces: The implications of real-time interaction on software architecture design: decoupling of interaction/graphics and computation into asynchronous processes. The performance requirements of graphics and computation for interaction. Time management in such an architecture. Examples of how visualization algorithms must be modified for high performance. Brief survey of interaction techniques and design, including direct manipulation and manipulation via widgets. talk discusses how human factors considerations drove the design and implementation of the virtual wind tunnel. Time-Critical Design: A survey of time-critical techniques for both computation and rendering. Emphasis on the assignment of a time budget to both the overall visualization environment and to each individual visualization technique in the environment. The estimation of the benefit and cost of an individual technique. Examples of the modification of visualization algorithms to allow time-critical control.
The GOES-R Product Generation Architecture

NASA Astrophysics Data System (ADS)

Dittberner, G. J.; Kalluri, S.; Hansen, D.; Weiner, A.; Tarpley, A.; Marley, S.

2011-12-01

The GOES-R system will substantially improve users' ability to succeed in their work by providing data with significantly enhanced instruments, higher resolution, much shorter relook times, and an increased number and diversity of products. The Product Generation architecture is designed to provide the computer and memory resources necessary to achieve the necessary latency and availability for these products. Over time, new and updated algorithms are expected to be added and old ones removed as science advances and new products are developed. The GOES-R GS architecture is being planned to maintain functionality so that when such changes are implemented, operational product generation will continue without interruption. The primary parts of the PG infrastructure are the Service Based Architecture (SBA) and the Data Fabric (DF). SBA is the middleware that encapsulates and manages science algorithms that generate products. It is divided into three parts, the Executive, which manages and configures the algorithm as a service, the Dispatcher, which provides data to the algorithm, and the Strategy, which determines when the algorithm can execute with the available data. SBA is a distributed architecture, with services connected to each other over a compute grid and is highly scalable. This plug-and-play architecture allows algorithms to be added, removed, or updated without affecting any other services or software currently running and producing data. Algorithms require product data from other algorithms, so a scalable and reliable messaging is necessary. The SBA uses the DF to provide this data communication layer between algorithms. The DF provides an abstract interface over a distributed and persistent multi-layered storage system (e.g., memory based caching above disk-based storage) and an event management system that allows event-driven algorithm services to know when instrument data are available and where they reside. Together, the SBA and the DF provide a flexible, high performance architecture that can meet the needs of product processing now and as they grow in the future.
Developing Information Power Grid Based Algorithms and Software

NASA Technical Reports Server (NTRS)

Dongarra, Jack

1998-01-01

This exploratory study initiated our effort to understand performance modeling on parallel systems. The basic goal of performance modeling is to understand and predict the performance of a computer program or set of programs on a computer system. Performance modeling has numerous applications, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems. Our work lays the basis for the construction of parallel libraries that allow for the reconstruction of application codes on several distinct architectures so as to assure performance portability. Following our strategy, once the requirements of applications are well understood, one can then construct a library in a layered fashion. The top level of this library will consist of architecture-independent geometric, numerical, and symbolic algorithms that are needed by the sample of applications. These routines should be written in a language that is portable across the targeted architectures.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.; Som, Sukhamony

1990-01-01

The performance modeling and enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures is examined. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called ATAMM (Algorithm To Architecture Mapping Model). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Som, Sukhamoy; Stoughton, John W.; Mielke, Roland R.

1990-01-01

Performance modeling and performance enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures are discussed. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called algorithm to architecture mapping model (ATAMM). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
FPGA Implementation of Generalized Hebbian Algorithm for Texture Classification

PubMed Central

Lin, Shiow-Jyu; Hwang, Wen-Jyi; Lee, Wei-Hao

2012-01-01

This paper presents a novel hardware architecture for principal component analysis. The architecture is based on the Generalized Hebbian Algorithm (GHA) because of its simplicity and effectiveness. The architecture is separated into three portions: the weight vector updating unit, the principal computation unit and the memory unit. In the weight vector updating unit, the computation of different synaptic weight vectors shares the same circuit for reducing the area costs. To show the effectiveness of the circuit, a texture classification system based on the proposed architecture is physically implemented by Field Programmable Gate Array (FPGA). It is embedded in a System-On-Programmable-Chip (SOPC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient design for attaining both high speed performance and low area costs. PMID:22778640
Applications of an architecture design and assessment system (ADAS)

NASA Technical Reports Server (NTRS)

Gray, F. Gail; Debrunner, Linda S.; White, Tennis S.

1988-01-01

A new Architecture Design and Assessment System (ADAS) tool package is introduced, and a range of possible applications is illustrated. ADAS was used to evaluate the performance of an advanced fault-tolerant computer architecture in a modern flight control application. Bottlenecks were identified and possible solutions suggested. The tool was also used to inject faults into the architecture and evaluate the synchronization algorithm, and improvements are suggested. Finally, ADAS was used as a front end research tool to aid in the design of reconfiguration algorithms in a distributed array architecture.
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization

PubMed Central

Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan

2017-01-01

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization.

PubMed

Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan

2017-08-04

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Parallel processing architecture for computing inverse differential kinematic equations of the PUMA arm

NASA Technical Reports Server (NTRS)

Hsia, T. C.; Lu, G. Z.; Han, W. H.

1987-01-01

In advanced robot control problems, on-line computation of inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be inplemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.
FPGA-based real-time phase measuring profilometry algorithm design and implementation

NASA Astrophysics Data System (ADS)

Zhan, Guomin; Tang, Hongwei; Zhong, Kai; Li, Zhongwei; Shi, Yusheng

2016-11-01

Phase measuring profilometry (PMP) has been widely used in many fields, like Computer Aided Verification (CAV), Flexible Manufacturing System (FMS) et al. High frame-rate (HFR) real-time vision-based feedback control will be a common demands in near future. However, the instruction time delay in the computer caused by numerous repetitive operations greatly limit the efficiency of data processing. FPGA has the advantages of pipeline architecture and parallel execution, and it fit for handling PMP algorithm. In this paper, we design a fully pipelined hardware architecture for PMP. The functions of hardware architecture includes rectification, phase calculation, phase shifting, and stereo matching. The experiment verified the performance of this method, and the factors that may influence the computation accuracy was analyzed.
Satellite on-board processing for earth resources data

NASA Technical Reports Server (NTRS)

Bodenheimer, R. E.; Gonzalez, R. C.; Gupta, J. N.; Hwang, K.; Rochelle, R. W.; Wilson, J. B.; Wintz, P. A.

1975-01-01

Results of a survey of earth resources user applications and their data requirements, earth resources multispectral scanner sensor technology, and preprocessing algorithms for correcting the sensor outputs and for data bulk reduction are presented along with a candidate data format. Computational requirements required to implement the data analysis algorithms are included along with a review of computer architectures and organizations. Computer architectures capable of handling the algorithm computational requirements are suggested and the environmental effects of an on-board processor discussed. By relating performance parameters to the system requirements of each of the user requirements the feasibility of on-board processing is determined for each user. A tradeoff analysis is performed to determine the sensitivity of results to each of the system parameters. Significant results and conclusions are discussed, and recommendations are presented.

Advanced Architectures for Astrophysical Supercomputing

NASA Astrophysics Data System (ADS)

Barsdell, B. R.; Barnes, D. G.; Fluke, C. J.

2010-12-01

Astronomers have come to rely on the increasing performance of computers to reduce, analyze, simulate and visualize their data. In this environment, faster computation can mean more science outcomes or the opening up of new parameter spaces for investigation. If we are to avoid major issues when implementing codes on advanced architectures, it is important that we have a solid understanding of our algorithms. A recent addition to the high-performance computing scene that highlights this point is the graphics processing unit (GPU). The hardware originally designed for speeding-up graphics rendering in video games is now achieving speed-ups of O(100×) in general-purpose computation - performance that cannot be ignored. We are using a generalized approach, based on the analysis of astronomy algorithms, to identify the optimal problem-types and techniques for taking advantage of both current GPU hardware and future developments in computing architectures.
GPU-completeness: theory and implications

NASA Astrophysics Data System (ADS)

Lin, I.-Jong

2011-01-01

This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate his algorithmic space and imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HPdelivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation inform the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrate the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a welldefined subset of algorithms, GPU-Completeness is intended to connect the parallelism, algorithms and efficient architectures into a unified framework to show that multiple layers of parallel implementation are guided by the same underlying trade-off.
Algorithm To Architecture Mapping Model (ATAMM) multicomputer operating system functional specification

NASA Technical Reports Server (NTRS)

Mielke, R.; Stoughton, J.; Som, S.; Obando, R.; Malekpour, M.; Mandala, B.

1990-01-01

A functional description of the ATAMM Multicomputer Operating System is presented. ATAMM (Algorithm to Architecture Mapping Model) is a marked graph model which describes the implementation of large grained, decomposed algorithms on data flow architectures. AMOS, the ATAMM Multicomputer Operating System, is an operating system which implements the ATAMM rules. A first generation version of AMOS which was developed for the Advanced Development Module (ADM) is described. A second generation version of AMOS being developed for the Generic VHSIC Spaceborne Computer (GVSC) is also presented.
Parallel Computing Strategies for Irregular Algorithms

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

2002-01-01

Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Algorithms and software used in selecting structure of machine-training cluster based on neurocomputers

NASA Astrophysics Data System (ADS)

Romanchuk, V. A.; Lukashenko, V. V.

2018-05-01

The technique of functioning of a control system by a computing cluster based on neurocomputers is proposed. Particular attention is paid to the method of choosing the structure of the computing cluster due to the fact that the existing methods are not effective because of a specialized hardware base - neurocomputers, which are highly parallel computer devices with an architecture different from the von Neumann architecture. A developed algorithm for choosing the computational structure of a cloud cluster is described, starting from the direction of data transfer in the flow control graph of the program and its adjacency matrix.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1988-01-01

Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
Fast semivariogram computation using FPGA architectures

NASA Astrophysics Data System (ADS)

Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang

2015-02-01

The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. The semivariogram is a plot of semivariances for different lag distances between pixels. A semi-variance, γ(h), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O(n2). Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating in the low range of power. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. Anisotropic semivariogram implementation is anticipated to be an extension of the current architecture, ostensibly based on refinements to the current modules. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from MRI scans are utilized for the experiments. Computational speedup is measured with respect to Matlab implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant advantage in speed can be attained by the architectures, making the algorithm viable for implementation in medical devices
Scaling Watershed Models: Modern Approaches to Science Computation with MapReduce, Parallelization, and Cloud Optimization

EPA Science Inventory

Environmental models are products of the computer architecture and software tools available at the time of development. Scientifically sound algorithms may persist in their original state even as system architectures and software development approaches evolve and progress. Dating...
Reverse time migration: A seismic processing application on the connection machine

NASA Technical Reports Server (NTRS)

Fiebrich, Rolf-Dieter

1987-01-01

The implementation of a reverse time migration algorithm on the Connection Machine, a massively parallel computer is described. Essential architectural features of this machine as well as programming concepts are presented. The data structures and parallel operations for the implementation of the reverse time migration algorithm are described. The algorithm matches the Connection Machine architecture closely and executes almost at the peak performance of this machine.
Improving Quantum Gate Simulation using a GPU

NASA Astrophysics Data System (ADS)

Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.

2008-11-01

Due to the increasing computing power of the graphics processing units (GPU), they are becoming more and more popular when solving general purpose algorithms. As the simulation of quantum computers results on a problem with exponential complexity, it is advisable to perform a parallel computation, such as the one provided by the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QTF), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.
The GOES-R Product Generation Architecture - Post CDR Update

NASA Astrophysics Data System (ADS)

Dittberner, G.; Kalluri, S.; Weiner, A.

2012-12-01

The GOES-R system will substantially improve the accuracy of information available to users by providing data from significantly enhanced instruments, which will generate an increased number and diversity of products with higher resolution, and much shorter relook times. Considerably greater compute and memory resources are necessary to achieve the necessary latency and availability for these products. Over time, new and updated algorithms are expected to be added and old ones removed as science advances and new products are developed. The GOES-R GS architecture is being planned to maintain functionality so that when such changes are implemented, operational product generation will continue without interruption. The primary parts of the PG infrastructure are the Service Based Architecture (SBA) and the Data Fabric (DF). SBA is the middleware that encapsulates and manages science algorithms that generate products. It is divided into three parts, the Executive, which manages and configures the algorithm as a service, the Dispatcher, which provides data to the algorithm, and the Strategy, which determines when the algorithm can execute with the available data. SBA is a distributed architecture, with services connected to each other over a compute grid and is highly scalable. This plug-and-play architecture allows algorithms to be added, removed, or updated without affecting any other services or software currently running and producing data. Algorithms require product data from other algorithms, so a scalable and reliable messaging is necessary. The SBA uses the DF to provide this data communication layer between algorithms. The DF provides an abstract interface over a distributed and persistent multi-layered storage system (e.g., memory based caching above disk-based storage) and an event management system that allows event-driven algorithm services to know when instrument data are available and where they reside. Together, the SBA and the DF provide a flexible, high performance architecture that can meet the needs of product processing now and as they grow in the future.
FFT Computation with Systolic Arrays, A New Architecture

NASA Technical Reports Server (NTRS)

Boriakoff, Valentin

1994-01-01

The use of the Cooley-Tukey algorithm for computing the l-d FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
Performance prediction: A case study using a multi-ring KSR-1 machine

NASA Technical Reports Server (NTRS)

Sun, Xian-He; Zhu, Jianping

1995-01-01

While computers with tens of thousands of processors have successfully delivered high performance power for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, heirarchical architecture can be predicted and the best algorithm-machine combination can be selected for a given application.
Algorithm architecture co-design for ultra low-power image sensor

NASA Astrophysics Data System (ADS)

Laforest, T.; Dupret, A.; Verdant, A.; Lattard, D.; Villard, P.

2012-03-01

In a context of embedded video surveillance, stand alone leftbehind image sensors are used to detect events with high level of confidence, but also with a very low power consumption. Using a steady camera, motion detection algorithms based on background estimation to find regions in movement are simple to implement and computationally efficient. To reduce power consumption, the background is estimated using a down sampled image formed of macropixels. In order to extend the class of moving objects to be detected, we propose an original mixed mode architecture developed thanks to an algorithm architecture co-design methodology. This programmable architecture is composed of a vector of SIMD processors. A basic RISC architecture was optimized in order to implement motion detection algorithms with a dedicated set of 42 instructions. Definition of delta modulation as a calculation primitive has allowed to implement algorithms in a very compact way. Thereby, a 1920x1080@25fps CMOS image sensor performing integrated motion detection is proposed with a power estimation of 1.8 mW.
Multiplier Architecture for Coding Circuits

NASA Technical Reports Server (NTRS)

Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.

1986-01-01

Multipliers based on new algorithm for Galois-field (GF) arithmetic regular and expandable. Pipeline structures used for computing both multiplications and inverses. Designs suitable for implementation in very-large-scale integrated (VLSI) circuits. This general type of inverter and multiplier architecture especially useful in performing finite-field arithmetic of Reed-Solomon error-correcting codes and of some cryptographic algorithms.
A heterogeneous hierarchical architecture for real-time computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Skroch, D.A.; Fornaro, R.J.

The need for high-speed data acquisition and control algorithms has prompted continued research in the area of multiprocessor systems and related programming techniques. The result presented here is a unique hardware and software architecture for high-speed real-time computer systems. The implementation of a prototype of this architecture has required the integration of architecture, operating systems and programming languages into a cohesive unit. This report describes a Heterogeneous Hierarchial Architecture for Real-Time (H{sup 2} ART) and system software for program loading and interprocessor communication.
A single VLSI chip for computing syndromes in the (225, 223) Reed-Solomon decoder

NASA Technical Reports Server (NTRS)

Hsu, I. S.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.

1986-01-01

A description of a single VLSI chip for computing syndromes in the (255, 223) Reed-Solomon decoder is presented. The architecture that leads to this single VLSI chip design makes use of the dual basis multiplication algorithm. The same architecture can be applied to design VLSI chips to compute various kinds of number theoretic transforms.
Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

PubMed

Bhandarkar, S M; Chirravuri, S; Arnold, J

1996-01-01

Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.
VLSI architectures for computing multiplications and inverses in GF(2m)

NASA Technical Reports Server (NTRS)

Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.

1985-01-01

Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
VLSI architectures for computing multiplications and inverses in GF(2-m)

NASA Technical Reports Server (NTRS)

Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.; Reed, I. S.

1983-01-01

Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.

VLSI architectures for computing multiplications and inverses in GF(2m).

PubMed

Wang, C C; Truong, T K; Shao, H M; Deutsch, L J; Omura, J K; Reed, I S

1985-08-01

Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that can be easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. In this paper, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal basis representation used together with this multiplier, a pipeline architecture is developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable, and therefore, naturally suitable for VLSI implementation.
Hybrid Architectures for Evolutionary Computing Algorithms

DTIC Science & Technology

2008-01-01

other EC algorithms to FPGA Core Burns P1026/MAPLD 200532 Genetic Algorithm Hardware References S. Scott, A. Samal , and S. Seth, “HGA: A Hardware Based...on Parallel and Distributed Processing (IPPS/SPDP 󈨦), pp. 316-320, Proceedings. IEEE Computer Society 1998. [12] Scott, S. D. , Samal , A., and...Algorithm Hardware References S. Scott, A. Samal , and S. Seth, “HGA: A Hardware Based Genetic Algorithm”, Proceedings of the 1995 ACM Third
DANoC: An Efficient Algorithm and Hardware Codesign of Deep Neural Networks on Chip.

PubMed

Zhou, Xichuan; Li, Shengli; Tang, Fang; Hu, Shengdong; Lin, Zhi; Zhang, Lei

2017-07-18

Deep neural networks (NNs) are the state-of-the-art models for understanding the content of images and videos. However, implementing deep NNs in embedded systems is a challenging task, e.g., a typical deep belief network could exhaust gigabytes of memory and result in bandwidth and computational bottlenecks. To address this challenge, this paper presents an algorithm and hardware codesign for efficient deep neural computation. A hardware-oriented deep learning algorithm, named the deep adaptive network, is proposed to explore the sparsity of neural connections. By adaptively removing the majority of neural connections and robustly representing the reserved connections using binary integers, the proposed algorithm could save up to 99.9% memory utility and computational resources without undermining classification accuracy. An efficient sparse-mapping-memory-based hardware architecture is proposed to fully take advantage of the algorithmic optimization. Different from traditional Von Neumann architecture, the deep-adaptive network on chip (DANoC) brings communication and computation in close proximity to avoid power-hungry parameter transfers between on-board memory and on-chip computational units. Experiments over different image classification benchmarks show that the DANoC system achieves competitively high accuracy and efficiency comparing with the state-of-the-art approaches.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth; Sewell, Christopher; Usher, William

Here, one of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth; Sewell, Christopher; Usher, William

Execution on massively threaded processors is one of the most critical challenges for high-performance computing (HPC) scientific visualization. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Moreover, our current production scientific visualization software is not designed for these new types of architectures. In order to address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture.
Scalable service architecture for providing strong service guarantees

NASA Astrophysics Data System (ADS)

Christin, Nicolas; Liebeherr, Joerg

2002-07-01

For the past decade, a lot of Internet research has been devoted to providing different levels of service to applications. Initial proposals for service differentiation provided strong service guarantees, with strict bounds on delays, loss rates, and throughput, but required high overhead in terms of computational complexity and memory, both of which raise scalability concerns. Recently, the interest has shifted to service architectures with low overhead. However, these newer service architectures only provide weak service guarantees, which do not always address the needs of applications. In this paper, we describe a service architecture that supports strong service guarantees, can be implemented with low computational complexity, and only requires to maintain little state information. A key mechanism of the proposed service architecture is that it addresses scheduling and buffer management in a single algorithm. The presented architecture offers no solution for controlling the amount of traffic that enters the network. Instead, we plan on exploiting feedback mechanisms of TCP congestion control algorithms for the purpose of regulating the traffic entering the network.
Real-time dynamic simulation of the Cassini spacecraft using DARTS. Part 2: Parallel/vectorized real-time implementation

NASA Technical Reports Server (NTRS)

Fijany, A.; Roberts, J. A.; Jain, A.; Man, G. K.

1993-01-01

Part 1 of this paper presented the requirements for the real-time simulation of Cassini spacecraft along with some discussion of the DARTS algorithm. Here, in Part 2 we discuss the development and implementation of parallel/vectorized DARTS algorithm and architecture for real-time simulation. Development of the fast algorithms and architecture for real-time hardware-in-the-loop simulation of spacecraft dynamics is motivated by the fact that it represents a hard real-time problem, in the sense that the correctness of the simulation depends on both the numerical accuracy and the exact timing of the computation. For a given model fidelity, the computation should be computed within a predefined time period. Further reduction in computation time allows increasing the fidelity of the model (i.e., inclusion of more flexible modes) and the integration routine.
ICASE Computer Science Program

NASA Technical Reports Server (NTRS)

1985-01-01

The Institute for Computer Applications in Science and Engineering computer science program is discussed in outline form. Information is given on such topics as problem decomposition, algorithm development, programming languages, and parallel architectures.
Super and parallel computers and their impact on civil engineering

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kamat, M.P.

1986-01-01

This book presents the papers given at a conference on the use of supercomputers in civil engineering. Topics considered at the conference included solving nonlinear equations on a hypercube, a custom architectured parallel processing system, distributed data processing, algorithms, computer architecture, parallel processing, vector processing, computerized simulation, and cost benefit analysis.
Massively Parallel Solution of Poisson Equation on Coarse Grain MIMD Architectures

NASA Technical Reports Server (NTRS)

Fijany, A.; Weinberger, D.; Roosta, R.; Gulati, S.

1998-01-01

In this paper a new algorithm, designated as Fast Invariant Imbedding algorithm, for solution of Poisson equation on vector and massively parallel MIMD architectures is presented. This algorithm achieves the same optimal computational efficiency as other Fast Poisson solvers while offering a much better structure for vector and parallel implementation. Our implementation on the Intel Delta and Paragon shows that a speedup of over two orders of magnitude can be achieved even for moderate size problems.
The DANTE Boltzmann transport solver: An unstructured mesh, 3-D, spherical harmonics algorithm compatible with parallel computer architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

McGhee, J.M.; Roberts, R.M.; Morel, J.E.

1997-06-01

A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner formore » scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.« less
Efficient architecture for spike sorting in reconfigurable hardware.

PubMed

Hwang, Wen-Jyi; Lee, Wei-Hao; Lin, Shiow-Jyu; Lai, Sheng-Ying

2013-11-01

This paper presents a novel hardware architecture for fast spike sorting. The architecture is able to perform both the feature extraction and clustering in hardware. The generalized Hebbian algorithm (GHA) and fuzzy C-means (FCM) algorithm are used for feature extraction and clustering, respectively. The employment of GHA allows efficient computation of principal components for subsequent clustering operations. The FCM is able to achieve near optimal clustering for spike sorting. Its performance is insensitive to the selection of initial cluster centers. The hardware implementations of GHA and FCM feature low area costs and high throughput. In the GHA architecture, the computation of different weight vectors share the same circuit for lowering the area costs. Moreover, in the FCM hardware implementation, the usual iterative operations for updating the membership matrix and cluster centroid are merged into one single updating process to evade the large storage requirement. To show the effectiveness of the circuit, the proposed architecture is physically implemented by field programmable gate array (FPGA). It is embedded in a System-on-Chip (SOC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient spike sorting design for attaining high classification correct rate and high speed computation.
The flight telerobotic servicer: From functional architecture to computer architecture

NASA Technical Reports Server (NTRS)

Lumia, Ronald; Fiala, John

1989-01-01

After a brief tutorial on the NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) functional architecture, the approach to its implementation is shown. First, interfaces must be defined which are capable of supporting the known algorithms. This is illustrated by considering the interfaces required for the SERVO level of the NASREM functional architecture. After interface definition, the specific computer architecture for the implementation must be determined. This choice is obviously technology dependent. An example illustrating one possible mapping of the NASREM functional architecture to a particular set of computers which implements it is shown. The result of choosing the NASREM functional architecture is that it provides a technology independent paradigm which can be mapped into a technology dependent implementation capable of evolving with technology in the laboratory and in space.
DFT algorithms for bit-serial GaAs array processor architectures

NASA Technical Reports Server (NTRS)

Mcmillan, Gary B.

1988-01-01

Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Computational structures for robotic computations

NASA Technical Reports Server (NTRS)

Lee, C. S. G.; Chang, P. R.

1987-01-01

The computational problem of inverse kinematics and inverse dynamics of robot manipulators by taking advantage of parallelism and pipelining architectures is discussed. For the computation of inverse kinematic position solution, a maximum pipelined CORDIC architecture has been designed based on a functional decomposition of the closed-form joint equations. For the inverse dynamics computation, an efficient p-fold parallel algorithm to overcome the recurrence problem of the Newton-Euler equations of motion to achieve the time lower bound of O(log sub 2 n) has also been developed.
MIT CSAIL and Lincoln Laboratory Task Force Report

DTIC Science & Technology

2016-08-01

projects have been very diverse, spanning several areas of CSAIL concentration, including robotics, big data analytics , wireless communications...spanning several areas of CSAIL concentration, including robotics, big data analytics , wireless communications, computing architectures and...to machine learning systems and algorithms, such as recommender systems, and “Big Data ” analytics . Advanced computing architectures broadly refer to
A Model for Minimizing Numeric Function Generator Complexity and Delay

DTIC Science & Technology

2007-12-01

allow computation of difficult mathematical functions in less time and with less hardware than commonly employed methods. They compute piecewise...Programmable Gate Arrays (FPGAs). The algorithms and estimation techniques apply to various NFG architectures and mathematical functions. This...thesis compares hardware utilization and propagation delay for various NFG architectures, mathematical functions, word widths, and segmentation methods
Optical Computing Based on Neuronal Models

DTIC Science & Technology

1988-05-01

walking, and cognition are far too complex for existing sequential digital computers. Therefore new architectures, hardware, and algorithms modeled...collective behavior, and iterative processing into optical processing and artificial neurodynamical systems. Another intriguing promise of neural nets is...with architectures, implementations, and programming; and material research s -7- called for. Our future research in neurodynamics will continue to
The 2nd Symposium on the Frontiers of Massively Parallel Computations

NASA Technical Reports Server (NTRS)

Mills, Ronnie (Editor)

1988-01-01

Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.
Cloud Computing for Mission Design and Operations

NASA Technical Reports Server (NTRS)

Arrieta, Juan; Attiyah, Amy; Beswick, Robert; Gerasimantos, Dimitrios

2012-01-01

The space mission design and operations community already recognizes the value of cloud computing and virtualization. However, natural and valid concerns, like security, privacy, up-time, and vendor lock-in, have prevented a more widespread and expedited adoption into official workflows. In the interest of alleviating these concerns, we propose a series of guidelines for internally deploying a resource-oriented hub of data and algorithms. These guidelines provide a roadmap for implementing an architecture inspired in the cloud computing model: associative, elastic, semantical, interconnected, and adaptive. The architecture can be summarized as exposing data and algorithms as resource-oriented Web services, coordinated via messaging, and running on virtual machines; it is simple, and based on widely adopted standards, protocols, and tools. The architecture may help reduce common sources of complexity intrinsic to data-driven, collaborative interactions and, most importantly, it may provide the means for teams and agencies to evaluate the cloud computing model in their specific context, with minimal infrastructure changes, and before committing to a specific cloud services provider.

Efficient Phase Unwrapping Architecture for Digital Holographic Microscopy

PubMed Central

Hwang, Wen-Jyi; Cheng, Shih-Chang; Cheng, Chau-Jern

2011-01-01

This paper presents a novel phase unwrapping architecture for accelerating the computational speed of digital holographic microscopy (DHM). A fast Fourier transform (FFT) based phase unwrapping algorithm providing a minimum squared error solution is adopted for hardware implementation because of its simplicity and robustness to noise. The proposed architecture is realized in a pipeline fashion to maximize throughput of the computation. Moreover, the number of hardware multipliers and dividers are minimized to reduce the hardware costs. The proposed architecture is used as a custom user logic in a system on programmable chip (SOPC) for physical performance measurement. Experimental results reveal that the proposed architecture is effective for expediting the computational speed while consuming low hardware resources for designing an embedded DHM system. PMID:22163688
Rational use of cognitive resources: levels of analysis between the computational and the algorithmic.

PubMed

Griffiths, Thomas L; Lieder, Falk; Goodman, Noah D

2015-04-01

Marr's levels of analysis-computational, algorithmic, and implementation-have served cognitive science well over the last 30 years. But the recent increase in the popularity of the computational level raises a new challenge: How do we begin to relate models at different levels of analysis? We propose that it is possible to define levels of analysis that lie between the computational and the algorithmic, providing a way to build a bridge between computational- and algorithmic-level models. The key idea is to push the notion of rationality, often used in defining computational-level models, deeper toward the algorithmic level. We offer a simple recipe for reverse-engineering the mind's cognitive strategies by deriving optimal algorithms for a series of increasingly more realistic abstract computational architectures, which we call "resource-rational analysis." Copyright © 2015 Cognitive Science Society, Inc.
Digital optical computers at the optoelectronic computing systems center

NASA Technical Reports Server (NTRS)

Jordan, Harry F.

1991-01-01

The Digital Optical Computing Program within the National Science Foundation Engineering Research Center for Opto-electronic Computing Systems has as its specific goal research on optical computing architectures suitable for use at the highest possible speeds. The program can be targeted toward exploiting the time domain because other programs in the Center are pursuing research on parallel optical systems, exploiting optical interconnection and optical devices and materials. Using a general purpose computing architecture as the focus, we are developing design techniques, tools and architecture for operation at the speed of light limit. Experimental work is being done with the somewhat low speed components currently available but with architectures which will scale up in speed as faster devices are developed. The design algorithms and tools developed for a general purpose, stored program computer are being applied to other systems such as optimally controlled optical communication networks.
A Programmable Five Qubit Quantum Computer Using Trapped Atomic Ions

NASA Astrophysics Data System (ADS)

Debnath, Shantanu

Quantum computers can solve certain problems more efficiently compared to conventional classical methods. In the endeavor to build a quantum computer, several competing platforms have emerged that can implement certain quantum algorithms using a few qubits. However, the demonstrations so far have been done usually by tailoring the hardware to meet the requirements of a particular algorithm implemented for a limited number of instances. Although such proof of principal implementations are important to verify the working of algorithms on a physical system, they further need to have the potential to serve as a general purpose quantum computer allowing the flexibility required for running multiple algorithms and be scaled up to host more qubits. Here we demonstrate a small programmable quantum computer based on five trapped atomic ions each of which serves as a qubit. By optically resolving each ion we can individually address them in order to perform a complete set of single-qubit and fully connected two-qubit quantum gates and alsoperform efficient individual qubit measurements. We implement a computation architecture that accepts an algorithm from a user interface in the form of a standard logic gate sequence and decomposes it into fundamental quantum operations that are native to the hardware using a set of compilation instructions that are defined within the software. These operations are then effected through a pattern of laser pulses that perform coherent rotations on targeted qubits in the chain. The architecture implemented in the experiment therefore gives us unprecedented flexibility in the programming of any quantum algorithm while staying blind to the underlying hardware. As a demonstration we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms on the five-qubit processor and achieve average success rates of 95 and 90 percent, respectively. We also implement a five-qubit coherent quantum Fourier transform and examine its performance in the period finding and phase estimation protocol. We find fidelities of 84 and 62 percent, respectively. While maintaining the same computation architecture the system can be scaled to more ions using resources that scale favorably (O(N. 2)) with the numberof qubits N.
FPGA design for constrained energy minimization

NASA Astrophysics Data System (ADS)

Wang, Jianwei; Chang, Chein-I.; Cao, Mang

2004-02-01

The Constrained Energy Minimization (CEM) has been widely used for hyperspectral detection and classification. The feasibility of implementing the CEM as a real-time processing algorithm in systolic arrays has been also demonstrated. The main challenge of realizing the CEM in hardware architecture in the computation of the inverse of the data correlation matrix performed in the CEM, which requires a complete set of data samples. In order to cope with this problem, the data correlation matrix must be calculated in a causal manner which only needs data samples up to the sample at the time it is processed. This paper presents a Field Programmable Gate Arrays (FPGA) design of such a causal CEM. The main feature of the proposed FPGA design is to use the Coordinate Rotation DIgital Computer (CORDIC) algorithm that can convert a Givens rotation of a vector to a set of shift-add operations. As a result, the CORDIC algorithm can be easily implemented in hardware architecture, therefore in FPGA. Since the computation of the inverse of the data correlction involves a series of Givens rotations, the utility of the CORDIC algorithm allows the causal CEM to perform real-time processing in FPGA. In this paper, an FPGA implementation of the causal CEM will be studied and its detailed architecture will be also described.
PNNLs Data Intensive Computing research battles Homeland Security threats

ScienceCinema

David Thurman; Joe Kielman; Katherine Wolf; David Atkinson

2018-05-11

The Pacific Northwest National Laboratorys (PNNL's) approach to data intensive computing (DIC) is focused on three key research areas: hybrid hardware architecture, software architectures, and analytic algorithms. Advancements in these areas will help to address, and solve, DIC issues associated with capturing, managing, analyzing and understanding, in near real time, data at volumes and rates that push the frontiers of current technologies.
PNNL pushing scientific discovery through data intensive computing breakthroughs

ScienceCinema

Deborah Gracio; David Koppenaal; Ruby Leung

2018-05-18

The Pacific Northwest National Laboratory's approach to data intensive computing (DIC) is focused on three key research areas: hybrid hardware architectures, software architectures, and analytic algorithms. Advancements in these areas will help to address, and solve, DIC issues associated with capturing, managing, analyzing and understanding, in near real time, data at volumes and rates that push the frontiers of current technologies.
Irregular Applications: Architectures & Algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Feo, John T.; Villa, Oreste; Tumeo, Antonino

Irregular applications are characterized by irregular data structures, control and communication patterns. Novel irregular high performance applications which deal with large data sets and require have recently appeared. Unfortunately, current high performance systems and software infrastructures executes irregular algorithms poorly. Only coordinated efforts by end user, area specialists and computer scientists that consider both the architecture and the software stack may be able to provide solutions to the challenges of modern irregular applications.
Determining physiological cross-sectional area of extensor carpi radialis longus and brevis as a whole and by regions using 3D computer muscle models created from digitized fiber bundle data.

PubMed

Ravichandiran, Kajeandra; Ravichandiran, Mayoorendra; Oliver, Michele L; Singh, Karan S; McKee, Nancy H; Agur, Anne M R

2009-09-01

Architectural parameters and physiological cross-sectional area (PCSA) are important determinants of muscle function. Extensor carpi radialis longus (ECRL) and brevis (ECRB) are used in muscle transfers; however, their regional architectural differences have not been investigated. The aim of this study is to develop computational algorithms to quantify and compare architectural parameters (fiber bundle length, pennation angle, and volume) and PCSA of ECRL and ECRB. Fiber bundles distributed throughout the volume of ECRL (75+/-20) and ECRB (110+/-30) were digitized in eight formalin embalmed cadaveric specimens. The digitized data was reconstructed in Autodesk Maya with computational algorithms implemented in Python. The mean PCSA and fiber bundle length were significantly different between ECRL and ECRB (p < or = 0.05). Superficial ECRL had significantly longer fiber bundle length than the deep region, whereas the PCSA of superficial ECRB was significantly larger than the deep region. The regional quantification of architectural parameters and PCSA provides a framework for the exploration of partial tendon transfers of ECRL and ECRB.
Report from the MPP Working Group to the NASA Associate Administrator for Space Science and Applications

NASA Technical Reports Server (NTRS)

Fischer, James R.; Grosch, Chester; Mcanulty, Michael; Odonnell, John; Storey, Owen

1987-01-01

NASA's Office of Space Science and Applications (OSSA) gave a select group of scientists the opportunity to test and implement their computational algorithms on the Massively Parallel Processor (MPP) located at Goddard Space Flight Center, beginning in late 1985. One year later, the Working Group presented its report, which addressed the following: algorithms, programming languages, architecture, programming environments, the way theory relates, and performance measured. The findings point to a number of demonstrated computational techniques for which the MPP architecture is ideally suited. For example, besides executing much faster on the MPP than on conventional computers, systolic VLSI simulation (where distances are short), lattice simulation, neural network simulation, and image problems were found to be easier to program on the MPP's architecture than on a CYBER 205 or even a VAX. The report also makes technical recommendations covering all aspects of MPP use, and recommendations concerning the future of the MPP and machines based on similar architectures, expansion of the Working Group, and study of the role of future parallel processors for space station, EOS, and the Great Observatories era.
Marc Henry de Frahan | NREL

Science.gov Websites

Computing Project, Marc develops high-fidelity turbulence models to enhance simulation accuracy and efficient numerical algorithms for future high performance computing hardware architectures. Research Interests High performance computing High order numerical methods for computational fluid dynamics Fluid
Parallel language constructs for tensor product computations on loosely coupled architectures

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Van Rosendale, John

1989-01-01

A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The authors focus on tensor product array computations, a simple but important class of numerical algorithms. They consider first the problem of programming one-dimensional kernel routines, such as parallel tridiagonal solvers, and then look at how such parallel kernels can be combined to form parallel tensor product algorithms.
Effects of Computer Architecture on FFT (Fast Fourier Transform) Algorithm Performance.

DTIC Science & Technology

1983-12-01

Criteria for Efficient Implementation of FFT Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 107-109, Feb...1982. Burrus, C. S. and P. W. Eschenbacher. "An In-Place, In-Order Prime Factor FFT Algorithm," IEEE Transactions on Acoustics, Speech, and Signal... Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 217-226, Apr. 1982. Control Data Corporation. CDC Cyber 170 Computer Systems
Shared Memory Parallelization of an Implicit ADI-type CFD Code

NASA Technical Reports Server (NTRS)

Hauser, Th.; Huang, P. G.

1999-01-01

A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.
Scalable Parallel Density-based Clustering and Applications

NASA Astrophysics Data System (ADS)

Patwary, Mostofa Ali

2014-04-01

Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
Fast underdetermined BSS architecture design methodology for real time applications.

PubMed

Mopuri, Suresh; Reddy, P Sreenivasa; Acharyya, Amit; Naik, Ganesh R

2015-01-01

In this paper, we propose a high speed architecture design methodology for the Under-determined Blind Source Separation (UBSS) algorithm using our recently proposed high speed Discrete Hilbert Transform (DHT) targeting real time applications. In UBSS algorithm, unlike the typical BSS, the number of sensors are less than the number of the sources, which is of more interest in the real time applications. The DHT architecture has been implemented based on sub matrix multiplication method to compute M point DHT, which uses N point architecture recursively and where M is an integer multiples of N. The DHT architecture and state of the art architecture are coded in VHDL for 16 bit word length and ASIC implementation is carried out using UMC 90 - nm technology @V DD = 1V and @ 1MHZ clock frequency. The proposed architecture implementation and experimental comparison results show that the DHT design is two times faster than state of the art architecture.
State-of-the-art in Heterogeneous Computing

DOE PAGES

Brodtkorb, Andre R.; Dyken, Christopher; Hagen, Trond R.; ...

2010-01-01

Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, availablemore » software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.« less
High-Speed Systolic Array Testbed.

DTIC Science & Technology

1987-10-01

applications since the concept was introduced by H.T. Kung In 1978. This highly parallel architecture of nearet neighbor data communciation and...must be addressed. For instance, should bit-serial or bit parallei computation be utilized. Does the dynamic range of the candidate applications or...numericai stability of the algorithms used require computations In fixed point and Integer format or the architecturally more complex and slower floating
Video image processing

NASA Technical Reports Server (NTRS)

Murray, N. D.

1985-01-01

Current technology projections indicate a lack of availability of special purpose computing for Space Station applications. Potential functions for video image special purpose processing are being investigated, such as smoothing, enhancement, restoration and filtering, data compression, feature extraction, object detection and identification, pixel interpolation/extrapolation, spectral estimation and factorization, and vision synthesis. Also, architectural approaches are being identified and a conceptual design generated. Computationally simple algorithms will be research and their image/vision effectiveness determined. Suitable algorithms will be implimented into an overall architectural approach that will provide image/vision processing at video rates that are flexible, selectable, and programmable. Information is given in the form of charts, diagrams and outlines.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Manteuffel, T.A.

The objective of this project is the development of numerical solution techniques for deterministic models of the transport of neutral and charged particles and the demonstration of their effectiveness in both a production environment and on advanced architecture computers. The primary focus is on various versions of the linear Boltzman equation. These equations are fundamental in many important applications. This project is an attempt to integrate the development of numerical algorithms with the process of developing production software. A major thrust of this reject will be the implementation of these algorithms on advanced architecture machines that reside at the Advancedmore » Computing Laboratory (ACL) at Los Alamos National Laboratories (LANL).« less

Advances in Orion's On-Orbit Guidance and Targeting System Architecture

NASA Technical Reports Server (NTRS)

Scarritt, Sara K.; Fill, Thomas; Robinson, Shane

2015-01-01

NASA's manned spaceflight programs have a rich history of advancing onboard guidance and targeting technology. In order to support future missions, the guidance and targeting architecture for the Orion Multi-Purpose Crew Vehicle must be able to operate in complete autonomy, without any support from the ground. Orion's guidance and targeting system must be sufficiently flexible to easily adapt to a wide array of undecided future missions, yet also not cause an undue computational burden on the flight computer. This presents a unique design challenge from the perspective of both algorithm development and system architecture construction. The present work shows how Orion's guidance and targeting system addresses these challenges. On the algorithm side, the system advances the state-of-the-art by: (1) steering burns with a simple closed-loop guidance strategy based on Shuttle heritage, and (2) planning maneuvers with a cutting-edge two-level targeting routine. These algorithms are then placed into an architecture designed to leverage the advantages of each and ensure that they function in concert with one another. The resulting system is characterized by modularity and simplicity. As such, it is adaptable to the on-orbit phases of any future mission that Orion may attempt.
Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations

NASA Astrophysics Data System (ADS)

Detrixhe, Miles; Gibou, Frédéric

2016-10-01

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
Parallel Processing Systems for Passive Ranging During Helicopter Flight

NASA Technical Reports Server (NTRS)

Sridhar, Bavavar; Suorsa, Raymond E.; Showman, Robert D. (Technical Monitor)

1994-01-01

The complexity of rotorcraft missions involving operations close to the ground result in high pilot workload. In order to allow a pilot time to perform mission-oriented tasks, sensor-aiding and automation of some of the guidance and control functions are highly desirable. Images from an electro-optical sensor provide a covert way of detecting objects in the flight path of a low-flying helicopter. Passive ranging consists of processing a sequence of images using techniques based on optical low computation and recursive estimation. The passive ranging algorithm has to extract obstacle information from imagery at rates varying from five to thirty or more frames per second depending on the helicopter speed. We have implemented and tested the passive ranging algorithm off-line using helicopter-collected images. However, the real-time data and computation requirements of the algorithm are beyond the capability of any off-the-shelf microprocessor or digital signal processor. This paper describes the computational requirements of the algorithm and uses parallel processing technology to meet these requirements. Various issues in the selection of a parallel processing architecture are discussed and four different computer architectures are evaluated regarding their suitability to process the algorithm in real-time. Based on this evaluation, we conclude that real-time passive ranging is a realistic goal and can be achieved with a short time.
On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences

PubMed Central

Thiyagalingam, Jeyarajan; Goodman, Daniel; Schnabel, Julia A.; Trefethen, Anne; Grau, Vicente

2011-01-01

Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in resolution, dimensionality of the images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. Towards this, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate the mapping of an enhanced motion estimation algorithm to novel GPU-specific architectures, the resulting challenges and benefits therein. Using a database of three-dimensional image sequences, we show that the mapping leads to substantial performance gains, up to a factor of 60, and can provide near-real-time experience. We also show how architectural peculiarities of these devices can be best exploited in the benefit of algorithms, most specifically for addressing the challenges related to their access patterns and different memory configurations. Finally, we evaluate the performance of the algorithm on three different GPU architectures and perform a comprehensive analysis of the results. PMID:21869880
Semivariogram Analysis of Bone Images Implemented on FPGA Architectures.

PubMed

Shirvaikar, Mukul; Lagadapati, Yamuna; Dong, Xuanliang

2017-03-01

Osteoporotic fractures are a major concern for the healthcare of elderly and female populations. Early diagnosis of patients with a high risk of osteoporotic fractures can be enhanced by introducing second-order statistical analysis of bone image data using techniques such as variogram analysis. Such analysis is computationally intensive thereby creating an impediment for introduction into imaging machines found in common clinical settings. This paper investigates the fast implementation of the semivariogram algorithm, which has been proven to be effective in modeling bone strength, and should be of interest to readers in the areas of computer-aided diagnosis and quantitative image analysis. The semivariogram is a statistical measure of the spatial distribution of data, and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. A semi-variance, γ ( h ), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h . Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O ( n 2 ) Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from DXA scans are utilized for the experiments. Implementation results show that a significant advantage in computational speed is attained by the architectures with respect to implementation on a personal computer with an Intel i7 multi-core processor.
Semivariogram Analysis of Bone Images Implemented on FPGA Architectures

PubMed Central

Shirvaikar, Mukul; Lagadapati, Yamuna; Dong, Xuanliang

2016-01-01

Osteoporotic fractures are a major concern for the healthcare of elderly and female populations. Early diagnosis of patients with a high risk of osteoporotic fractures can be enhanced by introducing second-order statistical analysis of bone image data using techniques such as variogram analysis. Such analysis is computationally intensive thereby creating an impediment for introduction into imaging machines found in common clinical settings. This paper investigates the fast implementation of the semivariogram algorithm, which has been proven to be effective in modeling bone strength, and should be of interest to readers in the areas of computer-aided diagnosis and quantitative image analysis. The semivariogram is a statistical measure of the spatial distribution of data, and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. A semi-variance, γ(h), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O (n2) Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from DXA scans are utilized for the experiments. Implementation results show that a significant advantage in computational speed is attained by the architectures with respect to implementation on a personal computer with an Intel i7 multi-core processor. PMID:28428829
Specialized Computer Systems for Environment Visualization

NASA Astrophysics Data System (ADS)

Al-Oraiqat, Anas M.; Bashkov, Evgeniy A.; Zori, Sergii A.

2018-06-01

The need for real time image generation of landscapes arises in various fields as part of tasks solved by virtual and augmented reality systems, as well as geographic information systems. Such systems provide opportunities for collecting, storing, analyzing and graphically visualizing geographic data. Algorithmic and hardware software tools for increasing the realism and efficiency of the environment visualization in 3D visualization systems are proposed. This paper discusses a modified path tracing algorithm with a two-level hierarchy of bounding volumes and finding intersections with Axis-Aligned Bounding Box. The proposed algorithm eliminates the branching and hence makes the algorithm more suitable to be implemented on the multi-threaded CPU and GPU. A modified ROAM algorithm is used to solve the qualitative visualization of reliefs' problems and landscapes. The algorithm is implemented on parallel systems—cluster and Compute Unified Device Architecture-networks. Results show that the implementation on MPI clusters is more efficient than Graphics Processing Unit/Graphics Processing Clusters and allows real-time synthesis. The organization and algorithms of the parallel GPU system for the 3D pseudo stereo image/video synthesis are proposed. With realizing possibility analysis on a parallel GPU-architecture of each stage, 3D pseudo stereo synthesis is performed. An experimental prototype of a specialized hardware-software system 3D pseudo stereo imaging and video was developed on the CPU/GPU. The experimental results show that the proposed adaptation of 3D pseudo stereo imaging to the architecture of GPU-systems is efficient. Also it accelerates the computational procedures of 3D pseudo-stereo synthesis for the anaglyph and anamorphic formats of the 3D stereo frame without performing optimization procedures. The acceleration is on average 11 and 54 times for test GPUs.
An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

NASA Astrophysics Data System (ADS)

Vincenti, H.; Lobet, M.; Lehe, R.; Sasanka, R.; Vay, J.-L.

2017-01-01

In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈ 20 pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of × 2 to × 2.5 speed-up in double precision for particle shape factor of orders 1- 3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles).
GOES-R GS Product Generation Infrastructure Operations

NASA Astrophysics Data System (ADS)

Blanton, M.; Gundy, J.

2012-12-01

GOES-R GS Product Generation Infrastructure Operations: The GOES-R Ground System (GS) will produce a much larger set of products with higher data density than previous GOES systems. This requires considerably greater compute and memory resources to achieve the necessary latency and availability for these products. Over time, new algorithms could be added and existing ones removed or updated, but the GOES-R GS cannot go down during this time. To meet these GOES-R GS processing needs, the Harris Corporation will implement a Product Generation (PG) infrastructure that is scalable, extensible, extendable, modular and reliable. The primary parts of the PG infrastructure are the Service Based Architecture (SBA), which includes the Distributed Data Fabric (DDF). The SBA is the middleware that encapsulates and manages science algorithms that generate products. The SBA is divided into three parts, the Executive, which manages and configures the algorithm as a service, the Dispatcher, which provides data to the algorithm, and the Strategy, which determines when the algorithm can execute with the available data. The SBA is a distributed architecture, with services connected to each other over a compute grid and is highly scalable. This plug-and-play architecture allows algorithms to be added, removed, or updated without affecting any other services or software currently running and producing data. Algorithms require product data from other algorithms, so a scalable and reliable messaging is necessary. The SBA uses the DDF to provide this data communication layer between algorithms. The DDF provides an abstract interface over a distributed and persistent multi-layered storage system (memory based caching above disk-based storage) and an event system that allows algorithm services to know when data is available and to get the data that they need to begin processing when they need it. Together, the SBA and the DDF provide a flexible, high performance architecture that can meet the needs of product processing now and as they grow in the future.
Design requirements and development of an airborne descent path definition algorithm for time navigation

NASA Technical Reports Server (NTRS)

Izumi, K. H.; Thompson, J. L.; Groce, J. L.; Schwab, R. W.

1986-01-01

The design requirements for a 4D path definition algorithm are described. These requirements were developed for the NASA ATOPS as an extension of the Local Flow Management/Profile Descent algorithm. They specify the processing flow, functional and data architectures, and system input requirements, and recommended the addition of a broad path revision (reinitialization) function capability. The document also summarizes algorithm design enhancements and the implementation status of the algorithm on an in-house PDP-11/70 computer. Finally, the requirements for the pilot-computer interfaces, the lateral path processor, and guidance and steering function are described.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing

NASA Astrophysics Data System (ADS)

Johl, John T.; Baker, Nick C.

1988-10-01

The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

NASA Technical Reports Server (NTRS)

Gentzsch, W.

1982-01-01

Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.
FPGA-accelerated algorithm for the regular expression matching system

NASA Astrophysics Data System (ADS)

Russek, P.; Wiatr, K.

2015-01-01

This article describes an algorithm to support a regular expressions matching system. The goal was to achieve an attractive performance system with low energy consumption. The basic idea of the algorithm comes from a concept of the Bloom filter. It starts from the extraction of static sub-strings for strings of regular expressions. The algorithm is devised to gain from its decomposition into parts which are intended to be executed by custom hardware and the central processing unit (CPU). The pipelined custom processor architecture is proposed and a software algorithm explained accordingly. The software part of the algorithm was coded in C and runs on a processor from the ARM family. The hardware architecture was described in VHDL and implemented in field programmable gate array (FPGA). The performance results and required resources of the above experiments are given. An example of target application for the presented solution is computer and network security systems. The idea was tested on nearly 100,000 body-based viruses from the ClamAV virus database. The solution is intended for the emerging technology of clusters of low-energy computing nodes.
Parallel algorithm for computation of second-order sequential best rotations

NASA Astrophysics Data System (ADS)

Redif, Soydan; Kasap, Server

2013-12-01

Algorithms for computing an approximate polynomial matrix eigenvalue decomposition of para-Hermitian systems have emerged as a powerful, generic signal processing tool. A technique that has shown much success in this regard is the sequential best rotation (SBR2) algorithm. Proposed is a scheme for parallelising SBR2 with a view to exploiting the modern architectural features and inherent parallelism of field-programmable gate array (FPGA) technology. Experiments show that the proposed scheme can achieve low execution times while requiring minimal FPGA resources.
High Performance GPU-Based Fourier Volume Rendering.

PubMed

Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr

2015-01-01

Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its (N (2)log⁡N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are (N (3)) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Keyes, D.; McInnes, L. C.; Woodward, C.

This report is an outcome of the workshop Multiphysics Simulations: Challenges and Opportunities, sponsored by the Institute of Computing in Science (ICiS). Additional information about the workshop, including relevant reading and presentations on multiphysics issues in applications, algorithms, and software, is available via https://sites.google.com/site/icismultiphysics2011/. We consider multiphysics applications from algorithmic and architectural perspectives, where 'algorithmic' includes both mathematical analysis and computational complexity and 'architectural' includes both software and hardware environments. Many diverse multiphysics applications can be reduced, en route to their computational simulation, to a common algebraic coupling paradigm. Mathematical analysis of multiphysics coupling in this form is not alwaysmore » practical for realistic applications, but model problems representative of applications discussed herein can provide insight. A variety of software frameworks for multiphysics applications have been constructed and refined within disciplinary communities and executed on leading-edge computer systems. We examine several of these, expose some commonalities among them, and attempt to extrapolate best practices to future systems. From our study, we summarize challenges and forecast opportunities. We also initiate a modest suite of test problems encompassing features present in many applications.« less
Automated Software Acceleration in Programmable Logic for an Efficient NFFT Algorithm Implementation: A Case Study.

PubMed

Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian

2017-03-28

Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computation complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the very novel Software Develop System-on-Chip (SDSoC) evelopment tool that simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results shows that hardware acceleration significantly outperformed the software based implementation.
Automated Software Acceleration in Programmable Logic for an Efficient NFFT Algorithm Implementation: A Case Study

PubMed Central

Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian

2017-01-01

Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computation complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the very novel Software Develop System-on-Chip (SDSoC) evelopment tool that simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results shows that hardware acceleration significantly outperformed the software based implementation. PMID:28350358
Job Scheduling in a Heterogeneous Grid Environment

NASA Technical Reports Server (NTRS)

Shan, Hong-Zhang; Smith, Warren; Oliker, Leonid; Biswas, Rupak

2004-01-01

Computational grids have the potential for solving large-scale scientific problems using heterogeneous and geographically distributed resources. However, a number of major technical hurdles must be overcome before this potential can be realized. One problem that is critical to effective utilization of computational grids is the efficient scheduling of jobs. This work addresses this problem by describing and evaluating a grid scheduling architecture and three job migration algorithms. The architecture is scalable and does not assume control of local site resources. The job migration policies use the availability and performance of computer systems, the network bandwidth available between systems, and the volume of input and output data associated with each job. An extensive performance comparison is presented using real workloads from leading computational centers. The results, based on several key metrics, demonstrate that the performance of our distributed migration algorithms is significantly greater than that of a local scheduling framework and comparable to a non-scalable global scheduling approach.
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu; University of California Santa Barbara, Santa Barbara, CA, 93106; Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling,more » and show state-of-the-art speedup values for the fast sweeping method.« less

Kanerva's sparse distributed memory: An associative memory algorithm well-suited to the Connection Machine

NASA Technical Reports Server (NTRS)

Rogers, David

1988-01-01

The advent of the Connection Machine profoundly changes the world of supercomputers. The highly nontraditional architecture makes possible the exploration of algorithms that were impractical for standard Von Neumann architectures. Sparse distributed memory (SDM) is an example of such an algorithm. Sparse distributed memory is a particularly simple and elegant formulation for an associative memory. The foundations for sparse distributed memory are described, and some simple examples of using the memory are presented. The relationship of sparse distributed memory to three important computational systems is shown: random-access memory, neural networks, and the cerebellum of the brain. Finally, the implementation of the algorithm for sparse distributed memory on the Connection Machine is discussed.
Vectorization, threading, and cache-blocking considerations for hydrocodes on emerging architectures

DOE PAGES

Fung, J.; Aulwes, R. T.; Bement, M. T.; ...

2015-07-14

This work reports on considerations for improving computational performance in preparation for current and expected changes to computer architecture. The algorithms studied will include increasingly complex prototypes for radiation hydrodynamics codes, such as gradient routines and diffusion matrix assembly (e.g., in [1-6]). The meshes considered for the algorithms are structured or unstructured meshes. The considerations applied for performance improvements are meant to be general in terms of architecture (not specifically graphical processing unit (GPUs) or multi-core machines, for example) and include techniques for vectorization, threading, tiling, and cache blocking. Out of a survey of optimization techniques on applications such asmore » diffusion and hydrodynamics, we make general recommendations with a view toward making these techniques conceptually accessible to the applications code developer. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.« less
Low complexity 1D IDCT for 16-bit parallel architectures

NASA Astrophysics Data System (ADS)

Bivolarski, Lazar

2007-09-01

This paper shows that using the Loeffler, Ligtenberg, and Moschytz factorization of 8-point IDCT [2] one-dimensional (1-D) algorithm as a fast approximation of the Discrete Cosine Transform (DCT) and using only 16 bit numbers, it is possible to create in an IEEE 1180-1990 compliant and multiplierless algorithm with low computational complexity. This algorithm as characterized by its structure is efficiently implemented on parallel high performance architectures as well as due to its low complexity is sufficient for wide range of other architectures. Additional constraint on this work was the requirement of compliance with the existing MPEG standards. The hardware implementation complexity and low resources where also part of the design criteria for this algorithm. This implementation is also compliant with the precision requirements described in MPEG IDCT precision specification ISO/IEC 23002-1. Complexity analysis is performed as an extension to the simple measure of shifts and adds for the multiplierless algorithm as additional operations are included in the complexity measure to better describe the actual transform implementation complexity.
Elastic Cloud Computing Architecture and System for Heterogeneous Spatiotemporal Computing

NASA Astrophysics Data System (ADS)

Shi, X.

2017-10-01

Spatiotemporal computation implements a variety of different algorithms. When big data are involved, desktop computer or standalone application may not be able to complete the computation task due to limited memory and computing power. Now that a variety of hardware accelerators and computing platforms are available to improve the performance of geocomputation, different algorithms may have different behavior on different computing infrastructure and platforms. Some are perfect for implementation on a cluster of graphics processing units (GPUs), while GPUs may not be useful on certain kind of spatiotemporal computation. This is the same situation in utilizing a cluster of Intel's many-integrated-core (MIC) or Xeon Phi, as well as Hadoop or Spark platforms, to handle big spatiotemporal data. Furthermore, considering the energy efficiency requirement in general computation, Field Programmable Gate Array (FPGA) may be a better solution for better energy efficiency when the performance of computation could be similar or better than GPUs and MICs. It is expected that an elastic cloud computing architecture and system that integrates all of GPUs, MICs, and FPGAs could be developed and deployed to support spatiotemporal computing over heterogeneous data types and computational problems.
Web Image Retrieval Using Self-Organizing Feature Map.

ERIC Educational Resources Information Center

Wu, Qishi; Iyengar, S. Sitharama; Zhu, Mengxia

2001-01-01

Provides an overview of current image retrieval systems. Describes the architecture of the SOFM (Self Organizing Feature Maps) based image retrieval system, discussing the system architecture and features. Introduces the Kohonen model, and describes the implementation details of SOFM computation and its learning algorithm. Presents a test example…
Three-dimensional photoacoustic tomography based on graphics-processing-unit-accelerated finite element method.

PubMed

Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying

2013-12-01

Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphic-processing-unit parallel frame named the "compute unified device architecture." A series of simulation experiments is carried out to test the accuracy and accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced by a factor of 38.9 with a GTX 580 graphics card using the improved method.
A novel VLSI processor architecture for supercomputing arrays

NASA Technical Reports Server (NTRS)

Venkateswaran, N.; Pattabiraman, S.; Devanathan, R.; Ahmed, Ashaf; Venkataraman, S.; Ganesh, N.

1993-01-01

Design of the processor element for general purpose massively parallel supercomputing arrays is highly complex and cost ineffective. To overcome this, the architecture and organization of the functional units of the processor element should be such as to suit the diverse computational structures and simplify mapping of complex communication structures of different classes of algorithms. This demands that the computation and communication structures of different class of algorithms be unified. While unifying the different communication structures is a difficult process, analysis of a wide class of algorithms reveals that their computation structures can be expressed in terms of basic IP,IP,OP,CM,R,SM, and MAA operations. The execution of these operations is unified on the PAcube macro-cell array. Based on this PAcube macro-cell array, we present a novel processor element called the GIPOP processor, which has dedicated functional units to perform the above operations. The architecture and organization of these functional units are such to satisfy the two important criteria mentioned above. The structure of the macro-cell and the unification process has led to a very regular and simpler design of the GIPOP processor. The production cost of the GIPOP processor is drastically reduced as it is designed on high performance mask programmable PAcube arrays.
Hardware Acceleration of Adaptive Neural Algorithms.

DOE Office of Scientific and Technical Information (OSTI.GOV)

James, Conrad D.

As tradit ional numerical computing has faced challenges, researchers have turned towards alternative computing approaches to reduce power - per - computation metrics and improve algorithm performance. Here, we describe an approach towards non - conventional computing that strengthens the connection between machine learning and neuroscience concepts. The Hardware Acceleration of Adaptive Neural Algorithms (HAANA) project ha s develop ed neural machine learning algorithms and hardware for applications in image processing and cybersecurity. While machine learning methods are effective at extracting relevant features from many types of data, the effectiveness of these algorithms degrades when subjected to real - worldmore » conditions. Our team has generated novel neural - inspired approa ches to improve the resiliency and adaptability of machine learning algorithms. In addition, we have also designed and fabricated hardware architectures and microelectronic devices specifically tuned towards the training and inference operations of neural - inspired algorithms. Finally, our multi - scale simulation framework allows us to assess the impact of microelectronic device properties on algorithm performance.« less
Iterative algorithms for large sparse linear systems on parallel computers

NASA Technical Reports Server (NTRS)

Adams, L. M.

1982-01-01

Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Lattice QCD Calculations in Nuclear Physics towards the Exascale

NASA Astrophysics Data System (ADS)

Joo, Balint

2017-01-01

The combination of algorithmic advances and new highly parallel computing architectures are enabling lattice QCD calculations to tackle ever more complex problems in nuclear physics. In this talk I will review some computational challenges that are encountered in large scale cold nuclear physics campaigns such as those in hadron spectroscopy calculations. I will discuss progress in addressing these with algorithmic improvements such as multi-grid solvers and software for recent hardware architectures such as GPUs and Intel Xeon Phi, Knights Landing. Finally, I will highlight some current topics for research and development as we head towards the Exascale era This material is funded by the U.S. Department of Energy, Office Of Science, Offices of Nuclear Physics, High Energy Physics and Advanced Scientific Computing Research, as well as the Office of Nuclear Physics under contract DE-AC05-06OR23177.
F-Nets and Software Cabling: Deriving a Formal Model and Language for Portable Parallel Programming

NASA Technical Reports Server (NTRS)

DiNucci, David C.; Saini, Subhash (Technical Monitor)

1998-01-01

Parallel programming is still being based upon antiquated sequence-based definitions of the terms "algorithm" and "computation", resulting in programs which are architecture dependent and difficult to design and analyze. By focusing on obstacles inherent in existing practice, a more portable model is derived here, which is then formalized into a model called Soviets which utilizes a combination of imperative and functional styles. This formalization suggests more general notions of algorithm and computation, as well as insights into the meaning of structured programming in a parallel setting. To illustrate how these principles can be applied, a very-high-level graphical architecture-independent parallel language, called Software Cabling, is described, with many of the features normally expected from today's computer languages (e.g. data abstraction, data parallelism, and object-based programming constructs).
Accelerating artificial intelligence with reconfigurable computing

NASA Astrophysics Data System (ADS)

Cieszewski, Radoslaw

Reconfigurable computing is emerging as an important area of research in computer architectures and software systems. Many algorithms can be greatly accelerated by placing the computationally intense portions of an algorithm into reconfigurable hardware. Reconfigurable computing combines many benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible, and can be changed over the lifetime of the system. Similar to an ASIC, reconfigurable systems provide a method to map circuits into hardware. Reconfigurable systems therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute operations of traditional processors, and possibly exploiting a greater level of parallelism. Such a field, where there is many different algorithms which can be accelerated, is an artificial intelligence. This paper presents example hardware implementations of Artificial Neural Networks, Genetic Algorithms and Expert Systems.
Practical Sub-Nyquist Sampling via Array-Based Compressed Sensing Receiver Architecture

DTIC Science & Technology

2016-07-10

different array ele- ments at different sub-Nyquist sampling rates. Signal processing inspired by the sparse fast Fourier transform allows for signal...reconstruction algorithms can be computationally demanding (REF). The related sparse Fourier transform algorithms aim to reduce the processing time nec- essary to...compute the DFT of frequency-sparse signals [7]. In particular, the sparse fast Fourier transform (sFFT) achieves processing time better than the
Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

NASA Astrophysics Data System (ADS)

Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

2017-12-01

As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
A high throughput architecture for a low complexity soft-output demapping algorithm

NASA Astrophysics Data System (ADS)

Ali, I.; Wasenmüller, U.; Wehn, N.

2015-11-01

Iterative channel decoders such as Turbo-Code and LDPC decoders show exceptional performance and therefore they are a part of many wireless communication receivers nowadays. These decoders require a soft input, i.e., the logarithmic likelihood ratio (LLR) of the received bits with a typical quantization of 4 to 6 bits. For computing the LLR values from a received complex symbol, a soft demapper is employed in the receiver. The implementation cost of traditional soft-output demapping methods is relatively large in high order modulation systems, and therefore low complexity demapping algorithms are indispensable in low power receivers. In the presence of multiple wireless communication standards where each standard defines multiple modulation schemes, there is a need to have an efficient demapper architecture covering all the flexibility requirements of these standards. Another challenge associated with hardware implementation of the demapper is to achieve a very high throughput in double iterative systems, for instance, MIMO and Code-Aided Synchronization. In this paper, we present a comprehensive communication and hardware performance evaluation of low complexity soft-output demapping algorithms to select the best algorithm for implementation. The main goal of this work is to design a high throughput, flexible, and area efficient architecture. We describe architectures to execute the investigated algorithms. We implement these architectures on a FPGA device to evaluate their hardware performance. The work has resulted in a hardware architecture based on the figured out best low complexity algorithm delivering a high throughput of 166 Msymbols/second for Gray mapped 16-QAM modulation on Virtex-5. This efficient architecture occupies only 127 slice registers, 248 slice LUTs and 2 DSP48Es.
A parallel computing engine for a class of time critical processes.

PubMed

Nabhan, T M; Zomaya, A Y

1997-01-01

This paper focuses on the efficient parallel implementation of systems of numerically intensive nature over loosely coupled multiprocessor architectures. These analytical models are of significant importance to many real-time systems that have to meet severe time constants. A parallel computing engine (PCE) has been developed in this work for the efficient simplification and the near optimal scheduling of numerical models over the different cooperating processors of the parallel computer. First, the analytical system is efficiently coded in its general form. The model is then simplified by using any available information (e.g., constant parameters). A task graph representing the interconnections among the different components (or equations) is generated. The graph can then be compressed to control the computation/communication requirements. The task scheduler employs a graph-based iterative scheme, based on the simulated annealing algorithm, to map the vertices of the task graph onto a Multiple-Instruction-stream Multiple-Data-stream (MIMD) type of architecture. The algorithm uses a nonanalytical cost function that properly considers the computation capability of the processors, the network topology, the communication time, and congestion possibilities. Moreover, the proposed technique is simple, flexible, and computationally viable. The efficiency of the algorithm is demonstrated by two case studies with good results.
Hybrid architecture for building secure sensor networks

NASA Astrophysics Data System (ADS)

Owens, Ken R., Jr.; Watkins, Steve E.

2012-04-01

Sensor networks have various communication and security architectural concerns. Three approaches are defined to address these concerns for sensor networks. The first area is the utilization of new computing architectures that leverage embedded virtualization software on the sensor. Deploying a small, embedded virtualization operating system on the sensor nodes that is designed to communicate to low-cost cloud computing infrastructure in the network is the foundation to delivering low-cost, secure sensor networks. The second area focuses on securing the sensor. Sensor security components include developing an identification scheme, and leveraging authentication algorithms and protocols that address security assurance within the physical, communication network, and application layers. This function will primarily be accomplished through encrypting the communication channel and integrating sensor network firewall and intrusion detection/prevention components to the sensor network architecture. Hence, sensor networks will be able to maintain high levels of security. The third area addresses the real-time and high priority nature of the data that sensor networks collect. This function requires that a quality-of-service (QoS) definition and algorithm be developed for delivering the right data at the right time. A hybrid architecture is proposed that combines software and hardware features to handle network traffic with diverse QoS requirements.
A high performance parallel computing architecture for robust image features

NASA Astrophysics Data System (ADS)

Zhou, Renyan; Liu, Leibo; Wei, Shaojun

2014-03-01

A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.
Algorithms and software for solving finite element equations on serial and parallel architectures

NASA Technical Reports Server (NTRS)

George, Alan

1989-01-01

Over the past 15 years numerous new techniques have been developed for solving systems of equations and eigenvalue problems arising in finite element computations. A package called SPARSPAK has been developed by the author and his co-workers which exploits these new methods. The broad objective of this research project is to incorporate some of this software in the Computational Structural Mechanics (CSM) testbed, and to extend the techniques for use on multiprocessor architectures.
An information-theoretic approach to motor action decoding with a reconfigurable parallel architecture.

PubMed

Craciun, Stefan; Brockmeier, Austin J; George, Alan D; Lam, Herman; Príncipe, José C

2011-01-01

Methods for decoding movements from neural spike counts using adaptive filters often rely on minimizing the mean-squared error. However, for non-Gaussian distribution of errors, this approach is not optimal for performance. Therefore, rather than using probabilistic modeling, we propose an alternate non-parametric approach. In order to extract more structure from the input signal (neuronal spike counts) we propose using minimum error entropy (MEE), an information-theoretic approach that minimizes the error entropy as part of an iterative cost function. However, the disadvantage of using MEE as the cost function for adaptive filters is the increase in computational complexity. In this paper we present a comparison between the decoding performance of the analytic Wiener filter and a linear filter trained with MEE, which is then mapped to a parallel architecture in reconfigurable hardware tailored to the computational needs of the MEE filter. We observe considerable speedup from the hardware design. The adaptation of filter weights for the multiple-input, multiple-output linear filters, necessary in motor decoding, is a highly parallelizable algorithm. It can be decomposed into many independent computational blocks with a parallel architecture readily mapped to a field-programmable gate array (FPGA) and scales to large numbers of neurons. By pipelining and parallelizing independent computations in the algorithm, the proposed parallel architecture has sublinear increases in execution time with respect to both window size and filter order.

Two-dimensional nonsteady viscous flow simulation on the Navier-Stokes computer miniNode

NASA Technical Reports Server (NTRS)

Nosenchuck, Daniel M.; Littman, Michael G.; Flannery, William

1986-01-01

The needs of large-scale scientific computation are outpacing the growth in performance of mainframe supercomputers. In particular, problems in fluid mechanics involving complex flow simulations require far more speed and capacity than that provided by current and proposed Class VI supercomputers. To address this concern, the Navier-Stokes Computer (NSC) was developed. The NSC is a parallel-processing machine, comprised of individual Nodes, each comparable in performance to current supercomputers. The global architecture is that of a hypercube, and a 128-Node NSC has been designed. New architectural features, such as a reconfigurable many-function ALU pipeline and a multifunction memory-ALU switch, have provided the capability to efficiently implement a wide range of algorithms. Efficient algorithms typically involve numerically intensive tasks, which often include conditional operations. These operations may be efficiently implemented on the NSC without, in general, sacrificing vector-processing speed. To illustrate the architecture, programming, and several of the capabilities of the NSC, the simulation of two-dimensional, nonsteady viscous flows on a prototype Node, called the miniNode, is presented.
Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics

NASA Technical Reports Server (NTRS)

Farhat, Charbel; Lesoinne, Michel

1993-01-01

Most of the recently proposed computational methods for solving partial differential equations on multiprocessor architectures stem from the 'divide and conquer' paradigm and involve some form of domain decomposition. For those methods which also require grids of points or patches of elements, it is often necessary to explicitly partition the underlying mesh, especially when working with local memory parallel processors. In this paper, a family of cost-effective algorithms for the automatic partitioning of arbitrary two- and three-dimensional finite element and finite difference meshes is presented and discussed in view of a domain decomposed solution procedure and parallel processing. The influence of the algorithmic aspects of a solution method (implicit/explicit computations), and the architectural specifics of a multiprocessor (SIMD/MIMD, startup/transmission time), on the design of a mesh partitioning algorithm are discussed. The impact of the partitioning strategy on load balancing, operation count, operator conditioning, rate of convergence and processor mapping is also addressed. Finally, the proposed mesh decomposition algorithms are demonstrated with realistic examples of finite element, finite volume, and finite difference meshes associated with the parallel solution of solid and fluid mechanics problems on the iPSC/2 and iPSC/860 multiprocessors.
Architecture Framework for Trapped-Ion Quantum Computer based on Performance Simulation Tool

NASA Astrophysics Data System (ADS)

Ahsan, Muhammad

The challenge of building scalable quantum computer lies in striking appropriate balance between designing a reliable system architecture from large number of faulty computational resources and improving the physical quality of system components. The detailed investigation of performance variation with physics of the components and the system architecture requires adequate performance simulation tool. In this thesis we demonstrate a software tool capable of (1) mapping and scheduling the quantum circuit on a realistic quantum hardware architecture with physical resource constraints, (2) evaluating the performance metrics such as the execution time and the success probability of the algorithm execution, and (3) analyzing the constituents of these metrics and visualizing resource utilization to identify system components which crucially define the overall performance. Using this versatile tool, we explore vast design space for modular quantum computer architecture based on trapped ions. We find that while success probability is uniformly determined by the fidelity of physical quantum operation, the execution time is a function of system resources invested at various layers of design hierarchy. At physical level, the number of lasers performing quantum gates, impact the latency of the fault-tolerant circuit blocks execution. When these blocks are used to construct meaningful arithmetic circuit such as quantum adders, the number of ancilla qubits for complicated non-clifford gates and entanglement resources to establish long-distance communication channels, become major performance limiting factors. Next, in order to factorize large integers, these adders are assembled into modular exponentiation circuit comprising bulk of Shor's algorithm. At this stage, the overall scaling of resource-constraint performance with the size of problem, describes the effectiveness of chosen design. By matching the resource investment with the pace of advancement in hardware technology, we find optimal designs for different types of quantum adders. Conclusively, we show that 2,048-bit Shor's algorithm can be reliably executed within the resource budget of 1.5 million qubits.
Implementation in an FPGA circuit of Edge detection algorithm based on the Discrete Wavelet Transforms

NASA Astrophysics Data System (ADS)

Bouganssa, Issam; Sbihi, Mohamed; Zaim, Mounia

2017-07-01

The 2D Discrete Wavelet Transform (DWT) is a computationally intensive task that is usually implemented on specific architectures in many imaging systems in real time. In this paper, a high throughput edge or contour detection algorithm is proposed based on the discrete wavelet transform. A technique for applying the filters on the three directions (Horizontal, Vertical and Diagonal) of the image is used to present the maximum of the existing contours. The proposed architectures were designed in VHDL and mapped to a Xilinx Sparten6 FPGA. The results of the synthesis show that the proposed architecture has a low area cost and can operate up to 100 MHz, which can perform 2D wavelet analysis for a sequence of images while maintaining the flexibility of the system to support an adaptive algorithm.
Design of an FMCW radar baseband signal processing system for automotive application.

PubMed

Lin, Jau-Jr; Li, Yuan-Ping; Hsu, Wei-Chiang; Lee, Ta-Sung

2016-01-01

For a typical FMCW automotive radar system, a new design of baseband signal processing architecture and algorithms is proposed to overcome the ghost targets and overlapping problems in the multi-target detection scenario. To satisfy the short measurement time constraint without increasing the RF front-end loading, a three-segment waveform with different slopes is utilized. By introducing a new pairing mechanism and a spatial filter design algorithm, the proposed detection architecture not only provides high accuracy and reliability, but also requires low pairing time and computational loading. This proposed baseband signal processing architecture and algorithms balance the performance and complexity, and are suitable to be implemented in a real automotive radar system. Field measurement results demonstrate that the proposed automotive radar signal processing system can perform well in a realistic application scenario.
Study of the mapping of Navier-Stokes algorithms onto multiple-instruction/multiple-data-stream computers

NASA Technical Reports Server (NTRS)

Eberhardt, D. S.; Baganoff, D.; Stevens, K.

1984-01-01

Implicit approximate-factored algorithms have certain properties that are suitable for parallel processing. A particular computational fluid dynamics (CFD) code, using this algorithm, is mapped onto a multiple-instruction/multiple-data-stream (MIMD) computer architecture. An explanation of this mapping procedure is presented, as well as some of the difficulties encountered when trying to run the code concurrently. Timing results are given for runs on the Ames Research Center's MIMD test facility which consists of two VAX 11/780's with a common MA780 multi-ported memory. Speedups exceeding 1.9 for characteristic CFD runs were indicated by the timing results.
Architecture and data processing alternatives for the tse computer. Volume 4: Image rotation using tse operations

NASA Technical Reports Server (NTRS)

Kao, M. H.; Bodenheimer, R. E.

1976-01-01

The tse computer's capability of achieving image congruence between temporal and multiple images with misregistration due to rotational differences is reported. The coordinate transformations are obtained and a general algorithms is devised to perform image rotation using tse operations very efficiently. The details of this algorithm as well as its theoretical implications are presented. Step by step procedures of image registration are described in detail. Numerous examples are also employed to demonstrate the correctness and the effectiveness of the algorithms and conclusions and recommendations are made.
Ndarts

NASA Technical Reports Server (NTRS)

Jain, Abhinandan

2011-01-01

Ndarts software provides algorithms for computing quantities associated with the dynamics of articulated, rigid-link, multibody systems. It is designed as a general-purpose dynamics library that can be used for the modeling of robotic platforms, space vehicles, molecular dynamics, and other such applications. The architecture and algorithms in Ndarts are based on the Spatial Operator Algebra (SOA) theory for computational multibody and robot dynamics developed at JPL. It uses minimal, internal coordinate models. The algorithms are low-order, recursive scatter/ gather algorithms. In comparison with the earlier Darts++ software, this version has a more general and cleaner design needed to support a larger class of computational dynamics needs. It includes a frames infrastructure, allows algorithms to operate on subgraphs of the system, and implements lazy and deferred computation for better efficiency. Dynamics modeling modules such as Ndarts are core building blocks of control and simulation software for space, robotic, mechanism, bio-molecular, and material systems modeling.
Optimization of image processing algorithms on mobile platforms

NASA Astrophysics Data System (ADS)

Poudel, Pramod; Shirvaikar, Mukul

2011-03-01

This work presents a technique to optimize popular image processing algorithms on mobile platforms such as cell phones, net-books and personal digital assistants (PDAs). The increasing demand for video applications like context-aware computing on mobile embedded systems requires the use of computationally intensive image processing algorithms. The system engineer has a mandate to optimize them so as to meet real-time deadlines. A methodology to take advantage of the asymmetric dual-core processor, which includes an ARM and a DSP core supported by shared memory, is presented with implementation details. The target platform chosen is the popular OMAP 3530 processor for embedded media systems. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x Digital Signal Processor (DSP). The development platform was the BeagleBoard with 256 MB of NAND RAM and 256 MB SDRAM memory. The basic image correlation algorithm is chosen for benchmarking as it finds widespread application for various template matching tasks such as face-recognition. The basic algorithm prototypes conform to OpenCV, a popular computer vision library. OpenCV algorithms can be easily ported to the ARM core which runs a popular operating system such as Linux or Windows CE. However, the DSP is architecturally more efficient at handling DFT algorithms. The algorithms are tested on a variety of images and performance results are presented measuring the speedup obtained due to dual-core implementation. A major advantage of this approach is that it allows the ARM processor to perform important real-time tasks, while the DSP addresses performance-hungry algorithms.
Architecture and data processing alternatives for Tse computer. Volume 1: Tse logic design concepts and the development of image processing machine architectures

NASA Technical Reports Server (NTRS)

Rickard, D. A.; Bodenheimer, R. E.

1976-01-01

Digital computer components which perform two dimensional array logic operations (Tse logic) on binary data arrays are described. The properties of Golay transforms which make them useful in image processing are reviewed, and several architectures for Golay transform processors are presented with emphasis on the skeletonizing algorithm. Conventional logic control units developed for the Golay transform processors are described. One is a unique microprogrammable control unit that uses a microprocessor to control the Tse computer. The remaining control units are based on programmable logic arrays. Performance criteria are established and utilized to compare the various Golay transform machines developed. A critique of Tse logic is presented, and recommendations for additional research are included.
Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures

NASA Astrophysics Data System (ADS)

Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.

2016-12-01

The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which includes several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
Majorana-Based Fermionic Quantum Computation.

PubMed

O'Brien, T E; Rożek, P; Akhmerov, A R

2018-06-01

Because Majorana zero modes store quantum information nonlocally, they are protected from noise, and have been proposed as a building block for a quantum computer. We show how to use the same protection from noise to implement universal fermionic quantum computation. Our architecture requires only two Majorana modes to encode a fermionic quantum degree of freedom, compared to alternative implementations which require a minimum of four Majorana modes for a spin quantum degree of freedom. The fermionic degrees of freedom support both unitary coupled cluster variational quantum eigensolver and quantum phase estimation algorithms, proposed for quantum chemistry simulations. Because we avoid the Jordan-Wigner transformation, our scheme has a lower overhead for implementing both of these algorithms, allowing for simulation of the Trotterized Hubbard Hamiltonian in O(1) time per unitary step. We finally demonstrate magic state distillation in our fermionic architecture, giving a universal set of topologically protected fermionic quantum gates.
Majorana-Based Fermionic Quantum Computation

NASA Astrophysics Data System (ADS)

O'Brien, T. E.; RoŻek, P.; Akhmerov, A. R.

2018-06-01

Because Majorana zero modes store quantum information nonlocally, they are protected from noise, and have been proposed as a building block for a quantum computer. We show how to use the same protection from noise to implement universal fermionic quantum computation. Our architecture requires only two Majorana modes to encode a fermionic quantum degree of freedom, compared to alternative implementations which require a minimum of four Majorana modes for a spin quantum degree of freedom. The fermionic degrees of freedom support both unitary coupled cluster variational quantum eigensolver and quantum phase estimation algorithms, proposed for quantum chemistry simulations. Because we avoid the Jordan-Wigner transformation, our scheme has a lower overhead for implementing both of these algorithms, allowing for simulation of the Trotterized Hubbard Hamiltonian in O (1 ) time per unitary step. We finally demonstrate magic state distillation in our fermionic architecture, giving a universal set of topologically protected fermionic quantum gates.
VLSI implementation of a new LMS-based algorithm for noise removal in ECG signal

NASA Astrophysics Data System (ADS)

Satheeskumaran, S.; Sabrigiriraj, M.

2016-06-01

Least mean square (LMS)-based adaptive filters are widely deployed for removing artefacts in electrocardiogram (ECG) due to less number of computations. But they posses high mean square error (MSE) under noisy environment. The transform domain variable step-size LMS algorithm reduces the MSE at the cost of computational complexity. In this paper, a variable step-size delayed LMS adaptive filter is used to remove the artefacts from the ECG signal for improved feature extraction. The dedicated digital Signal processors provide fast processing, but they are not flexible. By using field programmable gate arrays, the pipelined architectures can be used to enhance the system performance. The pipelined architecture can enhance the operation efficiency of the adaptive filter and save the power consumption. This technique provides high signal-to-noise ratio and low MSE with reduced computational complexity; hence, it is a useful method for monitoring patients with heart-related problem.
An ATR architecture for algorithm development and testing

NASA Astrophysics Data System (ADS)

Breivik, Gøril M.; Løkken, Kristin H.; Brattli, Alvin; Palm, Hans C.; Haavardsholm, Trym

2013-05-01

A research platform with four cameras in the infrared and visible spectral domains is under development at the Norwegian Defence Research Establishment (FFI). The platform will be mounted on a high-speed jet aircraft and will primarily be used for image acquisition and for development and test of automatic target recognition (ATR) algorithms. The sensors on board produce large amounts of data, the algorithms can be computationally intensive and the data processing is complex. This puts great demands on the system architecture; it has to run in real-time and at the same time be suitable for algorithm development. In this paper we present an architecture for ATR systems that is designed to be exible, generic and efficient. The architecture is module based so that certain parts, e.g. specific ATR algorithms, can be exchanged without affecting the rest of the system. The modules are generic and can be used in various ATR system configurations. A software framework in C++ that handles large data ows in non-linear pipelines is used for implementation. The framework exploits several levels of parallelism and lets the hardware processing capacity be fully utilised. The ATR system is under development and has reached a first level that can be used for segmentation algorithm development and testing. The implemented system consists of several modules, and although their content is still limited, the segmentation module includes two different segmentation algorithms that can be easily exchanged. We demonstrate the system by applying the two segmentation algorithms to infrared images from sea trial recordings.
A Parallel Trade Study Architecture for Design Optimization of Complex Systems

NASA Technical Reports Server (NTRS)

Kim, Hongman; Mullins, James; Ragon, Scott; Soremekun, Grant; Sobieszczanski-Sobieski, Jaroslaw

2005-01-01

Design of a successful product requires evaluating many design alternatives in a limited design cycle time. This can be achieved through leveraging design space exploration tools and available computing resources on the network. This paper presents a parallel trade study architecture to integrate trade study clients and computing resources on a network using Web services. The parallel trade study solution is demonstrated to accelerate design of experiments, genetic algorithm optimization, and a cost as an independent variable (CAIV) study for a space system application.
Research and implementation of the algorithm for unwrapped and distortion correction basing on CORDIC for panoramic image

NASA Astrophysics Data System (ADS)

Zhang, Zhenhai; Li, Kejie; Wu, Xiaobing; Zhang, Shujiang

2008-03-01

The unwrapped and correcting algorithm based on Coordinate Rotation Digital Computer (CORDIC) and bilinear interpolation algorithm was presented in this paper, with the purpose of processing dynamic panoramic annular image. An original annular panoramic image captured by panoramic annular lens (PAL) can be unwrapped and corrected to conventional rectangular image without distortion, which is much more coincident with people's vision. The algorithm for panoramic image processing is modeled by VHDL and implemented in FPGA. The experimental results show that the proposed panoramic image algorithm for unwrapped and distortion correction has the lower computation complexity and the architecture for dynamic panoramic image processing has lower hardware cost and power consumption. And the proposed algorithm is valid.
Architecture and data processing alternatives for the TSE computer. Volume 3: Execution of a parallel counting algorithm using array logic (Tse) devices

NASA Technical Reports Server (NTRS)

Metcalfe, A. G.; Bodenheimer, R. E.

1976-01-01

A parallel algorithm for counting the number of logic-l elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.

PubMed

Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard

2015-02-01

Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
Code generator for implementing dual tree complex wavelet transform on reconfigurable architectures for mobile applications

PubMed Central

Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H. Fatih; Goren, Sezer

2016-01-01

The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed. PMID:27733925

Code generator for implementing dual tree complex wavelet transform on reconfigurable architectures for mobile applications.

PubMed

Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H Fatih; Goren, Sezer; Aydin, Nizamettin

2016-09-01

The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed.
GPU-accelerated computing for Lagrangian coherent structures of multi-body gravitational regimes

NASA Astrophysics Data System (ADS)

Lin, Mingpei; Xu, Ming; Fu, Xiaoyu

2017-04-01

Based on a well-established theoretical foundation, Lagrangian Coherent Structures (LCSs) have elicited widespread research on the intrinsic structures of dynamical systems in many fields, including the field of astrodynamics. Although the application of LCSs in dynamical problems seems straightforward theoretically, its associated computational cost is prohibitive. We propose a block decomposition algorithm developed on Compute Unified Device Architecture (CUDA) platform for the computation of the LCSs of multi-body gravitational regimes. In order to take advantage of GPU's outstanding computing properties, such as Shared Memory, Constant Memory, and Zero-Copy, the algorithm utilizes a block decomposition strategy to facilitate computation of finite-time Lyapunov exponent (FTLE) fields of arbitrary size and timespan. Simulation results demonstrate that this GPU-based algorithm can satisfy double-precision accuracy requirements and greatly decrease the time needed to calculate final results, increasing speed by approximately 13 times. Additionally, this algorithm can be generalized to various large-scale computing problems, such as particle filters, constellation design, and Monte-Carlo simulation.
Block Architecture Problem with Depth First Search Solution and Its Application

NASA Astrophysics Data System (ADS)

Rahim, Robbi; Abdullah, Dahlan; Simarmata, Janner; Pranolo, Andri; Saleh Ahmar, Ansari; Hidayat, Rahmat; Napitupulu, Darmawan; Nurdiyanto, Heri; Febriadi, Bayu; Zamzami, Z.

2018-01-01

Searching is a common process performed by many computer users, Raita algorithm is one algorithm that can be used to match and find information in accordance with the patterns entered. Raita algorithm applied to the file search application using java programming language and the results obtained from the testing process of the file search quickly and with accurate results and support many data types.
Solution of partial differential equations on vector and parallel computers

NASA Technical Reports Server (NTRS)

Ortega, J. M.; Voigt, R. G.

1985-01-01

The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.
Supercomputing resources empowering superstack with interactive and integrated systems

NASA Astrophysics Data System (ADS)

Rückemann, Claus-Peter

2012-09-01

This paper presents the results from the development and implementation of Superstack algorithms to be dynamically used with integrated systems and supercomputing resources. Processing of geophysical data, thus named geoprocessing, is an essential part of the analysis of geoscientific data. The theory of Superstack algorithms and the practical application on modern computing architectures was inspired by developments introduced with processing of seismic data on mainframes and within the last years leading to high end scientific computing applications. There are several stacking algorithms known but with low signal to noise ratio in seismic data the use of iterative algorithms like the Superstack can support analysis and interpretation. The new Superstack algorithms are in use with wave theory and optical phenomena on highly performant computing resources for huge data sets as well as for sophisticated application scenarios in geosciences and archaeology.
Development of seismic tomography software for hybrid supercomputers

NASA Astrophysics Data System (ADS)

Nikitin, Alexandr; Serdyukov, Alexandr; Duchkov, Anton

2015-04-01

Seismic tomography is a technique used for computing velocity model of geologic structure from first arrival travel times of seismic waves. The technique is used in processing of regional and global seismic data, in seismic exploration for prospecting and exploration of mineral and hydrocarbon deposits, and in seismic engineering for monitoring the condition of engineering structures and the surrounding host medium. As a consequence of development of seismic monitoring systems and increasing volume of seismic data, there is a growing need for new, more effective computational algorithms for use in seismic tomography applications with improved performance, accuracy and resolution. To achieve this goal, it is necessary to use modern high performance computing systems, such as supercomputers with hybrid architecture that use not only CPUs, but also accelerators and co-processors for computation. The goal of this research is the development of parallel seismic tomography algorithms and software package for such systems, to be used in processing of large volumes of seismic data (hundreds of gigabytes and more). These algorithms and software package will be optimized for the most common computing devices used in modern hybrid supercomputers, such as Intel Xeon CPUs, NVIDIA Tesla accelerators and Intel Xeon Phi co-processors. In this work, the following general scheme of seismic tomography is utilized. Using the eikonal equation solver, arrival times of seismic waves are computed based on assumed velocity model of geologic structure being analyzed. In order to solve the linearized inverse problem, tomographic matrix is computed that connects model adjustments with travel time residuals, and the resulting system of linear equations is regularized and solved to adjust the model. The effectiveness of parallel implementations of existing algorithms on target architectures is considered. During the first stage of this work, algorithms were developed for execution on supercomputers using multicore CPUs only, with preliminary performance tests showing good parallel efficiency on large numerical grids. Porting of the algorithms to hybrid supercomputers is currently ongoing.
An evolutionary algorithm that constructs recurrent neural networks.

PubMed

Angeline, P J; Saunders, G M; Pollack, J B

1994-01-01

Standard methods for simultaneously inducing the structure and weights of recurrent neural networks limit every task to an assumed class of architectures. Such a simplification is necessary since the interactions between network structure and function are not well understood. Evolutionary computations, which include genetic algorithms and evolutionary programming, are population-based search methods that have shown promise in many similarly complex tasks. This paper argues that genetic algorithms are inappropriate for network acquisition and describes an evolutionary program, called GNARL, that simultaneously acquires both the structure and weights for recurrent networks. GNARL's empirical acquisition method allows for the emergence of complex behaviors and topologies that are potentially excluded by the artificial architectural constraints imposed in standard network induction methods.
A static data flow simulation study at Ames Research Center

NASA Technical Reports Server (NTRS)

Barszcz, Eric; Howard, Lauri S.

1987-01-01

Demands in computational power, particularly in the area of computational fluid dynamics (CFD), led NASA Ames Research Center to study advanced computer architectures. One architecture being studied is the static data flow architecture based on research done by Jack B. Dennis at MIT. To improve understanding of this architecture, a static data flow simulator, written in Pascal, has been implemented for use on a Cray X-MP/48. A matrix multiply and a two-dimensional fast Fourier transform (FFT), two algorithms used in CFD work at Ames, have been run on the simulator. Execution times can vary by a factor of more than 2 depending on the partitioning method used to assign instructions to processing elements. Service time for matching tokens has proved to be a major bottleneck. Loop control and array address calculation overhead can double the execution time. The best sustained MFLOPS rates were less than 50% of the maximum capability of the machine.
An architecture for real-time vision processing

NASA Technical Reports Server (NTRS)

Chien, Chiun-Hong

1994-01-01

To study the feasibility of developing an architecture for real time vision processing, a task queue server and parallel algorithms for two vision operations were designed and implemented on an i860-based Mercury Computing System 860VS array processor. The proposed architecture treats each vision function as a task or set of tasks which may be recursively divided into subtasks and processed by multiple processors coordinated by a task queue server accessible by all processors. Each idle processor subsequently fetches a task and associated data from the task queue server for processing and posts the result to shared memory for later use. Load balancing can be carried out within the processing system without the requirement for a centralized controller. The author concludes that real time vision processing cannot be achieved without both sequential and parallel vision algorithms and a good parallel vision architecture.
Parallel conjugate gradient algorithms for manipulator dynamic simulation

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheld, Robert E.

1989-01-01

Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n sq) on a serial processor. A conjugate gradient algorithms is presented that provide greater efficiency using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log sub 2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log sub 2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
WATERLOPP V2/64: A highly parallel machine for numerical computation

NASA Astrophysics Data System (ADS)

Ostlund, Neil S.

1985-07-01

Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor comunication, global synchronization, and contention for shared resources if the performance of a large number ( n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attemps at a solution to the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.
GPU accelerated fuzzy connected image segmentation by using CUDA.

PubMed

Zhuge, Ying; Cao, Yong; Miller, Robert W

2009-01-01

Image segmentation techniques using fuzzy connectedness principles have shown their effectiveness in segmenting a variety of objects in several large applications in recent years. However, one problem of these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays commodity graphics hardware provides high parallel computing power. In this paper, we present a parallel fuzzy connected image segmentation algorithm on Nvidia's Compute Unified Device Architecture (CUDA) platform for segmenting large medical image data sets. Our experiments based on three data sets with small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 7.2x, 7.3x, and 14.4x, correspondingly, for the three data sets over the sequential implementation of fuzzy connected image segmentation algorithm on CPU.
Assessment of the information content of patterns: an algorithm

NASA Astrophysics Data System (ADS)

Daemi, M. Farhang; Beurle, R. L.

1991-12-01

A preliminary investigation confirmed the possibility of assessing the translational and rotational information content of simple artificial images. The calculation is tedious, and for more realistic patterns it is essential to implement the method on a computer. This paper describes an algorithm developed for this purpose which confirms the results of the preliminary investigation. Use of the algorithm facilitates much more comprehensive analysis of the combined effect of continuous rotation and fine translation, and paves the way for analysis of more realistic patterns. Owing to the volume of calculation involved in these algorithms, extensive computing facilities were necessary. The major part of the work was carried out using an ICL 3900 series mainframe computer as well as other powerful workstations such as a RISC architecture MIPS machine.
On some methods for improving time of reachability sets computation for the dynamic system control problem

NASA Astrophysics Data System (ADS)

Zimovets, Artem; Matviychuk, Alexander; Ushakov, Vladimir

2016-12-01

The paper presents two different approaches to reduce the time of computer calculation of reachability sets. First of these two approaches use different data structures for storing the reachability sets in the computer memory for calculation in single-threaded mode. Second approach is based on using parallel algorithms with reference to the data structures from the first approach. Within the framework of this paper parallel algorithm of approximate reachability set calculation on computer with SMP-architecture is proposed. The results of numerical modelling are presented in the form of tables which demonstrate high efficiency of parallel computing technology and also show how computing time depends on the used data structure.
The symbolic computation and automatic analysis of trajectories

NASA Technical Reports Server (NTRS)

Grossman, Robert

1991-01-01

Research was generally done on computation of trajectories of dynamical systems, especially control systems. Algorithms were further developed for rewriting expressions involving differential operators. The differential operators involved arise in the local analysis of nonlinear control systems. An initial design was completed of the system architecture for software to analyze nonlinear control systems using data base computing.
77 FR 37091 - Agency Information Collection Activities: Request for Comments for a New Information Collection

Federal Register 2010, 2011, 2012, 2013, 2014

2012-06-20

... analysis and design, and computer software design and coding. Given the fact that over $500 million were... acoustic algorithms, computer architecture, and source code that dated to the 1970s. Since that time... 2012. Version 3.0 is an entirely new, state-of-the-art computer program used for predicting noise...
Application of ant colony Algorithm and particle swarm optimization in architectural design

NASA Astrophysics Data System (ADS)

Song, Ziyi; Wu, Yunfa; Song, Jianhua

2018-02-01

By studying the development of ant colony algorithm and particle swarm algorithm, this paper expounds the core idea of the algorithm, explores the combination of algorithm and architectural design, sums up the application rules of intelligent algorithm in architectural design, and combines the characteristics of the two algorithms, obtains the research route and realization way of intelligent algorithm in architecture design. To establish algorithm rules to assist architectural design. Taking intelligent algorithm as the beginning of architectural design research, the authors provide the theory foundation of ant colony Algorithm and particle swarm algorithm in architectural design, popularize the application range of intelligent algorithm in architectural design, and provide a new idea for the architects.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about factor 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
Intricacies of modern supercomputing illustrated with recent advances in simulations of strongly correlated electron systems

NASA Astrophysics Data System (ADS)

Schulthess, Thomas C.

2013-03-01

The continued thousand-fold improvement in sustained application performance per decade on modern supercomputers keeps opening new opportunities for scientific simulations. But supercomputers have become very complex machines, built with thousands or tens of thousands of complex nodes consisting of multiple CPU cores or, most recently, a combination of CPU and GPU processors. Efficient simulations on such high-end computing systems require tailored algorithms that optimally map numerical methods to particular architectures. These intricacies will be illustrated with simulations of strongly correlated electron systems, where the development of quantum cluster methods, Monte Carlo techniques, as well as their optimal implementation by means of algorithms with improved data locality and high arithmetic density have gone hand in hand with evolving computer architectures. The present work would not have been possible without continued access to computing resources at the National Center for Computational Science of Oak Ridge National Laboratory, which is funded by the Facilities Division of the Office of Advanced Scientific Computing Research, and the Swiss National Supercomputing Center (CSCS) that is funded by ETH Zurich.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures

NASA Astrophysics Data System (ADS)

Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna

2010-09-01

Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.

Approximate Computing Techniques for Iterative Graph Algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Panyala, Ajay R.; Subasi, Omer; Halappanavar, Mahantesh

Approximate computing enables processing of large-scale graphs by trading off quality for performance. Approximate computing techniques have become critical not only due to the emergence of parallel architectures but also the availability of large scale datasets enabling data-driven discovery. Using two prototypical graph algorithms, PageRank and community detection, we present several approximate computing heuristics to scale the performance with minimal loss of accuracy. We present several heuristics including loop perforation, data caching, incomplete graph coloring and synchronization, and evaluate their efficiency. We demonstrate performance improvements of up to 83% for PageRank and up to 450x for community detection, with lowmore » impact of accuracy for both the algorithms. We expect the proposed approximate techniques will enable scalable graph analytics on data of importance to several applications in science and their subsequent adoption to scale similar graph algorithms.« less
Streaming parallel GPU acceleration of large-scale filter-based spiking neural networks.

PubMed

Slażyński, Leszek; Bohte, Sander

2012-01-01

The arrival of graphics processing (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of magnitude, the challenge is to develop fine-grained parallel algorithms to fully exploit the particulars of GPUs. Computation in a neural network is inherently parallel and thus a natural match for GPU architectures: given inputs, the internal state for each neuron can be updated in parallel. We show that for filter-based spiking neurons, like the Spike Response Model, the additive nature of membrane potential dynamics enables additional update parallelism. This also reduces the accumulation of numerical errors when using single precision computation, the native precision of GPUs. We further show that optimizing simulation algorithms and data structures to the GPU's architecture has a large pay-off: for example, matching iterative neural updating to the memory architecture of the GPU speeds up this simulation step by a factor of three to five. With such optimizations, we can simulate in better-than-realtime plausible spiking neural networks of up to 50 000 neurons, processing over 35 million spiking events per second.
Development of Parallel Architectures for Sensor Array Processing. Volume 1

DTIC Science & Technology

1993-08-01

required for the DOA estimation [ 1-7]. The Multiple Signal Classification ( MUSIC ) [ 1] and the Estimation of Signal Parameters by Rotational...manifold and the estimated subspace. Although MUSIC is a high resolution algorithm, it has several drawbacks including the fact that complete knowledge of...thoroughly, MUSIC algorithm was selected to develop special purpose hardware for real time computation. Summary of the MUSIC algorithm is as follows
Algorithms and Libraries

NASA Technical Reports Server (NTRS)

Dongarra, Jack

1998-01-01

This exploratory study initiated our inquiry into algorithms and applications that would benefit by latency tolerant approach to algorithm building, including the construction of new algorithms where appropriate. In a multithreaded execution, when a processor reaches a point where remote memory access is necessary, the request is sent out on the network and a context--switch occurs to a new thread of computation. This effectively masks a long and unpredictable latency due to remote loads, thereby providing tolerance to remote access latency. We began to develop standards to profile various algorithm and application parameters, such as the degree of parallelism, granularity, precision, instruction set mix, interprocessor communication, latency etc. These tools will continue to develop and evolve as the Information Power Grid environment matures. To provide a richer context for this research, the project also focused on issues of fault-tolerance and computation migration of numerical algorithms and software. During the initial phase we tried to increase our understanding of the bottlenecks in single processor performance. Our work began by developing an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. Based on the results we achieved in this study we are planning to study other architectures of interest, including development of cost models, and developing code generators appropriate to these architectures.
A biconjugate gradient type algorithm on massively parallel architectures

NASA Technical Reports Server (NTRS)

Freund, Roland W.; Hochbruck, Marlis

1991-01-01

The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
A Computationally Efficient Visual Saliency Algorithm Suitable for an Analog CMOS Implementation.

PubMed

D'Angelo, Robert; Wood, Richard; Lowry, Nathan; Freifeld, Geremy; Huang, Haiyao; Salthouse, Christopher D; Hollosi, Brent; Muresan, Matthew; Uy, Wes; Tran, Nhut; Chery, Armand; Poppe, Dorothy C; Sonkusale, Sameer

2018-06-27

Computer vision algorithms are often limited in their application by the large amount of data that must be processed. Mammalian vision systems mitigate this high bandwidth requirement by prioritizing certain regions of the visual field with neural circuits that select the most salient regions. This work introduces a novel and computationally efficient visual saliency algorithm for performing this neuromorphic attention-based data reduction. The proposed algorithm has the added advantage that it is compatible with an analog CMOS design while still achieving comparable performance to existing state-of-the-art saliency algorithms. This compatibility allows for direct integration with the analog-to-digital conversion circuitry present in CMOS image sensors. This integration leads to power savings in the converter by quantizing only the salient pixels. Further system-level power savings are gained by reducing the amount of data that must be transmitted and processed in the digital domain. The analog CMOS compatible formulation relies on a pulse width (i.e., time mode) encoding of the pixel data that is compatible with pulse-mode imagers and slope based converters often used in imager designs. This letter begins by discussing this time-mode encoding for implementing neuromorphic architectures. Next, the proposed algorithm is derived. Hardware-oriented optimizations and modifications to this algorithm are proposed and discussed. Next, a metric for quantifying saliency accuracy is proposed, and simulation results of this metric are presented. Finally, an analog synthesis approach for a time-mode architecture is outlined, and postsynthesis transistor-level simulations that demonstrate functionality of an implementation in a modern CMOS process are discussed.
Architectures and Applications for Scalable Quantum Information Systems

DTIC Science & Technology

2007-01-01

quantum computation models, such as adiabatic quantum computing , can be converted to quantum circuits. Therefore, in our design flow’s first phase...vol. 26, no. 5, pp. 1484–1509, 1997. [19] A. Childs, E. Farhi, and J. Preskill, “Robustness of adiabatic quantum computation ,” Phys. Rev. A, vol. 65...magnetic resonance computer with three quantum bits that simulates an adiabatic quantum optimization algorithm. Adiabatic
Polar exponential sensor arrays unify iconic and Hough space representation

NASA Technical Reports Server (NTRS)

Weiman, Carl F. R.

1990-01-01

The log-polar coordinate system, inherent in both polar exponential sensor arrays and log-polar remapped video imagery, is identical to the coordinate system of its corresponding Hough transform parameter space. The resulting unification of iconic and Hough domains simplifies computation for line recognition and eliminates the slope quantization problems inherent in the classical Cartesian Hough transform. The geometric organization of the algorithm is more amenable to massively parallel architectures than that of the Cartesian version. The neural architecture of the human visual cortex meets the geometric requirements to execute 'in-place' log-Hough algorithms of the kind described here.
Embodiment of Learning in Electro-Optical Signal Processors

NASA Astrophysics Data System (ADS)

Hermans, Michiel; Antonik, Piotr; Haelterman, Marc; Massar, Serge

2016-09-01

Delay-coupled electro-optical systems have received much attention for their dynamical properties and their potential use in signal processing. In particular, it has recently been demonstrated, using the artificial intelligence algorithm known as reservoir computing, that photonic implementations of such systems solve complex tasks such as speech recognition. Here, we show how the backpropagation algorithm can be physically implemented on the same electro-optical delay-coupled architecture used for computation with only minor changes to the original design. We find that, compared to when the backpropagation algorithm is not used, the error rate of the resulting computing device, evaluated on three benchmark tasks, decreases considerably. This demonstrates that electro-optical analog computers can embody a large part of their own training process, allowing them to be applied to new, more difficult tasks.
Embodiment of Learning in Electro-Optical Signal Processors.

PubMed

Hermans, Michiel; Antonik, Piotr; Haelterman, Marc; Massar, Serge

2016-09-16

Delay-coupled electro-optical systems have received much attention for their dynamical properties and their potential use in signal processing. In particular, it has recently been demonstrated, using the artificial intelligence algorithm known as reservoir computing, that photonic implementations of such systems solve complex tasks such as speech recognition. Here, we show how the backpropagation algorithm can be physically implemented on the same electro-optical delay-coupled architecture used for computation with only minor changes to the original design. We find that, compared to when the backpropagation algorithm is not used, the error rate of the resulting computing device, evaluated on three benchmark tasks, decreases considerably. This demonstrates that electro-optical analog computers can embody a large part of their own training process, allowing them to be applied to new, more difficult tasks.
Performances of multiprocessor multidisk architectures for continuous media storage

NASA Astrophysics Data System (ADS)

Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

1996-03-01

Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
Rapid Human-Computer Interactive Conceptual Design of Mobile and Manipulative Robot Systems

DTIC Science & Technology

2015-05-19

algorithm based on Age-Fitness Pareto Optimization (AFPO) ([9]) with an additional user prefer- ence objective and a neural network-based user model, we...greater than 40, which is about 5 times further than any robot traveled in our experiments. 6 3.3 Methods The algorithm uses a client -server computational...architecture. The client here is an interactive pro- gram which takes a pair of controllers as input, simulates4 two copies of the robot with
Power Efficient Hardware Architecture of SHA-1 Algorithm for Trusted Mobile Computing

NASA Astrophysics Data System (ADS)

Kim, Mooseop; Ryou, Jaecheol

The Trusted Mobile Platform (TMP) is developed and promoted by the Trusted Computing Group (TCG), which is an industry standard body to enhance the security of the mobile computing environment. The built-in SHA-1 engine in TMP is one of the most important circuit blocks and contributes the performance of the whole platform because it is used as key primitives supporting platform integrity and command authentication. Mobile platforms have very stringent limitations with respect to available power, physical circuit area, and cost. Therefore special architecture and design methods for low power SHA-1 circuit are required. In this paper, we present a novel and efficient hardware architecture of low power SHA-1 design for TMP. Our low power SHA-1 hardware can compute 512-bit data block using less than 7,000 gates and has a power consumption about 1.1 mA on a 0.25μm CMOS process.
Parallel Signal Processing and System Simulation using aCe

NASA Technical Reports Server (NTRS)

Dorband, John E.; Aburdene, Maurice F.

2003-01-01

Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C based parallel language (ace C) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we will focus on some fundamental features of ace C and present a signal processing application (FFT).
A highly efficient 3D level-set grain growth algorithm tailored for ccNUMA architecture

NASA Astrophysics Data System (ADS)

Mießen, C.; Velinov, N.; Gottstein, G.; Barrales-Mora, L. A.

2017-12-01

A highly efficient simulation model for 2D and 3D grain growth was developed based on the level-set method. The model introduces modern computational concepts to achieve excellent performance on parallel computer architectures. Strong scalability was measured on cache-coherent non-uniform memory access (ccNUMA) architectures. To achieve this, the proposed approach considers the application of local level-set functions at the grain level. Ideal and non-ideal grain growth was simulated in 3D with the objective to study the evolution of statistical representative volume elements in polycrystals. In addition, microstructure evolution in an anisotropic magnetic material affected by an external magnetic field was simulated.
Relation of Parallel Discrete Event Simulation algorithms with physical models

NASA Astrophysics Data System (ADS)

Shchur, L. N.; Shchur, L. V.

2015-09-01

We extend concept of local simulation times in parallel discrete event simulation (PDES) in order to take into account architecture of the current hardware and software in high-performance computing. We shortly review previous research on the mapping of PDES on physical problems, and emphasise how physical results may help to predict parallel algorithms behaviour.
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System

PubMed Central

Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram

2014-01-01

We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second. PMID:24891848
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.

PubMed

Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram

2014-01-01

We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second.
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm

PubMed Central

Chen, Ying-Lun; Hwang, Wen-Jyi; Ke, Chi-En

2015-01-01

A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO), and the feature extraction is carried out by the generalized Hebbian algorithm (GHA). To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction. PMID:26287193
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm.

PubMed

Chen, Ying-Lun; Hwang, Wen-Jyi; Ke, Chi-En

2015-08-13

A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO), and the feature extraction is carried out by the generalized Hebbian algorithm (GHA). To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction.

Parallel computing works

DOE Office of Scientific and Technical Information (OSTI.GOV)

Not Available

An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of manymore » computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.« less
77 FR 37736 - Agency Information Collection Activities: Request for Comments for a New Information Collection

Federal Register 2010, 2011, 2012, 2013, 2014

2012-06-22

... analysis and design, and computer software design and coding. Given the fact that over $500 million were... acoustic algorithms, computer architecture, and source code that dated to the 1970s. Since that time... towards the end of 2012. Version 3.0 is an entirely new, state-of-the-art computer program used for...
A novel parallel architecture for local histogram equalization

NASA Astrophysics Data System (ADS)

Ohannessian, Mesrob I.; Choueiter, Ghinwa F.; Diab, Hassan

2005-07-01

Local histogram equalization is an image enhancement algorithm that has found wide application in the pre-processing stage of areas such as computer vision, pattern recognition and medical imaging. The computationally intensive nature of the procedure, however, is a main limitation when real time interactive applications are in question. This work explores the possibility of performing parallel local histogram equalization, using an array of special purpose elementary processors, through an HDL implementation that targets FPGA or ASIC platforms. A novel parallelization scheme is presented and the corresponding architecture is derived. The algorithm is reduced to pixel-level operations. Processing elements are assigned image blocks, to maintain a reasonable performance-cost ratio. To further simplify both processor and memory organizations, a bit-serial access scheme is used. A brief performance assessment is provided to illustrate and quantify the merit of the approach.
Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).

PubMed

Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel

2016-10-01

The Smith-Waterman algorithm has a great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in literature that exploit the different hardware-architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithms in different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
Performance of the Wavelet Decomposition on Massively Parallel Architectures

NASA Technical Reports Server (NTRS)

El-Ghazawi, Tarek A.; LeMoigne, Jacqueline; Zukor, Dorothy (Technical Monitor)

2001-01-01

Traditionally, Fourier Transforms have been utilized for performing signal analysis and representation. But although it is straightforward to reconstruct a signal from its Fourier transform, no local description of the signal is included in its Fourier representation. To alleviate this problem, Windowed Fourier transforms and then wavelet transforms have been introduced, and it has been proven that wavelets give a better localization than traditional Fourier transforms, as well as a better division of the time- or space-frequency plane than Windowed Fourier transforms. Because of these properties and after the development of several fast algorithms for computing the wavelet representation of any signal, in particular the Multi-Resolution Analysis (MRA) developed by Mallat, wavelet transforms have increasingly been applied to signal analysis problems, especially real-life problems, in which speed is critical. In this paper we present and compare efficient wavelet decomposition algorithms on different parallel architectures. We report and analyze experimental measurements, using NASA remotely sensed images. Results show that our algorithms achieve significant performance gains on current high performance parallel systems, and meet scientific applications and multimedia requirements. The extensive performance measurements collected over a number of high-performance computer systems have revealed important architectural characteristics of these systems, in relation to the processing demands of the wavelet decomposition of digital images.
Motion camera based on a custom vision sensor and an FPGA architecture

NASA Astrophysics Data System (ADS)

Arias-Estrada, Miguel

1998-09-01

A digital camera for custom focal plane arrays was developed. The camera allows the test and development of analog or mixed-mode arrays for focal plane processing. The camera is used with a custom sensor for motion detection to implement a motion computation system. The custom focal plane sensor detects moving edges at the pixel level using analog VLSI techniques. The sensor communicates motion events using the event-address protocol associated to a temporal reference. In a second stage, a coprocessing architecture based on a field programmable gate array (FPGA) computes the time-of-travel between adjacent pixels. The FPGA allows rapid prototyping and flexible architecture development. Furthermore, the FPGA interfaces the sensor to a compact PC computer which is used for high level control and data communication to the local network. The camera could be used in applications such as self-guided vehicles, mobile robotics and smart surveillance systems. The programmability of the FPGA allows the exploration of further signal processing like spatial edge detection or image segmentation tasks. The article details the motion algorithm, the sensor architecture, the use of the event- address protocol for velocity vector computation and the FPGA architecture used in the motion camera system.
BCM: toolkit for Bayesian analysis of Computational Models using samplers.

PubMed

Thijssen, Bram; Dijkstra, Tjeerd M H; Heskes, Tom; Wessels, Lodewyk F A

2016-10-21

Computational models in biology are characterized by a large degree of uncertainty. This uncertainty can be analyzed with Bayesian statistics, however, the sampling algorithms that are frequently used for calculating Bayesian statistical estimates are computationally demanding, and each algorithm has unique advantages and disadvantages. It is typically unclear, before starting an analysis, which algorithm will perform well on a given computational model. We present BCM, a toolkit for the Bayesian analysis of Computational Models using samplers. It provides efficient, multithreaded implementations of eleven algorithms for sampling from posterior probability distributions and for calculating marginal likelihoods. BCM includes tools to simplify the process of model specification and scripts for visualizing the results. The flexible architecture allows it to be used on diverse types of biological computational models. In an example inference task using a model of the cell cycle based on ordinary differential equations, BCM is significantly more efficient than existing software packages, allowing more challenging inference problems to be solved. BCM represents an efficient one-stop-shop for computational modelers wishing to use sampler-based Bayesian statistics.
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness

NASA Astrophysics Data System (ADS)

Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.

2014-09-01

Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Efficient fuzzy C-means architecture for image segmentation.

PubMed

Li, Hui-Ya; Hwang, Wen-Jyi; Chang, Chia-Yen

2011-01-01

This paper presents a novel VLSI architecture for image segmentation. The architecture is based on the fuzzy c-means algorithm with spatial constraint for reducing the misclassification rate. In the architecture, the usual iterative operations for updating the membership matrix and cluster centroid are merged into one single updating process to evade the large storage requirement. In addition, an efficient pipelined circuit is used for the updating process for accelerating the computational speed. Experimental results show that the the proposed circuit is an effective alternative for real-time image segmentation with low area cost and low misclassification rate.
A GPU-paralleled implementation of an enhanced face recognition algorithm

NASA Astrophysics Data System (ADS)

Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

2013-03-01

Face recognition algorithm based on compressed sensing and sparse representation is hotly argued in these years. The scheme of this algorithm increases recognition rate as well as anti-noise capability. However, the computational cost is expensive and has become a main restricting factor for real world applications. In this paper, we introduce a GPU-accelerated hybrid variant of face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out parallel optimization design to take full advantage of many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample size. Finally, Our pFRA, implemented with NVIDIA GPU and Computer Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix

NASA Technical Reports Server (NTRS)

Shroff, Gautam

1989-01-01

A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this parallel typically display only linear convergence. Sequential norm-reducing algorithms also exit and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm and certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.
Experimental comparison of two quantum computing architectures.

PubMed

Linke, Norbert M; Maslov, Dmitri; Roetteler, Martin; Debnath, Shantanu; Figgatt, Caroline; Landsman, Kevin A; Wright, Kenneth; Monroe, Christopher

2017-03-28

We run a selection of algorithms on two state-of-the-art 5-qubit quantum computers that are based on different technology platforms. One is a publicly accessible superconducting transmon device (www. ibm.com/ibm-q) with limited connectivity, and the other is a fully connected trapped-ion system. Even though the two systems have different native quantum interactions, both can be programed in a way that is blind to the underlying hardware, thus allowing a comparison of identical quantum algorithms between different physical systems. We show that quantum algorithms and circuits that use more connectivity clearly benefit from a better-connected system of qubits. Although the quantum systems here are not yet large enough to eclipse classical computers, this experiment exposes critical factors of scaling quantum computers, such as qubit connectivity and gate expressivity. In addition, the results suggest that codesigning particular quantum applications with the hardware itself will be paramount in successfully using quantum computers in the future.
Biclustering Protein Complex Interactions with a Biclique FindingAlgorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ding, Chris; Zhang, Anne Ya; Holbrook, Stephen

2006-12-01

Biclustering has many applications in text mining, web clickstream mining, and bioinformatics. When data entries are binary, the tightest biclusters become bicliques. We propose a flexible and highly efficient algorithm to compute bicliques. We first generalize the Motzkin-Straus formalism for computing the maximal clique from L{sub 1} constraint to L{sub p} constraint, which enables us to provide a generalized Motzkin-Straus formalism for computing maximal-edge bicliques. By adjusting parameters, the algorithm can favor biclusters with more rows less columns, or vice verse, thus increasing the flexibility of the targeted biclusters. We then propose an algorithm to solve the generalized Motzkin-Straus optimizationmore » problem. The algorithm is provably convergent and has a computational complexity of O(|E|) where |E| is the number of edges. It relies on a matrix vector multiplication and runs efficiently on most current computer architectures. Using this algorithm, we bicluster the yeast protein complex interaction network. We find that biclustering protein complexes at the protein level does not clearly reflect the functional linkage among protein complexes in many cases, while biclustering at protein domain level can reveal many underlying linkages. We show several new biologically significant results.« less
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine

NASA Technical Reports Server (NTRS)

Lee, C. S. G.; Lin, C. T.

1989-01-01

The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
FPGA-based architecture for motion recovering in real-time

NASA Astrophysics Data System (ADS)

Arias-Estrada, Miguel; Maya-Rueda, Selene E.; Torres-Huitzil, Cesar

2002-03-01

A key problem in the computer vision field is the measurement of object motion in a scene. The main goal is to compute an approximation of the 3D motion from the analysis of an image sequence. Once computed, this information can be used as a basis to reach higher level goals in different applications. Motion estimation algorithms pose a significant computational load for the sequential processors limiting its use in practical applications. In this work we propose a hardware architecture for motion estimation in real time based on FPGA technology. The technique used for motion estimation is Optical Flow due to its accuracy, and the density of velocity estimation, however other techniques are being explored. The architecture is composed of parallel modules working in a pipeline scheme to reach high throughput rates near gigaflops. The modules are organized in a regular structure to provide a high degree of flexibility to cover different applications. Some results will be presented and the real-time performance will be discussed and analyzed. The architecture is prototyped in an FPGA board with a Virtex device interfaced to a digital imager.
GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

NASA Astrophysics Data System (ADS)

Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.

2018-07-01

This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 81923 grid points for the scalar field computed with 8192 XK7 nodes.
Parallel compression/decompression-based datapath architecture for multibeam mask writers

NASA Astrophysics Data System (ADS)

Chaudhary, Narendra; Savari, Serap A.

2017-06-01

Multibeam electron beam systems will be used in the future for mask writing and for complimentary lithography. The major challenges of the multibeam systems are in meeting throughput requirements and in handling the large data volumes associated with writing grayscale data on the wafer. In terms of future communications and computational requirements Amdahl's Law suggests that a simple increase of computation power and parallelism may not be a sustainable solution. We propose a parallel data compression algorithm to exploit the sparsity of mask data and a grayscale video-like representation of data. To improve the communication and computational efficiency of these systems at the write time we propose an alternate datapath architecture partly motivated by multibeam direct write lithography and partly motivated by the circuit testing literature, where parallel decompression reduces clock cycles. We explain a deflection plate architecture inspired by NuFlare Technology's multibeam mask writing system and how our datapath architecture can be easily added to it to improve performance.
Parallel compression/decompression-based datapath architecture for multibeam mask writers

NASA Astrophysics Data System (ADS)

Chaudhary, Narendra; Savari, Serap A.

2017-10-01

Multibeam electron beam systems will be used in the future for mask writing and for complementary lithography. The major challenges of the multibeam systems are in meeting throughput requirements and in handling the large data volumes associated with writing grayscale data on the wafer. In terms of future communications and computational requirements, Amdahl's law suggests that a simple increase of computation power and parallelism may not be a sustainable solution. We propose a parallel data compression algorithm to exploit the sparsity of mask data and a grayscale video-like representation of data. To improve the communication and computational efficiency of these systems at the write time, we propose an alternate datapath architecture partly motivated by multibeam direct-write lithography and partly motivated by the circuit testing literature, where parallel decompression reduces clock cycles. We explain a deflection plate architecture inspired by NuFlare Technology's multibeam mask writing system and how our datapath architecture can be easily added to it to improve performance.
Mark 4A antenna control system data handling architecture study

NASA Technical Reports Server (NTRS)

Briggs, H. C.; Eldred, D. B.

1991-01-01

A high-level review was conducted to provide an analysis of the existing architecture used to handle data and implement control algorithms for NASA's Deep Space Network (DSN) antennas and to make system-level recommendations for improving this architecture so that the DSN antennas can support the ever-tightening requirements of the next decade and beyond. It was found that the existing system is seriously overloaded, with processor utilization approaching 100 percent. A number of factors contribute to this overloading, including dated hardware, inefficient software, and a message-passing strategy that depends on serial connections between machines. At the same time, the system has shortcomings and idiosyncrasies that require extensive human intervention. A custom operating system kernel and an obscure programming language exacerbate the problems and should be modernized. A new architecture is presented that addresses these and other issues. Key features of the new architecture include a simplified message passing hierarchy that utilizes a high-speed local area network, redesign of particular processing function algorithms, consolidation of functions, and implementation of the architecture in modern hardware and software using mainstream computer languages and operating systems. The system would also allow incremental hardware improvements as better and faster hardware for such systems becomes available, and costs could potentially be low enough that redundancy would be provided economically. Such a system could support DSN requirements for the foreseeable future, though thorough consideration must be given to hard computational requirements, porting existing software functionality to the new system, and issues of fault tolerance and recovery.
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, Fred A.; Morel, Michael R.

1989-01-01

A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.

An 81.6 μW FastICA processor for epileptic seizure detection.

PubMed

Yang, Chia-Hsiang; Shih, Yi-Hsin; Chiueh, Herming

2015-02-01

To improve the performance of epileptic seizure detection, independent component analysis (ICA) is applied to multi-channel signals to separate artifacts and signals of interest. FastICA is an efficient algorithm to compute ICA. To reduce the energy dissipation, eigenvalue decomposition (EVD) is utilized in the preprocessing stage to reduce the convergence time of iterative calculation of ICA components. EVD is computed efficiently through an array structure of processing elements running in parallel. Area-efficient EVD architecture is realized by leveraging the approximate Jacobi algorithm, leading to a 77.2% area reduction. By choosing proper memory element and reduced wordlength, the power and area of storage memory are reduced by 95.6% and 51.7%, respectively. The chip area is minimized through fixed-point implementation and architectural transformations. Given a latency constraint of 0.1 s, an 86.5% area reduction is achieved compared to the direct-mapped architecture. Fabricated in 90 nm CMOS, the core area of the chip is 0.40 mm(2). The FastICA processor, part of an integrated epileptic control SoC, dissipates 81.6 μW at 0.32 V. The computation delay of a frame of 256 samples for 8 channels is 84.2 ms. Compared to prior work, 0.5% power dissipation, 26.7% silicon area, and 3.4 × computation speedup are achieved. The performance of the chip was verified by human dataset.
Performance analysis of parallel branch and bound search with the hypercube architecture

NASA Technical Reports Server (NTRS)

Mraz, Richard T.

1987-01-01

With the availability of commercial parallel computers, researchers are examining new classes of problems which might benefit from parallel computing. This paper presents results of an investigation of the class of search intensive problems. The specific problem discussed is the Least-Cost Branch and Bound search method of deadline job scheduling. The object-oriented design methodology was used to map the problem into a parallel solution. While the initial design was good for a prototype, the best performance resulted from fine-tuning the algorithm for a specific computer. The experiments analyze the computation time, the speed up over a VAX 11/785, and the load balance of the problem when using loosely coupled multiprocessor system based on the hypercube architecture.
Job Superscheduler Architecture and Performance in Computational Grid Environments

NASA Technical Reports Server (NTRS)

Shan, Hongzhang; Oliker, Leonid; Biswas, Rupak

2003-01-01

Computational grids hold great promise in utilizing geographically separated heterogeneous resources to solve large-scale complex scientific problems. However, a number of major technical hurdles, including distributed resource management and effective job scheduling, stand in the way of realizing these gains. In this paper, we propose a novel grid superscheduler architecture and three distributed job migration algorithms. We also model the critical interaction between the superscheduler and autonomous local schedulers. Extensive performance comparisons with ideal, central, and local schemes using real workloads from leading computational centers are conducted in a simulation environment. Additionally, synthetic workloads are used to perform a detailed sensitivity analysis of our superscheduler. Several key metrics demonstrate that substantial performance gains can be achieved via smart superscheduling in distributed computational grids.
Algorithms and Architectures for Elastic-Wave Inversion Final Report CRADA No. TC02144.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Larsen, S.; Lindtjorn, O.

2017-08-15

This was a collaborative effort between Lawrence Livermore National Security, LLC as manager and operator of Lawrence Livermore National Laboratory (LLNL) and Schlumberger Technology Corporation (STC), to perform a computational feasibility study that investigates hardware platforms and software algorithms applicable to STC for Reverse Time Migration (RTM) / Reverse Time Inversion (RTI) of 3-D seismic data.
BPF-type region-of-interest reconstruction for parallel translational computed tomography.

PubMed

Wu, Weiwen; Yu, Hengyong; Wang, Shaoyu; Liu, Fenglin

2017-01-01

The objective of this study is to present and test a new ultra-low-cost linear scan based tomography architecture. Similar to linear tomosynthesis, the source and detector are translated in opposite directions and the data acquisition system targets on a region-of-interest (ROI) to acquire data for image reconstruction. This kind of tomographic architecture was named parallel translational computed tomography (PTCT). In previous studies, filtered backprojection (FBP)-type algorithms were developed to reconstruct images from PTCT. However, the reconstructed ROI images from truncated projections have severe truncation artefact. In order to overcome this limitation, we in this study proposed two backprojection filtering (BPF)-type algorithms named MP-BPF and MZ-BPF to reconstruct ROI images from truncated PTCT data. A weight function is constructed to deal with data redundancy for multi-linear translations modes. Extensive numerical simulations are performed to evaluate the proposed MP-BPF and MZ-BPF algorithms for PTCT in fan-beam geometry. Qualitative and quantitative results demonstrate that the proposed BPF-type algorithms cannot only more accurately reconstruct ROI images from truncated projections but also generate high-quality images for the entire image support in some circumstances.
Reliability Evaluation of Computer Systems.

DTIC Science & Technology

1981-01-01

algorithms in hardware is not restricted by the designs of particular circuits. Applications could be made in new computer architectures; one candidate...pp. 137-148, IEEE, Chicago, Illinois, September 1963. (With J.F. Wakerly ) "Design of Low-Cost General-Purpose Self- Diagnosing Computers," Proc...34 Proc., IEEE Int’l Solid-State Circuits Conference, Philadelphia, Pennsylvania, February 16-18, 1977. (With J.F. Wakerly ) "Microcomputers in the
Descriptive and Computer Aided Drawing Perspective on an Unfolded Polyhedral Projection Surface

NASA Astrophysics Data System (ADS)

Dzwierzynska, Jolanta

2017-10-01

The aim of the herby study is to develop a method of direct and practical mapping of perspective on an unfolded prism polyhedral projection surface. The considered perspective representation is a rectilinear central projection onto a surface composed of several flat elements. In the paper two descriptive methods of drawing perspective are presented: direct and indirect. The graphical mapping of the effects of the representation is realized directly on the unfolded flat projection surface. That is due to the projective and graphical connection between points displayed on the polyhedral background and their counterparts received on the unfolded flat surface. For a significant improvement of the construction of line, analytical algorithms are formulated. They draw a perspective image of a segment of line passing through two different points determined by their coordinates in a spatial coordinate system of axis x, y, z. Compared to other perspective construction methods that use information about points, for computer vision and the computer aided design, our algorithms utilize data about lines, which are applied very often in architectural forms. Possibility of drawing lines in the considered perspective enables drawing an edge perspective image of an architectural object. The application of the changeable base elements of perspective as a horizon height and a station point location enable drawing perspective image from different viewing positions. The analytical algorithms for drawing perspective images are formulated in Mathcad software, however, they can be implemented in the majority of computer graphical packages, which can make drawing perspective more efficient and easier. The representation presented in the paper and the way of its direct mapping on the flat unfolded projection surface can find application in presentation of architectural space in advertisement and art.
A multiarchitecture parallel-processing development environment

NASA Technical Reports Server (NTRS)

Townsend, Scott; Blech, Richard; Cole, Gary

1993-01-01

A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.
A software methodology for compiling quantum programs

NASA Astrophysics Data System (ADS)

Häner, Thomas; Steiger, Damian S.; Svore, Krysta; Troyer, Matthias

2018-04-01

Quantum computers promise to transform our notions of computation by offering a completely new paradigm. To achieve scalable quantum computation, optimizing compilers and a corresponding software design flow will be essential. We present a software architecture for compiling quantum programs from a high-level language program to hardware-specific instructions. We describe the necessary layers of abstraction and their differences and similarities to classical layers of a computer-aided design flow. For each layer of the stack, we discuss the underlying methods for compilation and optimization. Our software methodology facilitates more rapid innovation among quantum algorithm designers, quantum hardware engineers, and experimentalists. It enables scalable compilation of complex quantum algorithms and can be targeted to any specific quantum hardware implementation.
Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1.

PubMed

Ramacher, U; Raab, W; Anlauf, J; Hachmann, U; Beichter, J; Brüls, N; Wesseling, M; Sicheneder, E; Männer, R; Glass, J

1993-12-01

A general purpose neurocomputer, SYNAPSE-1, which exhibits a multiprocessor and memory architecture is presented. It offers wide flexibility with respect to neural algorithms and a speed-up factor of several orders of magnitude--including learning. The computational power is provided by a 2-dimensional systolic array of neural signal processors. Since the weights are stored outside these NSPs, memory size and processing power can be adapted individually to the application needs. A neural algorithms programming language, embedded in C(+2) has been defined for the user to cope with the neurocomputer. In a benchmark test, the prototype of SYNAPSE-1 was 8000 times as fast as a standard workstation.
GPU implementation of prior image constrained compressed sensing (PICCS)

NASA Astrophysics Data System (ADS)

Nett, Brian E.; Tang, Jie; Chen, Guang-Hong

2010-04-01

The Prior Image Constrained Compressed Sensing (PICCS) algorithm (Med. Phys. 35, pg. 660, 2008) has been applied to several computed tomography applications with both standard CT systems and flat-panel based systems designed for guiding interventional procedures and radiation therapy treatment delivery. The PICCS algorithm typically utilizes a prior image which is reconstructed via the standard Filtered Backprojection (FBP) reconstruction algorithm. The algorithm then iteratively solves for the image volume that matches the measured data, while simultaneously assuring the image is similar to the prior image. The PICCS algorithm has demonstrated utility in several applications including: improved temporal resolution reconstruction, 4D respiratory phase specific reconstructions for radiation therapy, and cardiac reconstruction from data acquired on an interventional C-arm. One disadvantage of the PICCS algorithm, just as other iterative algorithms, is the long computation times typically associated with reconstruction. In order for an algorithm to gain clinical acceptance reconstruction must be achievable in minutes rather than hours. In this work the PICCS algorithm has been implemented on the GPU in order to significantly reduce the reconstruction time of the PICCS algorithm. The Compute Unified Device Architecture (CUDA) was used in this implementation.
Scheduling time-critical graphics on multiple processors

NASA Technical Reports Server (NTRS)

Meyer, Tom W.; Hughes, John F.

1995-01-01

This paper describes an algorithm for the scheduling of time-critical rendering and computation tasks on single- and multiple-processor architectures, with minimal pipelining. It was developed to manage scientific visualization scenes consisting of hundreds of objects, each of which can be computed and displayed at thousands of possible resolution levels. The algorithm generates the time-critical schedule using progressive-refinement techniques; it always returns a feasible schedule and, when allowed to run to completion, produces a near-optimal schedule which takes advantage of almost the entire multiple-processor system.
Research on conceptual/innovative design for the life cycle

NASA Technical Reports Server (NTRS)

Cagan, Jonathan; Agogino, Alice M.

1990-01-01

The goal of this research is developing and integrating qualitative and quantitative methods for life cycle design. The definition of the problem includes formal computer-based methods limited to final detailing stages of design; CAD data bases do not capture design intent or design history; and life cycle issues were ignored during early stages of design. Viewgraphs outline research in conceptual design; the SYMON (SYmbolic MONotonicity analyzer) algorithm; multistart vector quantization optimization algorithm; intelligent manufacturing: IDES - Influence Diagram Architecture; and 1st PRINCE (FIRST PRINciple Computational Evaluator).
EOS MLS Science Data Processing System: A Description of Architecture and Capabilities

NASA Technical Reports Server (NTRS)

Cuddy, David T.; Echeverri, Mark D.; Wagner, Paul A.; Hanzel, Audrey T.; Fuller, Ryan A.

2006-01-01

This paper describes the architecture and capabilities of the Science Data Processing System (SDPS) for the EOS MLS. The SDPS consists of two major components--the Science Computing Facility and the Science Investigator-led Processing System. The Science Computing Facility provides the facilities for the EOS MLS Science Team to perform the functions of scientific algorithm development, processing software development, quality control of data products, and scientific analyses. The Science Investigator-led Processing System processes and reprocesses the science data for the entire mission and delivers the data products to the Science Computing Facility and to the Goddard Space Flight Center Earth Science Distributed Active Archive Center, which archives and distributes the standard science products.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS

PubMed Central

Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.

2014-01-01

Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
A survey of an introduction to fault diagnosis algorithms

NASA Technical Reports Server (NTRS)

Mathur, F. P.

1972-01-01

This report surveys the field of diagnosis and introduces some of the key algorithms and heuristics currently in use. Fault diagnosis is an important and a rapidly growing discipline. This is important in the design of self-repairable computers because the present diagnosis resolution of its fault-tolerant computer is limited to a functional unit or processor. Better resolution is necessary before failed units can become partially reuseable. The approach that holds the greatest promise is that of resident microdiagnostics; however, that presupposes a microprogrammable architecture for the computer being self-diagnosed. The presentation is tutorial and contains examples. An extensive bibliography of some 220 entries is included.
A low-power and high-quality implementation of the discrete cosine transformation

NASA Astrophysics Data System (ADS)

Heyne, B.; Götze, J.

2007-06-01

In this paper a computationally efficient and high-quality preserving DCT architecture is presented. It is obtained by optimizing the Loeffler DCT based on the Cordic algorithm. The computational complexity is reduced from 11 multiply and 29 add operations (Loeffler DCT) to 38 add and 16 shift operations (which is similar to the complexity of the binDCT). The experimental results show that the proposed DCT algorithm not only reduces the computational complexity significantly, but also retains the good transformation quality of the Loeffler DCT. Therefore, the proposed Cordic based Loeffler DCT is especially suited for low-power and high-quality CODECs in battery-based systems.
Workflow of the Grover algorithm simulation incorporating CUDA and GPGPU

NASA Astrophysics Data System (ADS)

Lu, Xiangwen; Yuan, Jiabin; Zhang, Weiwei

2013-09-01

The Grover quantum search algorithm, one of only a few representative quantum algorithms, can speed up many classical algorithms that use search heuristics. No true quantum computer has yet been developed. For the present, simulation is one effective means of verifying the search algorithm. In this work, we focus on the simulation workflow using a compute unified device architecture (CUDA). Two simulation workflow schemes are proposed. These schemes combine the characteristics of the Grover algorithm and the parallelism of general-purpose computing on graphics processing units (GPGPU). We also analyzed the optimization of memory space and memory access from this perspective. We implemented four programs on CUDA to evaluate the performance of schemes and optimization. Through experimentation, we analyzed the organization of threads suited to Grover algorithm simulations, compared the storage costs of the four programs, and validated the effectiveness of optimization. Experimental results also showed that the distinguished program on CUDA outperformed the serial program of libquantum on a CPU with a speedup of up to 23 times (12 times on average), depending on the scale of the simulation.
Protein alignment algorithms with an efficient backtracking routine on multiple GPUs.

PubMed

Blazewicz, Jacek; Frohmberg, Wojciech; Kierzynka, Michal; Pesch, Erwin; Wojciechowski, Pawel

2011-05-20

Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the nearest future. To overcome this challenge several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a GPU platform but in most cases address the problem of sequence database scanning and computing only the alignment score whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal Needleman-Wunsch, and Smith-Waterman algorithms with a backtracking procedure which is needed to construct the alignment. In this paper we present the solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods, as well as for DNA recognition at the DNA assembly stage. Performed tests show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU and GPU-based solutions. Moreover, multiple GPUs support with load balancing makes the application very scalable. The article shows that the backtracking procedure of the sequence alignment algorithms may be designed to fit in with the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. Performed tests show that the efficiency of the implementation is excellent. Moreover, the speed of our GPU-based algorithms can be almost linearly increased when using more than one graphics card.
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob

2003-01-01

The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the perfor- mance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.

A Standard Platform for Testing and Comparison of MDAO Architectures

NASA Technical Reports Server (NTRS)

Gray, Justin S.; Moore, Kenneth T.; Hearn, Tristan A.; Naylor, Bret A.

2012-01-01

The Multidisciplinary Design Analysis and Optimization (MDAO) community has developed a multitude of algorithms and techniques, called architectures, for performing optimizations on complex engineering systems which involve coupling between multiple discipline analyses. These architectures seek to efficiently handle optimizations with computationally expensive analyses including multiple disciplines. We propose a new testing procedure that can provide a quantitative and qualitative means of comparison among architectures. The proposed test procedure is implemented within the open source framework, OpenMDAO, and comparative results are presented for five well-known architectures: MDF, IDF, CO, BLISS, and BLISS-2000. We also demonstrate how using open source soft- ware development methods can allow the MDAO community to submit new problems and architectures to keep the test suite relevant.
Efficient self-organizing multilayer neural network for nonlinear system modeling.

PubMed

Han, Hong-Gui; Wang, Li-Dan; Qiao, Jun-Fei

2013-07-01

It has been shown extensively that the dynamic behaviors of a neural system are strongly influenced by the network architecture and learning process. To establish an artificial neural network (ANN) with self-organizing architecture and suitable learning algorithm for nonlinear system modeling, an automatic axon-neural network (AANN) is investigated in the following respects. First, the network architecture is constructed automatically to change both the number of hidden neurons and topologies of the neural network during the training process. The approach introduced in adaptive connecting-and-pruning algorithm (ACP) is a type of mixed mode operation, which is equivalent to pruning or adding the connecting of the neurons, as well as inserting some required neurons directly. Secondly, the weights are adjusted, using a feedforward computation (FC) to obtain the information for the gradient during learning computation. Unlike most of the previous studies, AANN is able to self-organize the architecture and weights, and to improve the network performances. Also, the proposed AANN has been tested on a number of benchmark problems, ranging from nonlinear function approximating to nonlinear systems modeling. The experimental results show that AANN can have better performances than that of some existing neural networks. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
Big data mining analysis method based on cloud computing

NASA Astrophysics Data System (ADS)

Cai, Qing Qiu; Cui, Hong Gang; Tang, Hao

2017-08-01

Information explosion era, large data super-large, discrete and non-(semi) structured features have gone far beyond the traditional data management can carry the scope of the way. With the arrival of the cloud computing era, cloud computing provides a new technical way to analyze the massive data mining, which can effectively solve the problem that the traditional data mining method cannot adapt to massive data mining. This paper introduces the meaning and characteristics of cloud computing, analyzes the advantages of using cloud computing technology to realize data mining, designs the mining algorithm of association rules based on MapReduce parallel processing architecture, and carries out the experimental verification. The algorithm of parallel association rule mining based on cloud computing platform can greatly improve the execution speed of data mining.
Bit-parallel arithmetic in a massively-parallel associative processor

NASA Technical Reports Server (NTRS)

Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

1992-01-01

A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m exp 2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
PEM-PCA: a parallel expectation-maximization PCA face recognition architecture.

PubMed

Rujirakul, Kanokmon; So-In, Chakchai; Arnonkijpanich, Banchar

2014-01-01

Principal component analysis or PCA has been traditionally used as one of the feature extraction techniques in face recognition systems yielding high accuracy when requiring a small number of features. However, the covariance matrix and eigenvalue decomposition stages cause high computational complexity, especially for a large database. Thus, this research presents an alternative approach utilizing an Expectation-Maximization algorithm to reduce the determinant matrix manipulation resulting in the reduction of the stages' complexity. To improve the computational time, a novel parallel architecture was employed to utilize the benefits of parallelization of matrix computation during feature extraction and classification stages including parallel preprocessing, and their combinations, so-called a Parallel Expectation-Maximization PCA architecture. Comparing to a traditional PCA and its derivatives, the results indicate lower complexity with an insignificant difference in recognition precision leading to high speed face recognition systems, that is, the speed-up over nine and three times over PCA and Parallel PCA.
Essential issues in multiprocessor systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gajski, D.D.; Peir, J.K.

1985-06-01

During the past several years, a great number of proposals have been made with the objective to increase supercomputer performance by an order of magnitude on the basis of a utilization of new computer architectures. The present paper is concerned with a suitable classification scheme for comparing these architectures. It is pointed out that there are basically four schools of thought as to the most important factor for an enhancement of computer performance. According to one school, the development of faster circuits will make it possible to retain present architectures, except, possibly, for a mechanism providing synchronization of parallel processes.more » A second school assigns priority to the optimization and vectorization of compilers, which will detect parallelism and help users to write better parallel programs. A third school believes in the predominant importance of new parallel algorithms, while the fourth school supports new models of computation. The merits of the four approaches are critically evaluated. 50 references.« less
Highly parallel sparse Cholesky factorization

NASA Technical Reports Server (NTRS)

Gilbert, John R.; Schreiber, Robert

1990-01-01

Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.
Implementing Scientific Simulation Codes Highly Tailored for Vector Architectures Using Custom Configurable Computing Machines

NASA Technical Reports Server (NTRS)

Rutishauser, David

2006-01-01

The motivation for this work comes from an observation that amidst the push for Massively Parallel (MP) solutions to high-end computing problems such as numerical physical simulations, large amounts of legacy code exist that are highly optimized for vector supercomputers. Because re-hosting legacy code often requires a complete re-write of the original code, which can be a very long and expensive effort, this work examines the potential to exploit reconfigurable computing machines in place of a vector supercomputer to implement an essentially unmodified legacy source code. Custom and reconfigurable computing resources could be used to emulate an original application's target platform to the extent required to achieve high performance. To arrive at an architecture that delivers the desired performance subject to limited resources involves solving a multi-variable optimization problem with constraints. Prior research in the area of reconfigurable computing has demonstrated that designing an optimum hardware implementation of a given application under hardware resource constraints is an NP-complete problem. The premise of the approach is that the general issue of applying reconfigurable computing resources to the implementation of an application, maximizing the performance of the computation subject to physical resource constraints, can be made a tractable problem by assuming a computational paradigm, such as vector processing. This research contributes a formulation of the problem and a methodology to design a reconfigurable vector processing implementation of a given application that satisfies a performance metric. A generic, parametric, architectural framework for vector processing implemented in reconfigurable logic is developed as a target for a scheduling/mapping algorithm that maps an input computation to a given instance of the architecture. This algorithm is integrated with an optimization framework to arrive at a specification of the architecture parameters that attempts to minimize execution time, while staying within resource constraints. The flexibility of using a custom reconfigurable implementation is exploited in a unique manner to leverage the lessons learned in vector supercomputer development. The vector processing framework is tailored to the application, with variable parameters that are fixed in traditional vector processing. Benchmark data that demonstrates the functionality and utility of the approach is presented. The benchmark data includes an identified bottleneck in a real case study example vector code, the NASA Langley Terminal Area Simulation System (TASS) application.
Traffic engineering and regenerator placement in GMPLS networks with restoration

NASA Astrophysics Data System (ADS)

Yetginer, Emre; Karasan, Ezhan

2002-07-01

In this paper we study regenerator placement and traffic engineering of restorable paths in Generalized Multipro-tocol Label Switching (GMPLS) networks. Regenerators are necessary in optical networks due to transmission impairments. We study a network architecture where there are regenerators at selected nodes and we propose two heuristic algorithms for the regenerator placement problem. Performances of these algorithms in terms of required number of regenerators and computational complexity are evaluated. In this network architecture with sparse regeneration, offline computation of working and restoration paths is studied with bandwidth reservation and path rerouting as the restoration scheme. We study two approaches for selecting working and restoration paths from a set of candidate paths and formulate each method as an Integer Linear Programming (ILP) prob-lem. Traffic uncertainty model is developed in order to compare these methods based on their robustness with respect to changing traffic patterns. Traffic engineering methods are compared based on number of additional demands due to traffic uncertainty that can be carried. Regenerator placement algorithms are also evaluated from a traffic engineering point of view.
Rapid Calculation of Max-Min Fair Rates for Multi-Commodity Flows in Fat-Tree Networks

DOE PAGES

Mollah, Md Atiqul; Yuan, Xin; Pakin, Scott; ...

2017-08-29

Max-min fairness is often used in the performance modeling of interconnection networks. Existing methods to compute max-min fair rates for multi-commodity flows have high complexity and are computationally infeasible for large networks. In this paper, we show that by considering topological features, this problem can be solved efficiently for the fat-tree topology that is widely used in data centers and high performance compute clusters. Several efficient new algorithms are developed for this problem, including a parallel algorithm that can take advantage of multi-core and shared-memory architectures. Using these algorithms, we demonstrate that it is possible to find the max-min fairmore » rate allocation for multi-commodity flows in fat-tree networks that support tens of thousands of nodes. We evaluate the run-time performance of the proposed algorithms and show improvement in orders of magnitude over the previously best known method. Finally, we further demonstrate a new application of max-min fair rate allocation that is only computationally feasible using our new algorithms.« less
Rapid Calculation of Max-Min Fair Rates for Multi-Commodity Flows in Fat-Tree Networks

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mollah, Md Atiqul; Yuan, Xin; Pakin, Scott

Max-min fairness is often used in the performance modeling of interconnection networks. Existing methods to compute max-min fair rates for multi-commodity flows have high complexity and are computationally infeasible for large networks. In this paper, we show that by considering topological features, this problem can be solved efficiently for the fat-tree topology that is widely used in data centers and high performance compute clusters. Several efficient new algorithms are developed for this problem, including a parallel algorithm that can take advantage of multi-core and shared-memory architectures. Using these algorithms, we demonstrate that it is possible to find the max-min fairmore » rate allocation for multi-commodity flows in fat-tree networks that support tens of thousands of nodes. We evaluate the run-time performance of the proposed algorithms and show improvement in orders of magnitude over the previously best known method. Finally, we further demonstrate a new application of max-min fair rate allocation that is only computationally feasible using our new algorithms.« less
Real-time traffic sign detection and recognition

NASA Astrophysics Data System (ADS)

Herbschleb, Ernst; de With, Peter H. N.

2009-01-01

The continuous growth of imaging databases increasingly requires analysis tools for extraction of features. In this paper, a new architecture for the detection of traffic signs is proposed. The architecture is designed to process a large database with tens of millions of images with a resolution up to 4,800x2,400 pixels. Because of the size of the database, a high reliability as well as a high throughput is required. The novel architecture consists of a three-stage algorithm with multiple steps per stage, combining both color and specific spatial information. The first stage contains an area-limitation step which is performance critical in both the detection rate as the overall processing time. The second stage locates suggestions for traffic signs using recently published feature processing. The third stage contains a validation step to enhance reliability of the algorithm. During this stage, the traffic signs are recognized. Experiments show a convincing detection rate of 99%. With respect to computational speed, the throughput for line-of-sight images of 800×600 pixels is 35 Hz and for panorama images it is 4 Hz. Our novel architecture outperforms existing algorithms, with respect to both detection rate and throughput
A Metascalable Computing Framework for Large Spatiotemporal-Scale Atomistic Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nomura, K; Seymour, R; Wang, W

2009-02-17

A metascalable (or 'design once, scale on new architectures') parallel computing framework has been developed for large spatiotemporal-scale atomistic simulations of materials based on spatiotemporal data locality principles, which is expected to scale on emerging multipetaflops architectures. The framework consists of: (1) an embedded divide-and-conquer (EDC) algorithmic framework based on spatial locality to design linear-scaling algorithms for high complexity problems; (2) a space-time-ensemble parallel (STEP) approach based on temporal locality to predict long-time dynamics, while introducing multiple parallelization axes; and (3) a tunable hierarchical cellular decomposition (HCD) parallelization framework to map these O(N) algorithms onto a multicore cluster based onmore » hybrid implementation combining message passing and critical section-free multithreading. The EDC-STEP-HCD framework exposes maximal concurrency and data locality, thereby achieving: (1) inter-node parallel efficiency well over 0.95 for 218 billion-atom molecular-dynamics and 1.68 trillion electronic-degrees-of-freedom quantum-mechanical simulations on 212,992 IBM BlueGene/L processors (superscalability); (2) high intra-node, multithreading parallel efficiency (nanoscalability); and (3) nearly perfect time/ensemble parallel efficiency (eon-scalability). The spatiotemporal scale covered by MD simulation on a sustained petaflops computer per day (i.e. petaflops {center_dot} day of computing) is estimated as NT = 2.14 (e.g. N = 2.14 million atoms for T = 1 microseconds).« less
Rapid architecture alternative modeling (RAAM): A framework for capability-based analysis of system of systems architectures

NASA Astrophysics Data System (ADS)

Iacobucci, Joseph V.

The research objective for this manuscript is to develop a Rapid Architecture Alternative Modeling (RAAM) methodology to enable traceable Pre-Milestone A decision making during the conceptual phase of design of a system of systems. Rather than following current trends that place an emphasis on adding more analysis which tends to increase the complexity of the decision making problem, RAAM improves on current methods by reducing both runtime and model creation complexity. RAAM draws upon principles from computer science, system architecting, and domain specific languages to enable the automatic generation and evaluation of architecture alternatives. For example, both mission dependent and mission independent metrics are considered. Mission dependent metrics are determined by the performance of systems accomplishing a task, such as Probability of Success. In contrast, mission independent metrics, such as acquisition cost, are solely determined and influenced by the other systems in the portfolio. RAAM also leverages advances in parallel computing to significantly reduce runtime by defining executable models that are readily amendable to parallelization. This allows the use of cloud computing infrastructures such as Amazon's Elastic Compute Cloud and the PASTEC cluster operated by the Georgia Institute of Technology Research Institute (GTRI). Also, the amount of data that can be generated when fully exploring the design space can quickly exceed the typical capacity of computational resources at the analyst's disposal. To counter this, specific algorithms and techniques are employed. Streaming algorithms and recursive architecture alternative evaluation algorithms are used that reduce computer memory requirements. Lastly, a domain specific language is created to provide a reduction in the computational time of executing the system of systems models. A domain specific language is a small, usually declarative language that offers expressive power focused on a particular problem domain by establishing an effective means to communicate the semantics from the RAAM framework. These techniques make it possible to include diverse multi-metric models within the RAAM framework in addition to system and operational level trades. A canonical example was used to explore the uses of the methodology. The canonical example contains all of the features of a full system of systems architecture analysis study but uses fewer tasks and systems. Using RAAM with the canonical example it was possible to consider both system and operational level trades in the same analysis. Once the methodology had been tested with the canonical example, a Suppression of Enemy Air Defenses (SEAD) capability model was developed. Due to the sensitive nature of analyses on that subject, notional data was developed. The notional data has similar trends and properties to realistic Suppression of Enemy Air Defenses data. RAAM was shown to be traceable and provided a mechanism for a unified treatment of a variety of metrics. The SEAD capability model demonstrated lower computer runtimes and reduced model creation complexity as compared to methods currently in use. To determine the usefulness of the implementation of the methodology on current computing hardware, RAAM was tested with system of system architecture studies of different sizes. This was necessary since system of systems may be called upon to accomplish thousands of tasks. It has been clearly demonstrated that RAAM is able to enumerate and evaluate the types of large, complex design spaces usually encountered in capability based design, oftentimes providing the ability to efficiently search the entire decision space. The core algorithms for generation and evaluation of alternatives scale linearly with expected problem sizes. The SEAD capability model outputs prompted the discovery a new issue, the data storage and manipulation requirements for an analysis. Two strategies were developed to counter large data sizes, the use of portfolio views and top 'n' analysis. This proved the usefulness of the RAAM framework and methodology during Pre-Milestone A capability based analysis. (Abstract shortened by UMI.).
Parallel architectures for iterative methods on adaptive, block structured grids

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1983-01-01

A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one to one mapping of grids to systolic style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be a regular global structure to the grids constructed, there will be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.
Efficient Hardware Implementation of the Horn-Schunck Algorithm for High-Resolution Real-Time Dense Optical Flow Sensor

PubMed Central

Komorkiewicz, Mateusz; Kryjak, Tomasz; Gorgon, Marek

2014-01-01

This article presents an efficient hardware implementation of the Horn-Schunck algorithm that can be used in an embedded optical flow sensor. An architecture is proposed, that realises the iterative Horn-Schunck algorithm in a pipelined manner. This modification allows to achieve data throughput of 175 MPixels/s and makes processing of Full HD video stream (1, 920 × 1, 080 @ 60 fps) possible. The structure of the optical flow module as well as pre- and post-filtering blocks and a flow reliability computation unit is described in details. Three versions of optical flow modules, with different numerical precision, working frequency and obtained results accuracy are proposed. The errors caused by switching from floating- to fixed-point computations are also evaluated. The described architecture was tested on popular sequences from an optical flow dataset of the Middlebury University. It achieves state-of-the-art results among hardware implementations of single scale methods. The designed fixed-point architecture achieves performance of 418 GOPS with power efficiency of 34 GOPS/W. The proposed floating-point module achieves 103 GFLOPS, with power efficiency of 24 GFLOPS/W. Moreover, a 100 times speedup compared to a modern CPU with SIMD support is reported. A complete, working vision system realized on Xilinx VC707 evaluation board is also presented. It is able to compute optical flow for Full HD video stream received from an HDMI camera in real-time. The obtained results prove that FPGA devices are an ideal platform for embedded vision systems. PMID:24526303
Efficient hardware implementation of the Horn-Schunck algorithm for high-resolution real-time dense optical flow sensor.

PubMed

Komorkiewicz, Mateusz; Kryjak, Tomasz; Gorgon, Marek

2014-02-12

This article presents an efficient hardware implementation of the Horn-Schunck algorithm that can be used in an embedded optical flow sensor. An architecture is proposed, that realises the iterative Horn-Schunck algorithm in a pipelined manner. This modification allows to achieve data throughput of 175 MPixels/s and makes processing of Full HD video stream (1; 920 × 1; 080 @ 60 fps) possible. The structure of the optical flow module as well as pre- and post-filtering blocks and a flow reliability computation unit is described in details. Three versions of optical flow modules, with different numerical precision, working frequency and obtained results accuracy are proposed. The errors caused by switching from floating- to fixed-point computations are also evaluated. The described architecture was tested on popular sequences from an optical flow dataset of the Middlebury University. It achieves state-of-the-art results among hardware implementations of single scale methods. The designed fixed-point architecture achieves performance of 418 GOPS with power efficiency of 34 GOPS/W. The proposed floating-point module achieves 103 GFLOPS, with power efficiency of 24 GFLOPS/W. Moreover, a 100 times speedup compared to a modern CPU with SIMD support is reported. A complete, working vision system realized on Xilinx VC707 evaluation board is also presented. It is able to compute optical flow for Full HD video stream received from an HDMI camera in real-time. The obtained results prove that FPGA devices are an ideal platform for embedded vision systems.
An Evolutionary Algorithm to Generate Ellipsoid Detectors for Negative Selection

DTIC Science & Technology

2005-03-21

of Congress on Evolutionary Computation. Honolulu,. 58. Lamont, Gary B., Robert E. Marmelstein, and David A. Van Veldhuizen . A Distributed Architecture...antibody and an antigen is a function of several processes including electrostatic interactions, hydrogen bonding, van der Waals interaction, and others [20...Kelly, Patrick M., Don R. Hush, and James M. White. “An Adaptive Algorithm for Modifying Hyperellipsoidal Decision Surfaces”. Journal of Artificial
Faster than Real-Time Dynamic Simulation for Large-Size Power System with Detailed Dynamic Models using High-Performance Computing Platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huang, Renke; Jin, Shuangshuang; Chen, Yousu

This paper presents a faster-than-real-time dynamic simulation software package that is designed for large-size power system dynamic simulation. It was developed on the GridPACKTM high-performance computing (HPC) framework. The key features of the developed software package include (1) faster-than-real-time dynamic simulation for a WECC system (17,000 buses) with different types of detailed generator, controller, and relay dynamic models, (2) a decoupled parallel dynamic simulation algorithm with optimized computation architecture to better leverage HPC resources and technologies, (3) options for HPC-based linear and iterative solvers, (4) hidden HPC details, such as data communication and distribution, to enable development centered on mathematicalmore » models and algorithms rather than on computational details for power system researchers, and (5) easy integration of new dynamic models and related algorithms into the software package.« less
Parallel/distributed direct method for solving linear systems

NASA Technical Reports Server (NTRS)

Lin, Avi

1990-01-01

A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate paralleled algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmical changes.

Extreme-scale Algorithms and Solver Resilience

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dongarra, Jack

A widening gap exists between the peak performance of high-performance computers and the performance achieved by complex applications running on these platforms. Over the next decade, extreme-scale systems will present major new challenges to algorithm development that could amplify this mismatch in such a way that it prevents the productive use of future DOE Leadership computers due to the following; Extreme levels of parallelism due to multicore processors; An increase in system fault rates requiring algorithms to be resilient beyond just checkpoint/restart; Complex memory hierarchies and costly data movement in both energy and performance; Heterogeneous system architectures (mixing CPUs, GPUs,more » etc.); and Conflicting goals of performance, resilience, and power requirements.« less
Gauss Elimination: Workhorse of Linear Algebra.

DTIC Science & Technology

1995-08-05

linear algebra computation for solving systems, computing determinants and determining the rank of matrix. All of these are discussed in varying contexts. These include different arithmetic or algebraic setting such as integer arithmetic or polynomial rings as well as conventional real (floating-point) arithmetic. These have effects on both accuracy and complexity analyses of the algorithm. These, too, are covered here. The impact of modern parallel computer architecture on GE is also
Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

PubMed Central

Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.

2012-01-01

In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

PubMed

Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G

2011-07-01

In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

DOE PAGES

Dongarra, Jack; Gates, Mark; Haidar, Azzam; ...

2015-01-01

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library, that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA.more » High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.« less
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems

NASA Technical Reports Server (NTRS)

Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.

2000-01-01

The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
Automatic Blocking Of QR and LU Factorizations for Locality

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yi, Q; Kennedy, K; You, H

2004-03-26

QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provides manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, moremore » benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.« less
Empirical study of parallel LRU simulation algorithms

NASA Technical Reports Server (NTRS)

Carr, Eric; Nicol, David M.

1994-01-01

This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithm are more complex, but have costs that are independent on the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithm implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
A technique system for the measurement, reconstruction and character extraction of rice plant architecture

PubMed Central

Li, Xumeng; Wang, Xiaohui; Wei, Hailin; Zhu, Xinguang; Peng, Yulin; Li, Ming; Li, Tao; Huang, Huang

2017-01-01

This study developed a technique system for the measurement, reconstruction, and trait extraction of rice canopy architectures, which have challenged functional–structural plant modeling for decades and have become the foundation of the design of ideo-plant architectures. The system uses the location-separation-measurement method (LSMM) for the collection of data on the canopy architecture and the analytic geometry method for the reconstruction and visualization of the three-dimensional (3D) digital architecture of the rice plant. It also uses the virtual clipping method for extracting the key traits of the canopy architecture such as the leaf area, inclination, and azimuth distribution in spatial coordinates. To establish the technique system, we developed (i) simple tools to measure the spatial position of the stem axis and azimuth of the leaf midrib and to capture images of tillers and leaves; (ii) computer software programs for extracting data on stem diameter, leaf nodes, and leaf midrib curves from the tiller images and data on leaf length, width, and shape from the leaf images; (iii) a database of digital architectures that stores the measured data and facilitates the reconstruction of the 3D visual architecture and the extraction of architectural traits; and (iv) computation algorithms for virtual clipping to stratify the rice canopy, to extend the stratified surface from the horizontal plane to a general curved surface (including a cylindrical surface), and to implement in silico. Each component of the technique system was quantitatively validated and visually compared to images, and the sensitivity of the virtual clipping algorithms was analyzed. This technique is inexpensive and accurate and provides high throughput for the measurement, reconstruction, and trait extraction of rice canopy architectures. The technique provides a more practical method of data collection to serve functional–structural plant models of rice and for the optimization of rice canopy types. Moreover, the technique can be easily adapted for other cereal crops such as wheat, which has numerous stems and leaves sheltering each other. PMID:28558045
Distributed computing methodology for training neural networks in an image-guided diagnostic application.

PubMed

Plagianakos, V P; Magoulas, G D; Vrahatis, M N

2006-03-01

Distributed computing is a process through which a set of computers connected by a network is used collectively to solve a single problem. In this paper, we propose a distributed computing methodology for training neural networks for the detection of lesions in colonoscopy. Our approach is based on partitioning the training set across multiple processors using a parallel virtual machine. In this way, interconnected computers of varied architectures can be used for the distributed evaluation of the error function and gradient values, and, thus, training neural networks utilizing various learning methods. The proposed methodology has large granularity and low synchronization, and has been implemented and tested. Our results indicate that the parallel virtual machine implementation of the training algorithms developed leads to considerable speedup, especially when large network architectures and training sets are used.
FPGA-based coprocessor for matrix algorithms implementation

NASA Astrophysics Data System (ADS)

Amira, Abbes; Bensaali, Faycal

2003-03-01

Matrix algorithms are important in many types of applications including image and signal processing. These areas require enormous computing power. A close examination of the algorithms used in these, and related, applications reveals that many of the fundamental actions involve matrix operations such as matrix multiplication which is of O (N3) on a sequential computer and O (N3/p) on a parallel system with p processors complexity. This paper presents an investigation into the design and implementation of different matrix algorithms such as matrix operations, matrix transforms and matrix decompositions using an FPGA based environment. Solutions for the problem of processing large matrices have been proposed. The proposed system architectures are scalable, modular and require less area and time complexity with reduced latency when compared with existing structures.
Quantum Simulation of Tunneling in Small Systems

PubMed Central

Sornborger, Andrew T.

2012-01-01

A number of quantum algorithms have been performed on small quantum computers; these include Shor's prime factorization algorithm, error correction, Grover's search algorithm and a number of analog and digital quantum simulations. Because of the number of gates and qubits necessary, however, digital quantum particle simulations remain untested. A contributing factor to the system size required is the number of ancillary qubits needed to implement matrix exponentials of the potential operator. Here, we show that a set of tunneling problems may be investigated with no ancillary qubits and a cost of one single-qubit operator per time step for the potential evolution, eliminating at least half of the quantum gates required for the algorithm and more than that in the general case. Such simulations are within reach of current quantum computer architectures. PMID:22916333
Impact of new computing systems on computational mechanics and flight-vehicle structures technology

NASA Technical Reports Server (NTRS)

Noor, A. K.; Storaasli, O. O.; Fulton, R. E.

1984-01-01

Advances in computer technology which may have an impact on computational mechanics and flight vehicle structures technology were reviewed. The characteristics of supersystems, highly parallel systems, and small systems are summarized. The interrelations of numerical algorithms and software with parallel architectures are discussed. A scenario for future hardware/software environment and engineering analysis systems is presented. Research areas with potential for improving the effectiveness of analysis methods in the new environment are identified.
Parallel heterogeneous architectures for efficient OMP compressive sensing reconstruction

NASA Astrophysics Data System (ADS)

Kulkarni, Amey; Stanislaus, Jerome L.; Mohsenin, Tinoosh

2014-05-01

Compressive Sensing (CS) is a novel scheme, in which a signal that is sparse in a known transform domain can be reconstructed using fewer samples. The signal reconstruction techniques are computationally intensive and have sluggish performance, which make them impractical for real-time processing applications . The paper presents novel architectures for Orthogonal Matching Pursuit algorithm, one of the popular CS reconstruction algorithms. We show the implementation results of proposed architectures on FPGA, ASIC and on a custom many-core platform. For FPGA and ASIC implementation, a novel thresholding method is used to reduce the processing time for the optimization problem by at least 25%. Whereas, for the custom many-core platform, efficient parallelization techniques are applied, to reconstruct signals with variant signal lengths of N and sparsity of m. The algorithm is divided into three kernels. Each kernel is parallelized to reduce execution time, whereas efficient reuse of the matrix operators allows us to reduce area. Matrix operations are efficiently paralellized by taking advantage of blocked algorithms. For demonstration purpose, all architectures reconstruct a 256-length signal with maximum sparsity of 8 using 64 measurements. Implementation on Xilinx Virtex-5 FPGA, requires 27.14 μs to reconstruct the signal using basic OMP. Whereas, with thresholding method it requires 18 μs. ASIC implementation reconstructs the signal in 13 μs. However, our custom many-core, operating at 1.18 GHz, takes 18.28 μs to complete. Our results show that compared to the previous published work of the same algorithm and matrix size, proposed architectures for FPGA and ASIC implementations perform 1.3x and 1.8x respectively faster. Also, the proposed many-core implementation performs 3000x faster than the CPU and 2000x faster than the GPU.
Radio Synthesis Imaging - A High Performance Computing and Communications Project

NASA Astrophysics Data System (ADS)

Crutcher, Richard M.

The National Science Foundation has funded a five-year High Performance Computing and Communications project at the National Center for Supercomputing Applications (NCSA) for the direct implementation of several of the computing recommendations of the Astronomy and Astrophysics Survey Committee (the "Bahcall report"). This paper is a summary of the project goals and a progress report. The project will implement a prototype of the next generation of astronomical telescope systems - remotely located telescopes connected by high-speed networks to very high performance, scalable architecture computers and on-line data archives, which are accessed by astronomers over Gbit/sec networks. Specifically, a data link has been installed between the BIMA millimeter-wave synthesis array at Hat Creek, California and NCSA at Urbana, Illinois for real-time transmission of data to NCSA. Data are automatically archived, and may be browsed and retrieved by astronomers using the NCSA Mosaic software. In addition, an on-line digital library of processed images will be established. BIMA data will be processed on a very high performance distributed computing system, with I/O, user interface, and most of the software system running on the NCSA Convex C3880 supercomputer or Silicon Graphics Onyx workstations connected by HiPPI to the high performance, massively parallel Thinking Machines Corporation CM-5. The very computationally intensive algorithms for calibration and imaging of radio synthesis array observations will be optimized for the CM-5 and new algorithms which utilize the massively parallel architecture will be developed. Code running simultaneously on the distributed computers will communicate using the Data Transport Mechanism developed by NCSA. The project will also use the BLANCA Gbit/s testbed network between Urbana and Madison, Wisconsin to connect an Onyx workstation in the University of Wisconsin Astronomy Department to the NCSA CM-5, for development of long-distance distributed computing. Finally, the project is developing 2D and 3D visualization software as part of the international AIPS++ project. This research and development project is being carried out by a team of experts in radio astronomy, algorithm development for massively parallel architectures, high-speed networking, database management, and Thinking Machines Corporation personnel. The development of this complete software, distributed computing, and data archive and library solution to the radio astronomy computing problem will advance our expertise in high performance computing and communications technology and the application of these techniques to astronomical data processing.
Unit cell-based computer-aided manufacturing system for tissue engineering.

PubMed

Kang, Hyun-Wook; Park, Jeong Hun; Kang, Tae-Yun; Seol, Young-Joon; Cho, Dong-Woo

2012-03-01

Scaffolds play an important role in the regeneration of artificial tissues or organs. A scaffold is a porous structure with a micro-scale inner architecture in the range of several to several hundreds of micrometers. Therefore, computer-aided construction of scaffolds should provide sophisticated functionality for porous structure design and a tool path generation strategy that can achieve micro-scale architecture. In this study, a new unit cell-based computer-aided manufacturing (CAM) system was developed for the automated design and fabrication of a porous structure with micro-scale inner architecture that can be applied to composite tissue regeneration. The CAM system was developed by first defining a data structure for the computing process of a unit cell representing a single pore structure. Next, an algorithm and software were developed and applied to construct porous structures with a single or multiple pore design using solid freeform fabrication technology and a 3D tooth/spine computer-aided design model. We showed that this system is quite feasible for the design and fabrication of a scaffold for tissue engineering.
An efficient three-dimensional Poisson solver for SIMD high-performance-computing architectures

NASA Technical Reports Server (NTRS)

Cohl, H.

1994-01-01

We present an algorithm that solves the three-dimensional Poisson equation on a cylindrical grid. The technique uses a finite-difference scheme with operator splitting. This splitting maps the banded structure of the operator matrix into a two-dimensional set of tridiagonal matrices, which are then solved in parallel. Our algorithm couples FFT techniques with the well-known ADI (Alternating Direction Implicit) method for solving Elliptic PDE's, and the implementation is extremely well suited for a massively parallel environment like the SIMD architecture of the MasPar MP-1. Due to the highly recursive nature of our problem, we believe that our method is highly efficient, as it avoids excessive interprocessor communication.
On some Aitken-like acceleration of the Schwarz method

NASA Astrophysics Data System (ADS)

Garbey, M.; Tromeur-Dervout, D.

2002-12-01

In this paper we present a family of domain decomposition based on Aitken-like acceleration of the Schwarz method seen as an iterative procedure with a linear rate of convergence. We first present the so-called Aitken-Schwarz procedure for linear differential operators. The solver can be a direct solver when applied to the Helmholtz problem with five-point finite difference scheme on regular grids. We then introduce the Steffensen-Schwarz variant which is an iterative domain decomposition solver that can be applied to linear and nonlinear problems. We show that these solvers have reasonable numerical efficiency compared to classical fast solvers for the Poisson problem or multigrids for more general linear and nonlinear elliptic problems. However, the salient feature of our method is that our algorithm has high tolerance to slow network in the context of distributed parallel computing and is attractive, generally speaking, to use with computer architecture for which performance is limited by the memory bandwidth rather than the flop performance of the CPU. This is nowadays the case for most parallel. computer using the RISC processor architecture. We will illustrate this highly desirable property of our algorithm with large-scale computing experiments.
Scaling to Nanotechnology Limits with the PIMS Computer Architecture and a new Scaling Rule

DOE Office of Scientific and Technical Information (OSTI.GOV)

Debenedictis, Erik P.

2015-02-01

We describe a new approach to computing that moves towards the limits of nanotechnology using a newly formulated sc aling rule. This is in contrast to the current computer industry scali ng away from von Neumann's original computer at the rate of Moore's Law. We extend Moore's Law to 3D, which l eads generally to architectures that integrate logic and memory. To keep pow er dissipation cons tant through a 2D surface of the 3D structure requires using adiabatic principles. We call our newly proposed architecture Processor In Memory and Storage (PIMS). We propose a new computational model that integratesmore » processing and memory into "tiles" that comprise logic, memory/storage, and communications functions. Since the programming model will be relatively stable as a system scales, programs repr esented by tiles could be executed in a PIMS system built with today's technology or could become the "schematic diagram" for implementation in an ultimate 3D nanotechnology of the future. We build a systems software approach that offers advantages over and above the technological and arch itectural advantages. Firs t, the algorithms may be more efficient in the conventional sens e of having fewer steps. Second, the algorithms may run with higher power efficiency per operation by being a better match for the adiabatic scaling ru le. The performance analysis based on demonstrated ideas in physical science suggests 80,000 x improvement in cost per operation for the (arguably) gene ral purpose function of emulating neurons in Deep Learning.« less
An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

DOE PAGES

Vincenti, H.; Lobet, M.; Lehe, R.; ...

2016-09-19

In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD registermore » length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scat;ter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries: OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performances. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less

An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vincenti, H.; Lobet, M.; Lehe, R.

In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD registermore » length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scat;ter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries: OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performances. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less
Benchmarking Memory Performance with the Data Cube Operator

NASA Technical Reports Server (NTRS)

Frumkin, Michael A.; Shabanov, Leonid V.

2004-01-01

Data movement across a computer memory hierarchy and across computational grids is known to be a limiting factor for applications processing large data sets. We use the Data Cube Operator on an Arithmetic Data Set, called ADC, to benchmark capabilities of computers and of computational grids to handle large distributed data sets. We present a prototype implementation of a parallel algorithm for computation of the operatol: The algorithm follows a known approach for computing views from the smallest parent. The ADC stresses all levels of grid memory and storage by producing some of 2d views of an Arithmetic Data Set of d-tuples described by a small number of integers. We control data intensity of the ADC by selecting the tuple parameters, the sizes of the views, and the number of realized views. Benchmarking results of memory performance of a number of computer architectures and of a small computational grid are presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, C.; Yu, G.; Wang, K.

The physical designs of the new concept reactors which have complex structure, various materials and neutronic energy spectrum, have greatly improved the requirements to the calculation methods and the corresponding computing hardware. Along with the widely used parallel algorithm, heterogeneous platforms architecture has been introduced into numerical computations in reactor physics. Because of the natural parallel characteristics, the CPU-FPGA architecture is often used to accelerate numerical computation. This paper studies the application and features of this kind of heterogeneous platforms used in numerical calculation of reactor physics through practical examples. After the designed neutron diffusion module based on CPU-FPGA architecturemore » achieves a 11.2 speed up factor, it is proved to be feasible to apply this kind of heterogeneous platform into reactor physics. (authors)« less
Integrating the Apache Big Data Stack with HPC for Big Data

NASA Astrophysics Data System (ADS)

Fox, G. C.; Qiu, J.; Jha, S.

2014-12-01

There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not so true for data intensive computing, even though commercially clouds devote much more resources to data analytics than supercomputers devote to simulations. We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures. We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks and use these to identify a few key classes of hardware/software architectures. Our analysis builds on combining HPC and ABDS the Apache big data software stack that is well used in modern cloud computing. Initial results on clouds and HPC systems are encouraging. We propose the development of SPIDAL - Scalable Parallel Interoperable Data Analytics Library -- built on system aand data abstractions suggested by the HPC-ABDS architecture. We discuss how it can be used in several application areas including Polar Science.
Experimental comparison of two quantum computing architectures

PubMed Central

Linke, Norbert M.; Maslov, Dmitri; Roetteler, Martin; Debnath, Shantanu; Figgatt, Caroline; Landsman, Kevin A.; Wright, Kenneth; Monroe, Christopher

2017-01-01

We run a selection of algorithms on two state-of-the-art 5-qubit quantum computers that are based on different technology platforms. One is a publicly accessible superconducting transmon device (www.research.ibm.com/ibm-q) with limited connectivity, and the other is a fully connected trapped-ion system. Even though the two systems have different native quantum interactions, both can be programed in a way that is blind to the underlying hardware, thus allowing a comparison of identical quantum algorithms between different physical systems. We show that quantum algorithms and circuits that use more connectivity clearly benefit from a better-connected system of qubits. Although the quantum systems here are not yet large enough to eclipse classical computers, this experiment exposes critical factors of scaling quantum computers, such as qubit connectivity and gate expressivity. In addition, the results suggest that codesigning particular quantum applications with the hardware itself will be paramount in successfully using quantum computers in the future. PMID:28325879
Performance and Accuracy of LAPACK's Symmetric TridiagonalEigensolvers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Demmel, Jim W.; Marques, Osni A.; Parlett, Beresford N.

2007-04-19

We compare four algorithms from the latest LAPACK 3.1 release for computing eigenpairs of a symmetric tridiagonal matrix. These include QR iteration, bisection and inverse iteration (BI), the Divide-and-Conquer method (DC), and the method of Multiple Relatively Robust Representations (MR). Our evaluation considers speed and accuracy when computing all eigenpairs, and additionally subset computations. Using a variety of carefully selected test problems, our study includes a variety of today's computer architectures. Our conclusions can be summarized as follows. (1) DC and MR are generally much faster than QR and BI on large matrices. (2) MR almost always does the fewestmore » floating point operations, but at a lower MFlop rate than all the other algorithms. (3) The exact performance of MR and DC strongly depends on the matrix at hand. (4) DC and QR are the most accurate algorithms with observed accuracy O({radical}ne). The accuracy of BI and MR is generally O(ne). (5) MR is preferable to BI for subset computations.« less
Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

NASA Astrophysics Data System (ADS)

Work, Paul R.

1991-12-01

This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans

NASA Astrophysics Data System (ADS)

Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming

2016-04-01

This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features.
Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans.

PubMed

Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming

2016-04-15

This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features.
Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans

PubMed Central

Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming

2016-01-01

This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features. PMID:27079888
Feasibility of a special-purpose computer to solve the Navier-Stokes equations

NASA Technical Reports Server (NTRS)

Gritton, E. C.; King, W. S.; Sutherland, I.; Gaines, R. S.; Gazley, C., Jr.; Grosch, C.; Juncosa, M.; Petersen, H.

1978-01-01

Orders-of-magnitude improvements in computer performance can be realized with a parallel array of thousands of fast microprocessors. In this architecture, wiring congestion is minimized by limiting processor communication to nearest neighbors. When certain standard algorithms are applied to a viscous flow problem and existing LSI technology is used, performance estimates of this conceptual design show a dramatic decrease in computational time when compared to the CDC 7600.
Non-preconditioned conjugate gradient on cell and FPGA based hybrid supercomputer nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dubois, David H; Dubois, Andrew J; Boorman, Thomas M

2009-01-01

This work presents a detailed implementation of a double precision, non-preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture{sup TM} in conjunction with x86 Opteron{sup TM} processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.
Non-preconditioned conjugate gradient on cell and FPCA-based hybrid supercomputer nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dubois, David H; Dubois, Andrew J; Boorman, Thomas M

2009-03-10

This work presents a detailed implementation of a double precision, Non-Preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture{trademark} in conjunction with x86 Opteron{trademark} processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.
Parallelized seeded region growing using CUDA.

PubMed

Park, Seongjin; Lee, Jeongjin; Lee, Hyunna; Shin, Juneseuk; Seo, Jinwook; Lee, Kyoung Ho; Shin, Yeong-Gil; Kim, Bohyoung

2014-01-01

This paper presents a novel method for parallelizing the seeded region growing (SRG) algorithm using Compute Unified Device Architecture (CUDA) technology, with intention to overcome the theoretical weakness of SRG algorithm of its computation time being directly proportional to the size of a segmented region. The segmentation performance of the proposed CUDA-based SRG is compared with SRG implementations on single-core CPUs, quad-core CPUs, and shader language programming, using synthetic datasets and 20 body CT scans. Based on the experimental results, the CUDA-based SRG outperforms the other three implementations, advocating that it can substantially assist the segmentation during massive CT screening tests.
Program Helps Simulate Neural Networks

NASA Technical Reports Server (NTRS)

Villarreal, James; Mcintire, Gary

1993-01-01

Neural Network Environment on Transputer System (NNETS) computer program provides users high degree of flexibility in creating and manipulating wide variety of neural-network topologies at processing speeds not found in conventional computing environments. Supports back-propagation and back-propagation-related algorithms. Back-propagation algorithm used is implementation of Rumelhart's generalized delta rule. NNETS developed on INMOS Transputer(R). Predefines back-propagation network, Jordan network, and reinforcement network to assist users in learning and defining own networks. Also enables users to configure other neural-network paradigms from NNETS basic architecture. Small portion of software written in OCCAM(R) language.
Prospective Architectures for Onboard vs Cloud-Based Decision Making for Unmanned Aerial Systems

NASA Technical Reports Server (NTRS)

Sankararaman, Shankar; Teubert, Christopher

2017-01-01

This paper investigates propsective architectures for decision-making in unmanned aerial systems. When these unmanned vehicles operate in urban environments, there are several sources of uncertainty that affect their behavior, and decision-making algorithms need to be robust to account for these different sources of uncertainty. It is important to account for several risk-factors that affect the flight of these unmanned systems, and facilitate decision-making by taking into consideration these various risk-factors. In addition, there are several technical challenges related to autonomous flight of unmanned aerial systems; these challenges include sensing, obstacle detection, path planning and navigation, trajectory generation and selection, etc. Many of these activities require significant computational power and in many situations, all of these activities need to be performed in real-time. In order to efficiently integrate these activities, it is important to develop a systematic architecture that can facilitate real-time decision-making. Four prospective architectures are discussed in this paper; on one end of the spectrum, the first architecture considers all activities/computations being performed onboard the vehicle whereas on the other end of the spectrum, the fourth and final architecture considers all activities/computations being performed in the cloud, using a new service known as Prognostics as a Service that is being developed at NASA Ames Research Center. The four different architectures are compared, their advantages and disadvantages are explained and conclusions are presented.
Optimizing Approximate Weighted Matching on Nvidia Kepler K40

DOE Office of Scientific and Technical Information (OSTI.GOV)

Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh

Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms on the other hand generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on Nvidia Kepler K-40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set ofmore » $269$ inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by $10$ to $$100\\times$$ for over $100$ instances, and from $100$ to $$1000\\times$$ for $15$ instances. We also demonstrate up to $$20\\times$$ speedup relative to $2$ threads, and up to $$5\\times$$ relative to $16$ threads on Intel Xeon platform with $16$ cores for the same algorithm. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.« less
Parallel peak pruning for scalable SMP contour tree computation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.

As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this formmore » of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, withfor-mal guarantees of O(lgnlgt) parallel steps and O(n lgn) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.« less
System-on-chip architecture and validation for real-time transceiver optimization: APC implementation on FPGA

NASA Astrophysics Data System (ADS)

Suarez, Hernan; Zhang, Yan R.

2015-05-01

New radar applications need to perform complex algorithms and process large quantity of data to generate useful information for the users. This situation has motivated the search for better processing solutions that include low power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, hardware implementation of adaptive pulse compression for real-time transceiver optimization are presented, they are based on a System-on-Chip architecture for Xilinx devices. This study also evaluates the performance of dedicated coprocessor as hardware accelerator units to speed up and improve the computation of computing-intensive tasks such matrix multiplication and matrix inversion which are essential units to solve the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through the high performance AXI buses, to perform floating-point operations, control the processing blocks, and communicate with external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operations using a Ku-band tested together with a low-cost channel emulator for different types of waveforms.
Topical perspective on massive threading and parallelism.

PubMed

Farber, Robert M

2011-09-01

Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three-orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive-threading and massive-parallelism such as the 1.3 million threads Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world is how to capitalize on these new massively threaded computational architectures--especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelize with some insight into the future. Published by Elsevier Inc.

On recursive least-squares filtering algorithms and implementations. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Hsieh, Shih-Fu

1990-01-01

In many real-time signal processing applications, fast and numerically stable algorithms for solving least-squares problems are necessary and important. In particular, under non-stationary conditions, these algorithms must be able to adapt themselves to reflect the changes in the system and take appropriate adjustments to achieve optimum performances. Among existing algorithms, the QR-decomposition (QRD)-based recursive least-squares (RLS) methods have been shown to be useful and effective for adaptive signal processing. In order to increase the speed of processing and achieve high throughput rate, many algorithms are being vectorized and/or pipelined to facilitate high degrees of parallelism. A time-recursive formulation of RLS filtering employing block QRD will be considered first. Several methods, including a new non-continuous windowing scheme based on selectively rejecting contaminated data, were investigated for adaptive processing. Based on systolic triarrays, many other forms of systolic arrays are shown to be capable of implementing different algorithms. Various updating and downdating systolic algorithms and architectures for RLS filtering are examined and compared in details, which include Householder reflector, Gram-Schmidt procedure, and Givens rotation. A unified approach encompassing existing square-root-free algorithms is also proposed. For the sinusoidal spectrum estimation problem, a judicious method of separating the noise from the signal is of great interest. Various truncated QR methods are proposed for this purpose and compared to the truncated SVD method. Computer simulations provided for detailed comparisons show the effectiveness of these methods. This thesis deals with fundamental issues of numerical stability, computational efficiency, adaptivity, and VLSI implementation for the RLS filtering problems. In all, various new and modified algorithms and architectures are proposed and analyzed; the significance of any of the new method depends crucially on specific application.
A bibliography on parallel and vector numerical algorithms

NASA Technical Reports Server (NTRS)

Ortega, James M.; Voigt, Robert G.; Romine, Charles H.

1988-01-01

This is a bibliography on numerical methods. It also includes a number of other references on machine architecture, programming language, and other topics of interest to scientific computing. Certain conference proceedings and anthologies which have been published in book form are also listed.
A bibliography on parallel and vector numerical algorithms

NASA Technical Reports Server (NTRS)

Ortega, J. M.; Voigt, R. G.

1987-01-01

This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming language, and other topics of interest to scientific computing. Certain conference proceedings and anthologies which have been published in book form are listed also.
A bibliography on parallel and vector numerical algorithms

NASA Technical Reports Server (NTRS)

Ortega, James M.; Voigt, Robert G.; Romine, Charles H.

1990-01-01

This is a bibliography on numerical methods. It also includes a number of other references on machine architecture, programming language, and other topics of interest to scientific computing. Certain conference proceedings and anthologies which have been published in book form are also listed.
Binary mesh partitioning for cache-efficient visualization.

PubMed

Tchiboukdjian, Marc; Danjean, Vincent; Raffin, Bruno

2010-01-01

One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take into consideration the memory hierarchy to design cache efficient algorithms. CO approaches have the advantage to adapt to unknown and varying memory hierarchies. Recent CA and CO algorithms developed for 3D mesh layouts significantly improve performance of previous approaches, but they lack of theoretical performance guarantees. We present in this paper a {\\schmi O}(N\\log N) algorithm to compute a CO layout for unstructured but well shaped meshes. We prove that a coherent traversal of a N-size mesh in dimension d induces less than N/B+{\\schmi O}(N/M;{1/d}) cache-misses where B and M are the block size and the cache size, respectively. Experiments show that our layout computation is faster and significantly less memory consuming than the best known CO algorithm. Performance is comparable to this algorithm for classical visualization algorithm access patterns, or better when the BSP tree produced while computing the layout is used as an acceleration data structure adjusted to the layout. We also show that cache oblivious approaches lead to significant performance increases on recent GPU architectures.
On the impact of approximate computation in an analog DeSTIN architecture.

PubMed

Young, Steven; Lu, Junjie; Holleman, Jeremy; Arel, Itamar

2014-05-01

Deep machine learning (DML) holds the potential to revolutionize machine learning by automating rich feature extraction, which has become the primary bottleneck of human engineering in pattern recognition systems. However, the heavy computational burden renders DML systems implemented on conventional digital processors impractical for large-scale problems. The highly parallel computations required to implement large-scale deep learning systems are well suited to custom hardware. Analog computation has demonstrated power efficiency advantages of multiple orders of magnitude relative to digital systems while performing nonideal computations. In this paper, we investigate typical error sources introduced by analog computational elements and their impact on system-level performance in DeSTIN--a compositional deep learning architecture. These inaccuracies are evaluated on a pattern classification benchmark, clearly demonstrating the robustness of the underlying algorithm to the errors introduced by analog computational elements. A clear understanding of the impacts of nonideal computations is necessary to fully exploit the efficiency of analog circuits.
Importance of balanced architectures in the design of high-performance imaging systems

NASA Astrophysics Data System (ADS)

Sgro, Joseph A.; Stanton, Paul C.

1999-03-01

Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high- performances computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high performance general-purpose microprocessor technology have produced an abundance of low cost components suitable for use in high-performance computing systems. A common pitfall in the design of high performance imaging system, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The symptom characteristic of this problem is failure to the performance of the system to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high performance external memory interfaces makes it practical to design high performance imaging system with balanced computational and memory bandwidth. Real word examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
Dawn: A Simulation Model for Evaluating Costs and Tradeoffs of Big Data Science Architectures

NASA Astrophysics Data System (ADS)

Cinquini, L.; Crichton, D. J.; Braverman, A. J.; Kyo, L.; Fuchs, T.; Turmon, M.

2014-12-01

In many scientific disciplines, scientists and data managers are bracing for an upcoming deluge of big data volumes, which will increase the size of current data archives by a factor of 10-100 times. For example, the next Climate Model Inter-comparison Project (CMIP6) will generate a global archive of model output of approximately 10-20 Peta-bytes, while the upcoming next generation of NASA decadal Earth Observing instruments are expected to collect tens of Giga-bytes/day. In radio-astronomy, the Square Kilometre Array (SKA) will collect data in the Exa-bytes/day range, of which (after reduction and processing) around 1.5 Exa-bytes/year will be stored. The effective and timely processing of these enormous data streams will require the design of new data reduction and processing algorithms, new system architectures, and new techniques for evaluating computation uncertainty. Yet at present no general software tool or framework exists that will allow system architects to model their expected data processing workflow, and determine the network, computational and storage resources needed to prepare their data for scientific analysis. In order to fill this gap, at NASA/JPL we have been developing a preliminary model named DAWN (Distributed Analytics, Workflows and Numerics) for simulating arbitrary complex workflows composed of any number of data processing and movement tasks. The model can be configured with a representation of the problem at hand (the data volumes, the processing algorithms, the available computing and network resources), and is able to evaluate tradeoffs between different possible workflows based on several estimators: overall elapsed time, separate computation and transfer times, resulting uncertainty, and others. So far, we have been applying DAWN to analyze architectural solutions for 4 different use cases from distinct science disciplines: climate science, astronomy, hydrology and a generic cloud computing use case. This talk will present preliminary results and discuss how DAWN can be evolved into a powerful tool for designing system architectures for data intensive science.
Virtual Network Embedding via Monte Carlo Tree Search.

PubMed

Haeri, Soroush; Trajkovic, Ljiljana

2018-02-01

Network virtualization helps overcome shortcomings of the current Internet architecture. The virtualized network architecture enables coexistence of multiple virtual networks (VNs) on an existing physical infrastructure. VN embedding (VNE) problem, which deals with the embedding of VN components onto a physical network, is known to be -hard. In this paper, we propose two VNE algorithms: MaVEn-M and MaVEn-S. MaVEn-M employs the multicommodity flow algorithm for virtual link mapping while MaVEn-S uses the shortest-path algorithm. They formalize the virtual node mapping problem by using the Markov decision process (MDP) framework and devise action policies (node mappings) for the proposed MDP using the Monte Carlo tree search algorithm. Service providers may adjust the execution time of the MaVEn algorithms based on the traffic load of VN requests. The objective of the algorithms is to maximize the profit of infrastructure providers. We develop a discrete event VNE simulator to implement and evaluate performance of MaVEn-M, MaVEn-S, and several recently proposed VNE algorithms. We introduce profitability as a new performance metric that captures both acceptance and revenue to cost ratios. Simulation results show that the proposed algorithms find more profitable solutions than the existing algorithms. Given additional computation time, they further improve embedding solutions.
Extreme-Scale Algorithms & Software Resilience (EASIR) Architecture-Aware Algorithms for Scalable Performance and Resilience on Heterogeneous Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Demmel, James W.

This project addresses both communication-avoiding algorithms, and reproducible floating-point computation. Communication, i.e. moving data, either between levels of memory or processors over a network, is much more expensive per operation than arithmetic (measured in time or energy), so we seek algorithms that greatly reduce communication. We developed many new algorithms for both dense and sparse, and both direct and iterative linear algebra, attaining new communication lower bounds, and getting large speedups in many cases. We also extended this work in several ways: (1) We minimize writes separately from reads, since writes may be much more expensive than reads on emergingmore » memory technologies, like Flash, sometimes doing asymptotically fewer writes than reads. (2) We extend the lower bounds and optimal algorithms to arbitrary algorithms that may be expressed as perfectly nested loops accessing arrays, where the array subscripts may be arbitrary affine functions of the loop indices (eg A(i), B(i,j+k, k+3*m-7, …) etc.). (3) We extend our communication-avoiding approach to some machine learning algorithms, such as support vector machines. This work has won a number of awards. We also address reproducible floating-point computation. We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should ideally not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociativity of floating point addition, makes attaining reproducibility a challenge even for simple operations like summing a vector of numbers, or more complicated operations like the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible sum of floating point numbers, independent of the order of summation. The algorithm depends only on a subset of the IEEE Floating Point Standard 754-2008, uses just 6 words to represent a “reproducible accumulator,” and requires just one read-only pass over the data, or one reduction in parallel. New instructions based on this work are being considered for inclusion in the future IEEE 754-2018 floating-point standard, and new reproducible BLAS are being considered for the next version of the BLAS standard.« less
A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration

NASA Astrophysics Data System (ADS)

Lin, Mingpei; Xu, Ming; Fu, Xiaoyu

2017-05-01

Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
Implementation of real-time digital signal processing systems

NASA Technical Reports Server (NTRS)

Narasimha, M.; Peterson, A.; Narayan, S.

1978-01-01

Special purpose hardware implementation of DFT Computers and digital filters is considered in the light of newly introduced algorithms and IC devices. Recent work by Winograd on high-speed convolution techniques for computing short length DFT's, has motivated the development of more efficient algorithms, compared to the FFT, for evaluating the transform of longer sequences. Among these, prime factor algorithms appear suitable for special purpose hardware implementations. Architectural considerations in designing DFT computers based on these algorithms are discussed. With the availability of monolithic multiplier-accumulators, a direct implementation of IIR and FIR filters, using random access memories in place of shift registers, appears attractive. The memory addressing scheme involved in such implementations is discussed. A simple counter set-up to address the data memory in the realization of FIR filters is also described. The combination of a set of simple filters (weighting network) and a DFT computer is shown to realize a bank of uniform bandpass filters. The usefulness of this concept in arriving at a modular design for a million channel spectrum analyzer, based on microprocessors, is discussed.
Quantum plug n’ play: modular computation in the quantum regime

NASA Astrophysics Data System (ADS)

Thompson, Jayne; Modi, Kavan; Vedral, Vlatko; Gu, Mile

2018-01-01

Classical computation is modular. It exploits plug n’ play architectures which allow us to use pre-fabricated circuits without knowing their construction. This bestows advantages such as allowing parts of the computational process to be outsourced, and permitting individual circuit components to be exchanged and upgraded. Here, we introduce a formal framework to describe modularity in the quantum regime. We demonstrate a ‘no-go’ theorem, stipulating that it is not always possible to make use of quantum circuits without knowing their construction. This has significant consequences for quantum algorithms, forcing the circuit implementation of certain quantum algorithms to be rebuilt almost entirely from scratch after incremental changes in the problem—such as changing the number being factored in Shor’s algorithm. We develop a workaround capable of restoring modularity, and apply it to design a modular version of Shor’s algorithm that exhibits increased versatility and reduced complexity. In doing so we pave the way to a realistic framework whereby ‘quantum chips’ and remote servers can be invoked (or assembled) to implement various parts of a more complex quantum computation.
Evaluation of stochastic algorithms for financial mathematics problems from point of view of energy-efficiency

DOE Office of Scientific and Technical Information (OSTI.GOV)

Atanassov, E.; Dimitrov, D., E-mail: d.slavov@bas.bg, E-mail: emanouil@parallel.bas.bg, E-mail: gurov@bas.bg; Gurov, T.

2015-10-28

The recent developments in the area of high-performance computing are driven not only by the desire for ever higher performance but also by the rising costs of electricity. The use of various types of accelerators like GPUs, Intel Xeon Phi has become mainstream and many algorithms and applications have been ported to make use of them where available. In Financial Mathematics the question of optimal use of computational resources should also take into account the limitations on space, because in many use cases the servers are deployed close to the exchanges. In this work we evaluate various algorithms for optionmore » pricing that we have implemented for different target architectures in terms of their energy and space efficiency. Since it has been established that low-discrepancy sequences may be better than pseudorandom numbers for these types of algorithms, we also test the Sobol and Halton sequences. We present the raw results, the computed metrics and conclusions from our tests.« less
Evaluation of stochastic algorithms for financial mathematics problems from point of view of energy-efficiency

NASA Astrophysics Data System (ADS)

Atanassov, E.; Dimitrov, D.; Gurov, T.

2015-10-01

The recent developments in the area of high-performance computing are driven not only by the desire for ever higher performance but also by the rising costs of electricity. The use of various types of accelerators like GPUs, Intel Xeon Phi has become mainstream and many algorithms and applications have been ported to make use of them where available. In Financial Mathematics the question of optimal use of computational resources should also take into account the limitations on space, because in many use cases the servers are deployed close to the exchanges. In this work we evaluate various algorithms for option pricing that we have implemented for different target architectures in terms of their energy and space efficiency. Since it has been established that low-discrepancy sequences may be better than pseudorandom numbers for these types of algorithms, we also test the Sobol and Halton sequences. We present the raw results, the computed metrics and conclusions from our tests.
Development of iterative techniques for the solution of unsteady compressible viscous flows

NASA Technical Reports Server (NTRS)

Sankar, Lakshmi; Hixon, Duane

1993-01-01

The work done under this project was documented in detail as the Ph. D. dissertation of Dr. Duane Hixon. The objectives of the research project were evaluation of the generalized minimum residual method (GMRES) as a tool for accelerating 2-D and 3-D unsteady flows and evaluation of the suitability of the GMRES algorithm for unsteady flows, computed on parallel computer architectures.
Optimized Laplacian image sharpening algorithm based on graphic processing unit

NASA Astrophysics Data System (ADS)

Ma, Tinghuai; Li, Lu; Ji, Sai; Wang, Xin; Tian, Yuan; Al-Dhelaan, Abdullah; Al-Rodhaan, Mznah

2014-12-01

In classical Laplacian image sharpening, all pixels are processed one by one, which leads to large amount of computation. Traditional Laplacian sharpening processed on CPU is considerably time-consuming especially for those large pictures. In this paper, we propose a parallel implementation of Laplacian sharpening based on Compute Unified Device Architecture (CUDA), which is a computing platform of Graphic Processing Units (GPU), and analyze the impact of picture size on performance and the relationship between the processing time of between data transfer time and parallel computing time. Further, according to different features of different memory, an improved scheme of our method is developed, which exploits shared memory in GPU instead of global memory and further increases the efficiency. Experimental results prove that two novel algorithms outperform traditional consequentially method based on OpenCV in the aspect of computing speed.
3D Data Denoising via Nonlocal Means Filter by Using Parallel GPU Strategies

PubMed Central

Cuomo, Salvatore; De Michele, Pasquale; Piccialli, Francesco

2014-01-01

Nonlocal Means (NLM) algorithm is widely considered as a state-of-the-art denoising filter in many research fields. Its high computational complexity leads researchers to the development of parallel programming approaches and the use of massively parallel architectures such as the GPUs. In the recent years, the GPU devices had led to achieving reasonable running times by filtering, slice-by-slice, and 3D datasets with a 2D NLM algorithm. In our approach we design and implement a fully 3D NonLocal Means parallel approach, adopting different algorithm mapping strategies on GPU architecture and multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the usability of our approach in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising. PMID:25045397
Procedural Quantum Programming

NASA Astrophysics Data System (ADS)

Ömer, Bernhard

2002-09-01

While classical computing science has developed a variety of methods and programming languages around the concept of the universal computer, the typical description of quantum algorithms still uses a purely mathematical, non-constructive formalism which makes no difference between a hydrogen atom and a quantum computer. This paper investigates, how the concept of procedural programming languages, the most widely used classical formalism for describing and implementing algorithms, can be adopted to the field of quantum computing, and how non-classical features like the reversibility of unitary transformations, the non-observability of quantum states or the lack of copy and erase operations can be reflected semantically. It introduces the key concepts of procedural quantum programming (hybrid target architecture, operator hierarchy, quantum data types, memory management, etc.) and presents the experimental language QCL, which implements these principles.
Cloud computing for comparative genomics

PubMed Central

2010-01-01

Background Large comparative genomics studies and tools are becoming increasingly more compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive with the increase, especially as the breadth of questions continues to rise. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Computing Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes. Results We ran more than 300,000 RSD-cloud processes within the EC2. These jobs were farmed simultaneously to 100 high capacity compute nodes using the Amazon Web Service Elastic Map Reduce and included a wide mix of large and small genomes. The total computation time took just under 70 hours and cost a total of $6,302 USD. Conclusions The effort to transform existing comparative genomics algorithms from local compute infrastructures is not trivial. However, the speed and flexibility of cloud computing environments provides a substantial boost with manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems. PMID:20482786

Cloud computing for comparative genomics.

PubMed

Wall, Dennis P; Kudtarkar, Parul; Fusaro, Vincent A; Pivovarov, Rimma; Patil, Prasad; Tonellato, Peter J

2010-05-18

Large comparative genomics studies and tools are becoming increasingly more compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive with the increase, especially as the breadth of questions continues to rise. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Computing Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes. We ran more than 300,000 RSD-cloud processes within the EC2. These jobs were farmed simultaneously to 100 high capacity compute nodes using the Amazon Web Service Elastic Map Reduce and included a wide mix of large and small genomes. The total computation time took just under 70 hours and cost a total of $6,302 USD. The effort to transform existing comparative genomics algorithms from local compute infrastructures is not trivial. However, the speed and flexibility of cloud computing environments provides a substantial boost with manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems.
An Evaluation of an Ada Implementation of the Rete Algorithm for Embedded Flight Processors

DTIC Science & Technology

1990-12-01

computers was desired. The VAX VMS operating system has many built-in methods for determining program performance (including VAX PCA), but these methods... overviev , of the target environment-- the MIL-STD-1750A VHSIC Avionic Modular Processor ( VA.IP, running under the Ada Avionics Real-Time Software (AARTS... computers . Mil-STD-1750A, the Air Force’s standard flight computer architecture, however, places severe constraints on applications software processing
Gallium arsenide processing elements for motion estimation full-search algorithm

NASA Astrophysics Data System (ADS)

Lopez, Jose F.; Cortes, P.; Lopez, S.; Sarmiento, Roberto

2001-11-01

The Block-Matching motion estimation algorithm (BMA) is the most popular method for motion-compensated coding of image sequence. Among the several possible searching methods to compute this algorithm, the full-search BMA (FBMA) has obtained great interest from the scientific community due to its regularity, optimal solution and low control overhead which simplifies its VLSI realization. On the other hand, its main drawback is the demand of an enormous amount of computation. There are different ways of overcoming this factor, being the use of advanced technologies, such as Gallium Arsenide (GaAs), the one adopted in this article together with different techniques to reduce area overhead. By exploiting GaAs properties, improvements can be obtained in the implementation of feasible systems for real time video compression architectures. Different primitives used in the implementation of processing elements (PE) for a FBMA scheme are presented. As a result, Pes running at 270 MHz have been developed in order to study its functionality and performance. From these results, an implementation for MPEG applications is proposed, leading to an architecture running at 145 MHz with a power dissipation of 3.48 W and an area of 11.5 mm2.
Implementation of MPEG-2 encoder to multiprocessor system using multiple MVPs (TMS320C80)

NASA Astrophysics Data System (ADS)

Kim, HyungSun; Boo, Kenny; Chung, SeokWoo; Choi, Geon Y.; Lee, YongJin; Jeon, JaeHo; Park, Hyun Wook

1997-05-01

This paper presents the efficient algorithm mapping for the real-time MPEG-2 encoding on the KAIST image computing system (KICS), which has a parallel architecture using five multimedia video processors (MVPs). The MVP is a general purpose digital signal processor (DSP) of Texas Instrument. It combines one floating-point processor and four fixed- point DSPs on a single chip. The KICS uses the MVP as a primary processing element (PE). Two PEs form a cluster, and there are two processing clusters in the KICS. Real-time MPEG-2 encoder is implemented through the spatial and the functional partitioning strategies. Encoding process of spatially partitioned half of the video input frame is assigned to ne processing cluster. Two PEs perform the functionally partitioned MPEG-2 encoding tasks in the pipelined operation mode. One PE of a cluster carries out the transform coding part and the other performs the predictive coding part of the MPEG-2 encoding algorithm. One MVP among five MVPs is used for system control and interface with host computer. This paper introduces an implementation of the MPEG-2 algorithm with a parallel processing architecture.
Low-cost space-varying FIR filter architecture for computational imaging systems

NASA Astrophysics Data System (ADS)

Feng, Guotong; Shoaib, Mohammed; Schwartz, Edward L.; Dirk Robinson, M.

2010-01-01

Recent research demonstrates the advantage of designing electro-optical imaging systems by jointly optimizing the optical and digital subsystems. The optical systems designed using this joint approach intentionally introduce large and often space-varying optical aberrations that produce blurry optical images. Digital sharpening restores reduced contrast due to these intentional optical aberrations. Computational imaging systems designed in this fashion have several advantages including extended depth-of-field, lower system costs, and improved low-light performance. Currently, most consumer imaging systems lack the necessary computational resources to compensate for these optical systems with large aberrations in the digital processor. Hence, the exploitation of the advantages of the jointly designed computational imaging system requires low-complexity algorithms enabling space-varying sharpening. In this paper, we describe a low-cost algorithmic framework and associated hardware enabling the space-varying finite impulse response (FIR) sharpening required to restore largely aberrated optical images. Our framework leverages the space-varying properties of optical images formed using rotationally-symmetric optical lens elements. First, we describe an approach to leverage the rotational symmetry of the point spread function (PSF) about the optical axis allowing computational savings. Second, we employ a specially designed bank of sharpening filters tuned to the specific radial variation common to optical aberrations. We evaluate the computational efficiency and image quality achieved by using this low-cost space-varying FIR filter architecture.
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
Parallel-Computing Architecture for JWST Wavefront-Sensing Algorithms

DTIC Science & Technology

2011-09-01

results due to the increasing cost and complexity of each test. 2. ALGORITHM OVERVIEW Phase retrieval is an image-based wavefront-sensing...broadband illumination problems we have found that hand-tuning the right matrix sizes can account for a speedup of 86x faster. This comes from hand-picking...Wavefront Sensing and Control”. Proceedings of SPIE (2007) vol. 6687 (08). [5] Greenhouse, M. A., Drury , M. P., Dunn, J. L., Glazer, S. D., Greville, E
Computational representation of the aponeuroses as NURBS surfaces in 3D musculoskeletal models.

PubMed

Wu, Florence T H; Ng-Thow-Hing, Victor; Singh, Karan; Agur, Anne M; McKee, Nancy H

2007-11-01

Computational musculoskeletal (MSK) models - 3D graphics-based models that accurately simulate the anatomical architecture and/or the biomechanical behaviour of organ systems consisting of skeletal muscles, tendons, ligaments, cartilage and bones - are valued biomedical tools, with applications ranging from pathological diagnosis to surgical planning. However, current MSK models are often limited by their oversimplifications in anatomical geometries, sometimes lacking discrete representations of connective tissue components entirely, which ultimately affect their accuracy in biomechanical simulation. In particular, the aponeuroses - the flattened fibrous connective sheets connecting muscle fibres to tendons - have never been geometrically modeled. The initiative was thus to extend Anatomy3D - a previously developed software bundle for reconstructing muscle fibre architecture - to incorporate aponeurosis-modeling capacity. Two different algorithms for aponeurosis reconstruction were written in the MEL scripting language of the animation software Maya 6.0, using its NURBS (non-uniform rational B-splines) modeling functionality for aponeurosis surface representation. Both algorithms were validated qualitatively against anatomical and functional criteria.
High Performance Radiation Transport Simulations on TITAN

DOE Office of Scientific and Technical Information (OSTI.GOV)

Baker, Christopher G; Davidson, Gregory G; Evans, Thomas M

2012-01-01

In this paper we describe the Denovo code system. Denovo solves the six-dimensional, steady-state, linear Boltzmann transport equation, of central importance to nuclear technology applications such as reactor core analysis (neutronics), radiation shielding, nuclear forensics and radiation detection. The code features multiple spatial differencing schemes, state-of-the-art linear solvers, the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm for inverting the transport operator, a new multilevel energy decomposition method scaling to hundreds of thousands of processing cores, and a modern, novel code architecture that supports straightforward integration of new features. In this paper we discuss the performance of Denovo on the 10--20 petaflop ORNLmore » GPU-based system, Titan. We describe algorithms and techniques used to exploit the capabilities of Titan's heterogeneous compute node architecture and the challenges of obtaining good parallel performance for this sparse hyperbolic PDE solver containing inherently sequential computations. Numerical results demonstrating Denovo performance on early Titan hardware are presented.« less
Computer aided design of architecture of degradable tissue engineering scaffolds.

PubMed

Heljak, M K; Kurzydlowski, K J; Swieszkowski, W

2017-11-01

One important factor affecting the process of tissue regeneration is scaffold stiffness loss, which should be properly balanced with the rate of tissue regeneration. The aim of the research reported here was to develop a computer tool for designing the architecture of biodegradable scaffolds fabricated by melt-dissolution deposition systems (e.g. Fused Deposition Modeling) to provide the required scaffold stiffness at each stage of degradation/regeneration. The original idea presented in the paper is that the stiffness of a tissue engineering scaffold can be controlled during degradation by means of a proper selection of the diameter of the constituent fibers and the distances between them. This idea is based on the size-effect on degradation of aliphatic polyesters. The presented computer tool combines a genetic algorithm and a diffusion-reaction model of polymer hydrolytic degradation. In particular, we show how to design the architecture of scaffolds made of poly(DL-lactide-co-glycolide) with the required Young's modulus change during hydrolytic degradation.
Automated problem scheduling and reduction of synchronization delay effects

NASA Technical Reports Server (NTRS)

Saltz, Joel H.

1987-01-01

It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable preformance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acrylic graphs, and (2) because such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs.
FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems.

PubMed

Abella, Monica; Serrano, Estefania; Garcia-Blas, Javier; García, Ines; de Molina, Claudia; Carretero, Jesus; Desco, Manuel

2017-01-01

The availability of digital X-ray detectors, together with advances in reconstruction algorithms, creates an opportunity for bringing 3D capabilities to conventional radiology systems. The downside is that reconstruction algorithms for non-standard acquisition protocols are generally based on iterative approaches that involve a high computational burden. The development of new flexible X-ray systems could benefit from computer simulations, which may enable performance to be checked before expensive real systems are implemented. The development of simulation/reconstruction algorithms in this context poses three main difficulties. First, the algorithms deal with large data volumes and are computationally expensive, thus leading to the need for hardware and software optimizations. Second, these optimizations are limited by the high flexibility required to explore new scanning geometries, including fully configurable positioning of source and detector elements. And third, the evolution of the various hardware setups increases the effort required for maintaining and adapting the implementations to current and future programming models. Previous works lack support for completely flexible geometries and/or compatibility with multiple programming models and platforms. In this paper, we present FUX-Sim, a novel X-ray simulation/reconstruction framework that was designed to be flexible and fast. Optimized implementation for different families of GPUs (CUDA and OpenCL) and multi-core CPUs was achieved thanks to a modularized approach based on a layered architecture and parallel implementation of the algorithms for both architectures. A detailed performance evaluation demonstrates that for different system configurations and hardware platforms, FUX-Sim maximizes performance with the CUDA programming model (5 times faster than other state-of-the-art implementations). Furthermore, the CPU and OpenCL programming models allow FUX-Sim to be executed over a wide range of hardware platforms.
Logical NAND and NOR Operations Using Algorithmic Self-assembly of DNA Molecules

NASA Astrophysics Data System (ADS)

Wang, Yanfeng; Cui, Guangzhao; Zhang, Xuncai; Zheng, Yan

DNA self-assembly is the most advanced and versatile system that has been experimentally demonstrated for programmable construction of patterned systems on the molecular scale. It has been demonstrated that the simple binary arithmetic and logical operations can be computed by the process of self assembly of DNA tiles. Here we report a one-dimensional algorithmic self-assembly of DNA triple-crossover molecules that can be used to execute five steps of a logical NAND and NOR operations on a string of binary bits. To achieve this, abstract tiles were translated into DNA tiles based on triple-crossover motifs. Serving as input for the computation, long single stranded DNA molecules were used to nucleate growth of tiles into algorithmic crystals. Our method shows that engineered DNA self-assembly can be treated as a bottom-up design techniques, and can be capable of designing DNA computer organization and architecture.
Real-time implementation of optimized maximum noise fraction transform for feature extraction of hyperspectral images

NASA Astrophysics Data System (ADS)

Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Zhao, Haina; Li, Jun

2014-01-01

We present a parallel implementation of the optimized maximum noise fraction (G-OMNF) transform algorithm for feature extraction of hyperspectral images on commodity graphics processing units (GPUs). The proposed approach explored the algorithm data-level concurrency and optimized the computing flow. We first defined a three-dimensional grid, in which each thread calculates a sub-block data to easily facilitate the spatial and spectral neighborhood data searches in noise estimation, which is one of the most important steps involved in OMNF. Then, we optimized the processing flow and computed the noise covariance matrix before computing the image covariance matrix to reduce the original hyperspectral image data transmission. These optimization strategies can greatly improve the computing efficiency and can be applied to other feature extraction algorithms. The proposed parallel feature extraction algorithm was implemented on an Nvidia Tesla GPU using the compute unified device architecture and basic linear algebra subroutines library. Through the experiments on several real hyperspectral images, our GPU parallel implementation provides a significant speedup of the algorithm compared with the CPU implementation, especially for highly data parallelizable and arithmetically intensive algorithm parts, such as noise estimation. In order to further evaluate the effectiveness of G-OMNF, we used two different applications: spectral unmixing and classification for evaluation. Considering the sensor scanning rate and the data acquisition time, the proposed parallel implementation met the on-board real-time feature extraction.
Novel structures for Discrete Hartley Transform based on first-order moments

NASA Astrophysics Data System (ADS)

Xiong, Jun; Zheng, Wenjuan; Wang, Hao; Liu, Jianguo

2018-03-01

Discrete Hartley Transform (DHT) is an important tool in digital signal processing. In the present paper, the DHT is firstly transformed into the first-order moments-based form, then a new fast algorithm is proposed to calculate the first-order moments without multiplication. Based on the algorithm theory, the corresponding hardware architecture for DHT is proposed, which only contains shift operations and additions with no need for multipliers and large memory. To verify the availability and effectiveness, the proposed design is implemented with hardware description language and synthesized by Synopsys Design Compiler with 0.18-μm SMIC library. A series of experiments have proved that the proposed architecture has better performance in terms of the product of the hardware consumption and computation time.
Applications of Parallel Computation in Micro-Mechanics and Finite Element Method

NASA Technical Reports Server (NTRS)

Tan, Hui-Qian

1996-01-01

This project discusses the application of parallel computations related with respect to material analyses. Briefly speaking, we analyze some kind of material by elements computations. We call an element a cell here. A cell is divided into a number of subelements called subcells and all subcells in a cell have the identical structure. The detailed structure will be given later in this paper. It is obvious that the problem is "well-structured". SIMD machine would be a better choice. In this paper we try to look into the potentials of SIMD machine in dealing with finite element computation by developing appropriate algorithms on MasPar, a SIMD parallel machine. In section 2, the architecture of MasPar will be discussed. A brief review of the parallel programming language MPL also is given in that section. In section 3, some general parallel algorithms which might be useful to the project will be proposed. And, combining with the algorithms, some features of MPL will be discussed in more detail. In section 4, the computational structure of cell/subcell model will be given. The idea of designing the parallel algorithm for the model will be demonstrated. Finally in section 5, a summary will be given.
A programmable five qubit quantum computer using trapped atomic ions

NASA Astrophysics Data System (ADS)

Debnath, Shantanu

2017-04-01

In order to harness the power of quantum information processing, several candidate systems have been investigated, and tailored to demonstrate only specific computations. In my thesis work, we construct a general-purpose multi-qubit device using a linear chain of trapped ion qubits, which in principle can be programmed to run any quantum algorithm. To achieve such flexibility, we develop a pulse shaping technique to realize a set of fully connected two-qubit rotations that entangle arbitrary pairs of qubits using multiple motional modes of the chain. Following a computation architecture, such highly expressive two-qubit gates along with arbitrary single-qubit rotations can be used to compile modular universal logic gates that are effected by targeted optical fields and hence can be reconfigured according to any algorithm circuit programmed in the software. As a demonstration, we run the Deutsch-Jozsa and Bernstein-Vazirani algorithm, and a fully coherent quantum Fourier transform, that we use to solve the `period finding' and `quantum phase estimation' problem. Combining these results with recent demonstrations of quantum fault-tolerance, Grover's search algorithm, and simulation of boson hopping establishes the versatility of such a computation module that can potentially be connected to other modules for future large-scale computations.
Addressing fundamental architectural challenges of an activity-based intelligence and advanced analytics (ABIAA) system

NASA Astrophysics Data System (ADS)

Yager, Kevin; Albert, Thomas; Brower, Bernard V.; Pellechia, Matthew F.

2015-06-01

The domain of Geospatial Intelligence Analysis is rapidly shifting toward a new paradigm of Activity Based Intelligence (ABI) and information-based Tipping and Cueing. General requirements for an advanced ABIAA system present significant challenges in architectural design, computing resources, data volumes, workflow efficiency, data mining and analysis algorithms, and database structures. These sophisticated ABI software systems must include advanced algorithms that automatically flag activities of interest in less time and within larger data volumes than can be processed by human analysts. In doing this, they must also maintain the geospatial accuracy necessary for cross-correlation of multi-intelligence data sources. Historically, serial architectural workflows have been employed in ABIAA system design for tasking, collection, processing, exploitation, and dissemination. These simpler architectures may produce implementations that solve short term requirements; however, they have serious limitations that preclude them from being used effectively in an automated ABIAA system with multiple data sources. This paper discusses modern ABIAA architectural considerations providing an overview of an advanced ABIAA system and comparisons to legacy systems. It concludes with a recommended strategy and incremental approach to the research, development, and construction of a fully automated ABIAA system.
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Platform Architecture for Decentralized Positioning Systems.

PubMed

Kasmi, Zakaria; Norrdine, Abdelmoumen; Blankenbach, Jörg

2017-04-26

A platform architecture for positioning systems is essential for the realization of a flexible localization system, which interacts with other systems and supports various positioning technologies and algorithms. The decentralized processing of a position enables pushing the application-level knowledge into a mobile station and avoids the communication with a central unit such as a server or a base station. In addition, the calculation of the position on low-cost and resource-constrained devices presents a challenge due to the limited computing, storage capacity, as well as power supply. Therefore, we propose a platform architecture that enables the design of a system with the reusability of the components, extensibility (e.g., with other positioning technologies) and interoperability. Furthermore, the position is computed on a low-cost device such as a microcontroller, which simultaneously performs additional tasks such as data collecting or preprocessing based on an operating system. The platform architecture is designed, implemented and evaluated on the basis of two positioning systems: a field strength system and a time of arrival-based positioning system.

Platform Architecture for Decentralized Positioning Systems

PubMed Central

Kasmi, Zakaria; Norrdine, Abdelmoumen; Blankenbach, Jörg

2017-01-01

A platform architecture for positioning systems is essential for the realization of a flexible localization system, which interacts with other systems and supports various positioning technologies and algorithms. The decentralized processing of a position enables pushing the application-level knowledge into a mobile station and avoids the communication with a central unit such as a server or a base station. In addition, the calculation of the position on low-cost and resource-constrained devices presents a challenge due to the limited computing, storage capacity, as well as power supply. Therefore, we propose a platform architecture that enables the design of a system with the reusability of the components, extensibility (e.g., with other positioning technologies) and interoperability. Furthermore, the position is computed on a low-cost device such as a microcontroller, which simultaneously performs additional tasks such as data collecting or preprocessing based on an operating system. The platform architecture is designed, implemented and evaluated on the basis of two positioning systems: a field strength system and a time of arrival-based positioning system. PMID:28445414
Acoustic transient classification with a template correlation processor.

PubMed

Edwards, R T

1999-10-01

I present an architecture for acoustic pattern classification using trinary-trinary template correlation. In spite of its computational simplicity, the algorithm and architecture represent a method which greatly reduces bandwidth of the input, storage requirements of the classifier memory, and power consumption of the system without compromising classification accuracy. The linear system should be amenable to training using recently-developed methods such as Independent Component Analysis (ICA), and we predict that behavior will be qualitatively similar to that of structures in the auditory cortex.
Smart photonic networks and computer security for image data

NASA Astrophysics Data System (ADS)

Campello, Jorge; Gill, John T.; Morf, Martin; Flynn, Michael J.

1998-02-01

Work reported here is part of a larger project on 'Smart Photonic Networks and Computer Security for Image Data', studying the interactions of coding and security, switching architecture simulations, and basic technologies. Coding and security: coding methods that are appropriate for data security in data fusion networks were investigated. These networks have several characteristics that distinguish them form other currently employed networks, such as Ethernet LANs or the Internet. The most significant characteristics are very high maximum data rates; predominance of image data; narrowcasting - transmission of data form one source to a designated set of receivers; data fusion - combining related data from several sources; simple sensor nodes with limited buffering. These characteristics affect both the lower level network design and the higher level coding methods.Data security encompasses privacy, integrity, reliability, and availability. Privacy, integrity, and reliability can be provided through encryption and coding for error detection and correction. Availability is primarily a network issue; network nodes must be protected against failure or routed around in the case of failure. One of the more promising techniques is the use of 'secret sharing'. We consider this method as a special case of our new space-time code diversity based algorithms for secure communication. These algorithms enable us to exploit parallelism and scalable multiplexing schemes to build photonic network architectures. A number of very high-speed switching and routing architectures and their relationships with very high performance processor architectures were studied. Indications are that routers for very high speed photonic networks can be designed using the very robust and distributed TCP/IP protocol, if suitable processor architecture support is available.
Parallel fuzzy connected image segmentation on GPU

PubMed Central

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

2011-01-01

Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA’s compute unified device Architecture (cuda) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as cuda kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037
Parallel fuzzy connected image segmentation on GPU.

PubMed

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

2011-07-01

Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's compute unified device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms.

PubMed

Kohlhoff, Kai J; Sosnick, Marc H; Hsu, William T; Pande, Vijay S; Altman, Russ B

2011-08-15

Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures. CAMPAIGN is a library of data clustering algorithms and tools, written in 'C for CUDA' for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource. New modules from the community will be accepted into the library and the layout of it is such that it can easily be extended to promising future platforms such as OpenCL. Releases of the CAMPAIGN library are freely available for download under the LGPL from https://simtk.org/home/campaign. Source code can also be obtained through anonymous subversion access as described on https://simtk.org/scm/?group_id=453. kjk33@cantab.net.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

NASA Astrophysics Data System (ADS)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Parallelized Seeded Region Growing Using CUDA

PubMed Central

Park, Seongjin; Lee, Hyunna; Seo, Jinwook; Lee, Kyoung Ho; Shin, Yeong-Gil; Kim, Bohyoung

2014-01-01

This paper presents a novel method for parallelizing the seeded region growing (SRG) algorithm using Compute Unified Device Architecture (CUDA) technology, with intention to overcome the theoretical weakness of SRG algorithm of its computation time being directly proportional to the size of a segmented region. The segmentation performance of the proposed CUDA-based SRG is compared with SRG implementations on single-core CPUs, quad-core CPUs, and shader language programming, using synthetic datasets and 20 body CT scans. Based on the experimental results, the CUDA-based SRG outperforms the other three implementations, advocating that it can substantially assist the segmentation during massive CT screening tests. PMID:25309619
Aspects of GPU perfomance in algorithms with random memory access

NASA Astrophysics Data System (ADS)

Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.

2017-10-01

The numerical code for solving the Boltzmann equation on the hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that on Tesla K40 accelerators computational performance drops dramatically with increase of percentage of occupied GPU memory. Testing revealed that memory access time increases tens of times after certain critical percentage of memory is occupied. Moreover, it seems to be the common problem of all NVidia's GPUs arising from its architecture. Few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on the splitting the memory into "virtual" blocks, resulted in 2.5 times speed up.
Performance of GeantV EM Physics Models

NASA Astrophysics Data System (ADS)

Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2017-10-01

The recent progress in parallel hardware architectures with deeper vector pipelines or many-cores technologies brings opportunities for HEP experiments to take advantage of SIMD and SIMT computing models. Launched in 2013, the GeantV project studies performance gains in propagating multiple particles in parallel, improving instruction throughput and data locality in HEP event simulation on modern parallel hardware architecture. Due to the complexity of geometry description and physics algorithms of a typical HEP application, performance analysis is indispensable in identifying factors limiting parallel execution. In this report, we will present design considerations and preliminary computing performance of GeantV physics models on coprocessors (Intel Xeon Phi and NVidia GPUs) as well as on mainstream CPUs.
A real-time architecture for time-aware agents.

PubMed

Prouskas, Konstantinos-Vassileios; Pitt, Jeremy V

2004-06-01

This paper describes the specification and implementation of a new three-layer time-aware agent architecture. This architecture is designed for applications and environments where societies of humans and agents play equally active roles, but interact and operate in completely different time frames. The architecture consists of three layers: the April real-time run-time (ART) layer, the time aware layer (TAL), and the application agents layer (AAL). The ART layer forms the underlying real-time agent platform. An original online, real-time, dynamic priority-based scheduling algorithm is described for scheduling the computation time of agent processes, and it is shown that the algorithm's O(n) complexity and scalable performance are sufficient for application in real-time domains. The TAL layer forms an abstraction layer through which human and agent interactions are temporally unified, that is, handled in a common way irrespective of their temporal representation and scale. A novel O(n2) interaction scheduling algorithm is described for predicting and guaranteeing interactions' initiation and completion times. The time-aware predicting component of a workflow management system is also presented as an instance of the AAL layer. The described time-aware architecture addresses two key challenges in enabling agents to be effectively configured and applied in environments where humans and agents play equally active roles. It provides flexibility and adaptability in its real-time mechanisms while placing them under direct agent control, and it temporally unifies human and agent interactions.
Parallel processors and nonlinear structural dynamics algorithms and software

NASA Technical Reports Server (NTRS)

Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

1989-01-01

The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.
Distributed chemical computing using ChemStar: an open source java remote method invocation architecture applied to large scale molecular data from PubChem.

PubMed

Karthikeyan, M; Krishnan, S; Pandey, Anil Kumar; Bender, Andreas; Tropsha, Alexander

2008-04-01

We present the application of a Java remote method invocation (RMI) based open source architecture to distributed chemical computing. This architecture was previously employed for distributed data harvesting of chemical information from the Internet via the Google application programming interface (API; ChemXtreme). Due to its open source character and its flexibility, the underlying server/client framework can be quickly adopted to virtually every computational task that can be parallelized. Here, we present the server/client communication framework as well as an application to distributed computing of chemical properties on a large scale (currently the size of PubChem; about 18 million compounds), using both the Marvin toolkit as well as the open source JOELib package. As an application, for this set of compounds, the agreement of log P and TPSA between the packages was compared. Outliers were found to be mostly non-druglike compounds and differences could usually be explained by differences in the underlying algorithms. ChemStar is the first open source distributed chemical computing environment built on Java RMI, which is also easily adaptable to user demands due to its "plug-in architecture". The complete source codes as well as calculated properties along with links to PubChem resources are available on the Internet via a graphical user interface at http://moltable.ncl.res.in/chemstar/.
Practical considerations in experimental computational sensing

NASA Astrophysics Data System (ADS)

Poon, Phillip K.

Computational sensing has demonstrated the ability to ameliorate or eliminate many trade-offs in traditional sensors. Rather than attempting to form a perfect image, then sampling at the Nyquist rate, and reconstructing the signal of interest prior to post-processing, the computational sensor attempts to utilize a priori knowledge, active or passive coding of the signal-of-interest combined with a variety of algorithms to overcome the trade-offs or to improve various task-specific metrics. While it is a powerful approach to radically new sensor architectures, published research tends to focus on architecture concepts and positive results. Little attention is given towards the practical issues when faced with implementing computational sensing prototypes. I will discuss the various practical challenges that I encountered while developing three separate applications of computational sensors. The first is a compressive sensing based object tracking camera, the SCOUT, which exploits the sparsity of motion between consecutive frames while using no moving parts to create a psuedo-random shift variant point-spread function. The second is a spectral imaging camera, the AFSSI-C, which uses a modified version of Principal Component Analysis with a Bayesian strategy to adaptively design spectral filters for direct spectral classification using a digital micro-mirror device (DMD) based architecture. The third demonstrates two separate architectures to perform spectral unmixing by using an adaptive algorithm or a hybrid techniques of using Maximum Noise Fraction and random filter selection from a liquid crystal on silicon based computational spectral imager, the LCSI. All of these applications demonstrate a variety of challenges that have been addressed or continue to challenge the computational sensing community. One issue is calibration, since many computational sensors require an inversion step and in the case of compressive sensing, lack of redundancy in the measurement data. Another issue is over multiplexing, as more light is collected per sample, the finite amount of dynamic range and quantization resolution can begin to degrade the recovery of the relevant information. A priori knowledge of the sparsity and or other statistics of the signal or noise is often used by computational sensors to outperform their isomorphic counterparts. This is demonstrated in all three of the sensors I have developed. These challenges and others will be discussed using a case-study approach through these three applications.
Parallel Algorithms for Computer Vision

DTIC Science & Technology

1990-04-01

NA86-1, Thinking Machines Corporation, Cambridge, MA, December 1986. [43] J. Little, G. Blelloch, and T. Cass. How to program the connection machine for... to program the connection machine for computer vision. In Proc. Workshop on Comp. Architecture for Pattern Analysis and Machine Intell., 1987. [92] J...In Proceedings of SPIE Conf. on Advances in Intelligent Robotics Systems, Bellingham, VA, 1987. SPIE. [91] J. Little, G. Blelloch, and T. Cass. How
Superior Generalization Capability of Hardware-Learing Algorithm Developed for Self-Learning Neuron-MOS Neural Networks

NASA Astrophysics Data System (ADS)

Kondo, Shuhei; Shibata, Tadashi; Ohmi, Tadahiro

1995-02-01

We have investigated the learning performance of the hardware backpropagation (HBP) algorithm, a hardware-oriented learning algorithm developed for the self-learning architecture of neural networks constructed using neuron MOS (metal-oxide-semiconductor) transistors. The solution to finding a mirror symmetry axis in a 4×4 binary pixel array was tested by computer simulation based on the HBP algorithm. Despite the inherent restrictions imposed on the hardware-learning algorithm, HBP exhibits equivalent learning performance to that of the original backpropagation (BP) algorithm when all the pertinent parameters are optimized. Very importantly, we have found that HBP has a superior generalization capability over BP; namely, HBP exhibits higher performance in solving problems that the network has not yet learnt.
Exascale computing and what it means for shock physics

NASA Astrophysics Data System (ADS)

Germann, Timothy

2015-06-01

The U.S. Department of Energy is preparing to launch an Exascale Computing Initiative, to address the myriad challenges required to deploy and effectively utilize an exascale-class supercomputer (i.e., one capable of performing 1018 operations per second) in the 2023 timeframe. Since physical (power dissipation) requirements limit clock rates to at most a few GHz, this will necessitate the coordination of on the order of a billion concurrent operations, requiring sophisticated system and application software, and underlying mathematical algorithms, that may differ radically from traditional approaches. Even at the smaller workstation or cluster level of computation, the massive concurrency and heterogeneity within each processor will impact computational scientists. Through the multi-institutional, multi-disciplinary Exascale Co-design Center for Materials in Extreme Environments (ExMatEx), we have initiated an early and deep collaboration between domain (computational materials) scientists, applied mathematicians, computer scientists, and hardware architects, in order to establish the relationships between algorithms, software stacks, and architectures needed to enable exascale-ready materials science application codes within the next decade. In my talk, I will discuss these challenges, and what it will mean for exascale-era electronic structure, molecular dynamics, and engineering-scale simulations of shock-compressed condensed matter. In particular, we anticipate that the emerging hierarchical, heterogeneous architectures can be exploited to achieve higher physical fidelity simulations using adaptive physics refinement. This work is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research.
Petascale Many Body Methods for Complex Correlated Systems

NASA Astrophysics Data System (ADS)

Pruschke, Thomas

2012-02-01

Correlated systems constitute an important class of materials in modern condensed matter physics. Correlation among electrons are at the heart of all ordering phenomena and many intriguing novel aspects, such as quantum phase transitions or topological insulators, observed in a variety of compounds. Yet, theoretically describing these phenomena is still a formidable task, even if one restricts the models used to the smallest possible set of degrees of freedom. Here, modern computer architectures play an essential role, and the joint effort to devise efficient algorithms and implement them on state-of-the art hardware has become an extremely active field in condensed-matter research. To tackle this task single-handed is quite obviously not possible. The NSF-OISE funded PIRE collaboration ``Graduate Education and Research in Petascale Many Body Methods for Complex Correlated Systems'' is a successful initiative to bring together leading experts around the world to form a virtual international organization for addressing these emerging challenges and educate the next generation of computational condensed matter physicists. The collaboration includes research groups developing novel theoretical tools to reliably and systematically study correlated solids, experts in efficient computational algorithms needed to solve the emerging equations, and those able to use modern heterogeneous computer architectures to make then working tools for the growing community.
Parallel O(log n) algorithms for open- and closed-chain rigid multibody systems based on a new mass matrix factorization technique

NASA Technical Reports Server (NTRS)

Fijany, Amir

1993-01-01

In this paper, parallel O(log n) algorithms for computation of rigid multibody dynamics are developed. These parallel algorithms are derived by parallelization of new O(n) algorithms for the problem. The underlying feature of these O(n) algorithms is a drastically different strategy for decomposition of interbody force which leads to a new factorization of the mass matrix (M). Specifically, it is shown that a factorization of the inverse of the mass matrix in the form of the Schur Complement is derived as M(exp -1) = C - B(exp *)A(exp -1)B, wherein matrices C, A, and B are block tridiagonal matrices. The new O(n) algorithm is then derived as a recursive implementation of this factorization of M(exp -1). For the closed-chain systems, similar factorizations and O(n) algorithms for computation of Operational Space Mass Matrix lambda and its inverse lambda(exp -1) are also derived. It is shown that these O(n) algorithms are strictly parallel, that is, they are less efficient than other algorithms for serial computation of the problem. But, to our knowledge, they are the only known algorithms that can be parallelized and that lead to both time- and processor-optimal parallel algorithms for the problem, i.e., parallel O(log n) algorithms with O(n) processors. The developed parallel algorithms, in addition to their theoretical significance, are also practical from an implementation point of view due to their simple architectural requirements.

Acceleration of the Particle Swarm Optimization for Peierls-Nabarro modeling of dislocations in conventional and high-entropy alloys

NASA Astrophysics Data System (ADS)

Pei, Zongrui; Eisenbach, Markus

2017-06-01

Dislocations are among the most important defects in determining the mechanical properties of both conventional alloys and high-entropy alloys. The Peierls-Nabarro model supplies an efficient pathway to their geometries and mobility. The difficulty in solving the integro-differential Peierls-Nabarro equation is how to effectively avoid the local minima in the energy landscape of a dislocation core. Among the other methods to optimize the dislocation core structures, we choose the algorithm of Particle Swarm Optimization, an algorithm that simulates the social behaviors of organisms. By employing more particles (bigger swarm) and more iterative steps (allowing them to explore for longer time), the local minima can be effectively avoided. But this would require more computational cost. The advantage of this algorithm is that it is readily parallelized in modern high computing architecture. We demonstrate the performance of our parallelized algorithm scales linearly with the number of employed cores.
Computational Performance of Intel MIC, Sandy Bridge, and GPU Architectures: Implementation of a 1D c++/OpenMP Electrostatic Particle-In-Cell Code

DTIC Science & Technology

2014-05-01

fusion, space and astrophysical plasmas, but still the general picture can be presented quite well with the fluid approach [6, 7]. The microscopic...purpose computing CPU for algorithms where processing of large blocks of data is done in parallel. The reason for that is the GPU’s highly effective...parallel structure. Most of the image and video processing computations involve heavy matrix and vector op- erations over large amounts of data and
Genomecmp: computer software to detect genomic rearrangements using markers

NASA Astrophysics Data System (ADS)

Kulawik, Maciej; Nowak, Robert M.

2017-08-01

Detection of genomics rearrangements is a tough task, because of the size of data to be processed. As genome sequences may consist of hundreds of millions symbols, it is not only practically impossible to compare them by hand, but it is also complex problem for computer software. The way to significantly accelerate the process is to use rearrangement detection algorithm based on unique short sequences called markers. The algorithm described in this paper develops markers using base genome and find the markers positions on other genome. The algorithm has been extended by support for ambiguity symbols. Web application with graphical user interface has been created using three-layer architecture, where users could run the task simultaneously. The accuracy and efficiency of proposed solution has been studied using generated and real data.
The monitoring and managing application of cloud computing based on Internet of Things.

PubMed

Luo, Shiliang; Ren, Bin

2016-07-01

Cloud computing and the Internet of Things are the two hot points in the Internet application field. The application of the two new technologies is in hot discussion and research, but quite less on the field of medical monitoring and managing application. Thus, in this paper, we study and analyze the application of cloud computing and the Internet of Things on the medical field. And we manage to make a combination of the two techniques in the medical monitoring and managing field. The model architecture for remote monitoring cloud platform of healthcare information (RMCPHI) was established firstly. Then the RMCPHI architecture was analyzed. Finally an efficient PSOSAA algorithm was proposed for the medical monitoring and managing application of cloud computing. Simulation results showed that our proposed scheme can improve the efficiency about 50%. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Computer-aided system design

NASA Technical Reports Server (NTRS)

Walker, Carrie K.

1991-01-01

A technique has been developed for combining features of a systems architecture design and assessment tool and a software development tool. This technique reduces simulation development time and expands simulation detail. The Architecture Design and Assessment System (ADAS), developed at the Research Triangle Institute, is a set of computer-assisted engineering tools for the design and analysis of computer systems. The ADAS system is based on directed graph concepts and supports the synthesis and analysis of software algorithms mapped to candidate hardware implementations. Greater simulation detail is provided by the ADAS functional simulator. With the functional simulator, programs written in either Ada or C can be used to provide a detailed description of graph nodes. A Computer-Aided Software Engineering tool developed at the Charles Stark Draper Laboratory (CSDL CASE) automatically generates Ada or C code from engineering block diagram specifications designed with an interactive graphical interface. A technique to use the tools together has been developed, which further automates the design process.
Learning, epigenetics, and computation: An extension on Fitch's proposal. Comment on “Toward a computational framework for cognitive biology: Unifying approaches from cognitive neuroscience and comparative cognition” by W. Tecumseh Fitch

NASA Astrophysics Data System (ADS)

Okanoya, Kazuo

2014-09-01

The comparative computational approach of Fitch [1] attempts to renew the classical David Marr paradigm of computation, algorithm, and implementation, by introducing evolutionary view of the relationship between neural architecture and cognition. This comparative evolutionary view provides constraints useful in narrowing down the problem space for both cognition and neural mechanisms. I will provide two examples from our own studies that reinforce and extend Fitch's proposal.
Modern Computational Techniques for the HMMER Sequence Analysis

PubMed Central

2013-01-01

This paper focuses on the latest research and critical reviews on modern computing architectures, software and hardware accelerated algorithms for bioinformatics data analysis with an emphasis on one of the most important sequence analysis applications—hidden Markov models (HMM). We show the detailed performance comparison of sequence analysis tools on various computing platforms recently developed in the bioinformatics society. The characteristics of the sequence analysis, such as data and compute-intensive natures, make it very attractive to optimize and parallelize by using both traditional software approach and innovated hardware acceleration technologies. PMID:25937944
Computing NLTE Opacities -- Node Level Parallel Calculation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Holladay, Daniel

Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
Adapting a Navier-Stokes code to the ICL-DAP

NASA Technical Reports Server (NTRS)

Grosch, C. E.

1985-01-01

The results of an experiment are reported, i.c., to adapt a Navier-Stokes code, originally developed on a serial computer, to concurrent processing on the CL Distributed Array Processor (DAP). The algorithm used in solving the Navier-Stokes equations is briefly described. The architecture of the DAP and DAP FORTRAN are also described. The modifications of the algorithm so as to fit the DAP are given and discussed. Finally, performance results are given and conclusions are drawn.
Cascaded VLSI neural network architecture for on-line learning

NASA Technical Reports Server (NTRS)

Thakoor, Anilkumar P. (Inventor); Duong, Tuan A. (Inventor); Daud, Taher (Inventor)

1992-01-01

High-speed, analog, fully-parallel, and asynchronous building blocks are cascaded for larger sizes and enhanced resolution. A hardware compatible algorithm permits hardware-in-the-loop learning despite limited weight resolution. A computation intensive feature classification application was demonstrated with this flexible hardware and new algorithm at high speed. This result indicates that these building block chips can be embedded as an application specific coprocessor for solving real world problems at extremely high data rates.
Terascale spectral element algorithms and implementations.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fischer, P. F.; Tufo, H. M.

1999-08-17

We describe the development and implementation of an efficient spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. We review basic and recently developed algorithmic underpinnings that have resulted in good parallel and vector performance on a broad range of architectures, including the terascale computing systems now coming online at the DOE labs. Sustained performance of 219 GFLOPS has been recently achieved on 2048 nodes of the Intel ASCI-Red machine at Sandia.
Hardware Architectures for Data-Intensive Computing Problems: A Case Study for String Matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

DNA analysis is an emerging application of high performance bioinformatic. Modern sequencing machinery are able to provide, in few hours, large input streams of data, which needs to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and fastly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems alsomore » include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges on current high performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limit of the underlining hardware, is required to reach the desired high throughputs. In this paper, we discuss the implementation of the Aho-Corasick algorithm for GPU-accelerated high performance systems. We present an optimized implementation of Aho-Corasick for GPUs and discuss its tradeoffs on the Tesla T10 and he new Tesla T20 (codename Fermi) GPUs. We then integrate the optimized GPU code, respectively, in a MPI-based and in a pthreads-based load balancer to enable execution of the algorithm on clusters and large sharedmemory multiprocessors (SMPs) accelerated with multiple GPUs.« less
MC64-ClustalWP2: A Highly-Parallel Hybrid Strategy to Align Multiple Sequences in Many-Core Architectures

PubMed Central

Díaz, David; Esteban, Francisco J.; Hernández, Pilar; Caballero, Juan Antonio; Guevara, Antonio

2014-01-01

We have developed the MC64-ClustalWP2 as a new implementation of the Clustal W algorithm, integrating a novel parallelization strategy and significantly increasing the performance when aligning long sequences in architectures with many cores. It must be stressed that in such a process, the detailed analysis of both the software and hardware features and peculiarities is of paramount importance to reveal key points to exploit and optimize the full potential of parallelism in many-core CPU systems. The new parallelization approach has focused into the most time-consuming stages of this algorithm. In particular, the so-called progressive alignment has drastically improved the performance, due to a fine-grained approach where the forward and backward loops were unrolled and parallelized. Another key approach has been the implementation of the new algorithm in a hybrid-computing system, integrating both an Intel Xeon multi-core CPU and a Tilera Tile64 many-core card. A comparison with other Clustal W implementations reveals the high-performance of the new algorithm and strategy in many-core CPU architectures, in a scenario where the sequences to align are relatively long (more than 10 kb) and, hence, a many-core GPU hardware cannot be used. Thus, the MC64-ClustalWP2 runs multiple alignments more than 18x than the original Clustal W algorithm, and more than 7x than the best x86 parallel implementation to date, being publicly available through a web service. Besides, these developments have been deployed in cost-effective personal computers and should be useful for life-science researchers, including the identification of identities and differences for mutation/polymorphism analyses, biodiversity and evolutionary studies and for the development of molecular markers for paternity testing, germplasm management and protection, to assist breeding, illegal traffic control, fraud prevention and for the protection of the intellectual property (identification/traceability), including the protected designation of origin, among other applications. PMID:24710354
Quantum neural network-based EEG filtering for a brain-computer interface.

PubMed

Gandhi, Vaibhav; Prasad, Girijesh; Coyle, Damien; Behera, Laxmidhar; McGinnity, Thomas Martin

2014-02-01

A novel neural information processing architecture inspired by quantum mechanics and incorporating the well-known Schrodinger wave equation is proposed in this paper. The proposed architecture referred to as recurrent quantum neural network (RQNN) can characterize a nonstationary stochastic signal as time-varying wave packets. A robust unsupervised learning algorithm enables the RQNN to effectively capture the statistical behavior of the input signal and facilitates the estimation of signal embedded in noise with unknown characteristics. The results from a number of benchmark tests show that simple signals such as dc, staircase dc, and sinusoidal signals embedded within high noise can be accurately filtered and particle swarm optimization can be employed to select model parameters. The RQNN filtering procedure is applied in a two-class motor imagery-based brain-computer interface where the objective was to filter electroencephalogram (EEG) signals before feature extraction and classification to increase signal separability. A two-step inner-outer fivefold cross-validation approach is utilized to select the algorithm parameters subject-specifically for nine subjects. It is shown that the subject-specific RQNN EEG filtering significantly improves brain-computer interface performance compared to using only the raw EEG or Savitzky-Golay filtered EEG across multiple sessions.
Profugus

DOE Office of Scientific and Technical Information (OSTI.GOV)

Evans, Thomas; Hamilton, Steven; Slattery, Stuart

Profugus is an open-source mini-application (mini-app) for radiation transport and reactor applications. It contains the fundamental computational kernels used in the Exnihilo code suite from Oak Ridge National Laboratory. However, Exnihilo is production code with a substantial user base. Furthermore, Exnihilo is export controlled. This makes collaboration with computer scientists and computer engineers difficult. Profugus is designed to bridge that gap. By encapsulating the core numerical algorithms in an abbreviated code base that is open-source, computer scientists can analyze the algorithms and easily make code-architectural changes to test performance without compromising the production code values of Exnihilo. Profugus is notmore » meant to be production software with respect to problem analysis. The computational kernels in Profugus are designed to analyze performance, not correctness. Nonetheless, users of Profugus can setup and run problems with enough real-world features to be useful as proof-of-concept for actual production work.« less
Wavelet subband coding of computer simulation output using the A++ array class library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bradley, J.N.; Brislawn, C.M.; Quinlan, D.J.

1995-07-01

The goal of the project is to produce utility software for off-line compression of existing data and library code that can be called from a simulation program for on-line compression of data dumps as the simulation proceeds. Naturally, we would like the amount of CPU time required by the compression algorithm to be small in comparison to the requirements of typical simulation codes. We also want the algorithm to accomodate a wide variety of smooth, multidimensional data types. For these reasons, the subband vector quantization (VQ) approach employed in has been replaced by a scalar quantization (SQ) strategy using amore » bank of almost-uniform scalar subband quantizers in a scheme similar to that used in the FBI fingerprint image compression standard. This eliminates the considerable computational burdens of training VQ codebooks for each new type of data and performing nearest-vector searches to encode the data. The comparison of subband VQ and SQ algorithms in indicated that, in practice, there is relatively little additional gain from using vector as opposed to scalar quantization on DWT subbands, even when the source imagery is from a very homogeneous population, and our subjective experience with synthetic computer-generated data supports this stance. It appears that a careful study is needed of the tradeoffs involved in selecting scalar vs. vector subband quantization, but such an analysis is beyond the scope of this paper. Our present work is focused on the problem of generating wavelet transform/scalar quantization (WSQ) implementations that can be ported easily between different hardware environments. This is an extremely important consideration given the great profusion of different high-performance computing architectures available, the high cost associated with learning how to map algorithms effectively onto a new architecture, and the rapid rate of evolution in the world of high-performance computing.« less
Space station data system analysis/architecture study. Task 2: Options development DR-5. Volume 1: Technology options

NASA Technical Reports Server (NTRS)

1985-01-01

The second task in the Space Station Data System (SSDS) Analysis/Architecture Study is the development of an information base that will support the conduct of trade studies and provide sufficient data to make key design/programmatic decisions. This volume identifies the preferred options in the technology category and characterizes these options with respect to performance attributes, constraints, cost, and risk. The technology category includes advanced materials, processes, and techniques that can be used to enhance the implementation of SSDS design structures. The specific areas discussed are mass storage, including space and round on-line storage and off-line storage; man/machine interface; data processing hardware, including flight computers and advanced/fault tolerant computer architectures; and software, including data compression algorithms, on-board high level languages, and software tools. Also discussed are artificial intelligence applications and hard-wire communications.
Efficient parallel architecture for highly coupled real-time linear system applications

NASA Technical Reports Server (NTRS)

Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo

1988-01-01

A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.
Diamond Eye: a distributed architecture for image data mining

NASA Astrophysics Data System (ADS)

Burl, Michael C.; Fowlkes, Charless; Roden, Joe; Stechert, Andre; Mukhtar, Saleem

1999-02-01

Diamond Eye is a distributed software architecture, which enables users (scientists) to analyze large image collections by interacting with one or more custom data mining servers via a Java applet interface. Each server is coupled with an object-oriented database and a computational engine, such as a network of high-performance workstations. The database provides persistent storage and supports querying of the 'mined' information. The computational engine provides parallel execution of expensive image processing, object recognition, and query-by-content operations. Key benefits of the Diamond Eye architecture are: (1) the design promotes trial evaluation of advanced data mining and machine learning techniques by potential new users (all that is required is to point a web browser to the appropriate URL), (2) software infrastructure that is common across a range of science mining applications is factored out and reused, and (3) the system facilitates closer collaborations between algorithm developers and domain experts.
A Disciplined Architectural Approach to Scaling Data Analysis for Massive, Scientific Data

NASA Astrophysics Data System (ADS)

Crichton, D. J.; Braverman, A. J.; Cinquini, L.; Turmon, M.; Lee, H.; Law, E.

2014-12-01

Data collections across remote sensing and ground-based instruments in astronomy, Earth science, and planetary science are outpacing scientists' ability to analyze them. Furthermore, the distribution, structure, and heterogeneity of the measurements themselves pose challenges that limit the scalability of data analysis using traditional approaches. Methods for developing science data processing pipelines, distribution of scientific datasets, and performing analysis will require innovative approaches that integrate cyber-infrastructure, algorithms, and data into more systematic approaches that can more efficiently compute and reduce data, particularly distributed data. This requires the integration of computer science, machine learning, statistics and domain expertise to identify scalable architectures for data analysis. The size of data returned from Earth Science observing satellites and the magnitude of data from climate model output, is predicted to grow into the tens of petabytes challenging current data analysis paradigms. This same kind of growth is present in astronomy and planetary science data. One of the major challenges in data science and related disciplines defining new approaches to scaling systems and analysis in order to increase scientific productivity and yield. Specific needs include: 1) identification of optimized system architectures for analyzing massive, distributed data sets; 2) algorithms for systematic analysis of massive data sets in distributed environments; and 3) the development of software infrastructures that are capable of performing massive, distributed data analysis across a comprehensive data science framework. NASA/JPL has begun an initiative in data science to address these challenges. Our goal is to evaluate how scientific productivity can be improved through optimized architectural topologies that identify how to deploy and manage the access, distribution, computation, and reduction of massive, distributed data, while managing the uncertainties of scientific conclusions derived from such capabilities. This talk will provide an overview of JPL's efforts in developing a comprehensive architectural approach to data science.

Designing, programming, and optimizing a (small) quantum computer

NASA Astrophysics Data System (ADS)

Svore, Krysta

In 1982, Richard Feynman proposed to use a computer founded on the laws of quantum physics to simulate physical systems. In the more than thirty years since, quantum computers have shown promise to solve problems in number theory, chemistry, and materials science that would otherwise take longer than the lifetime of the universe to solve on an exascale classical machine. The practical realization of a quantum computer requires understanding and manipulating subtle quantum states while experimentally controlling quantum interference. It also requires an end-to-end software architecture for programming, optimizing, and implementing a quantum algorithm on the quantum device hardware. In this talk, we will introduce recent advances in connecting abstract theory to present-day real-world applications through software. We will highlight recent advancement of quantum algorithms and the challenges in ultimately performing a scalable solution on a quantum device.
Fuzzy control of magnetic bearings

NASA Technical Reports Server (NTRS)

Feeley, J. J.; Niederauer, G. M.; Ahlstrom, D. J.

1991-01-01

The use of an adaptive fuzzy control algorithm implemented on a VLSI chip for the control of a magnetic bearing was considered. The architecture of the adaptive fuzzy controller is similar to that of a neural network. The performance of the fuzzy controller is compared to that of a conventional controller by computer simulation.
Group implicit concurrent algorithms in nonlinear structural dynamics

NASA Technical Reports Server (NTRS)

Ortiz, M.; Sotelino, E. D.

1989-01-01

During the 70's and 80's, considerable effort was devoted to developing efficient and reliable time stepping procedures for transient structural analysis. Mathematically, the equations governing this type of problems are generally stiff, i.e., they exhibit a wide spectrum in the linear range. The algorithms best suited to this type of applications are those which accurately integrate the low frequency content of the response without necessitating the resolution of the high frequency modes. This means that the algorithms must be unconditionally stable, which in turn rules out explicit integration. The most exciting possibility in the algorithms development area in recent years has been the advent of parallel computers with multiprocessing capabilities. So, this work is mainly concerned with the development of parallel algorithms in the area of structural dynamics. A primary objective is to devise unconditionally stable and accurate time stepping procedures which lend themselves to an efficient implementation in concurrent machines. Some features of the new computer architecture are summarized. A brief survey of current efforts in the area is presented. A new class of concurrent procedures, or Group Implicit algorithms is introduced and analyzed. The numerical simulation shows that GI algorithms hold considerable promise for application in coarse grain as well as medium grain parallel computers.
A Distributed Prognostic Health Management Architecture

NASA Technical Reports Server (NTRS)

Bhaskar, Saha; Saha, Sankalita; Goebel, Kai

2009-01-01

This paper introduces a generic distributed prognostic health management (PHM) architecture with specific application to the electrical power systems domain. Current state-of-the-art PHM systems are mostly centralized in nature, where all the processing is reliant on a single processor. This can lead to loss of functionality in case of a crash of the central processor or monitor. Furthermore, with increases in the volume of sensor data as well as the complexity of algorithms, traditional centralized systems become unsuitable for successful deployment, and efficient distributed architectures are required. A distributed architecture though, is not effective unless there is an algorithmic framework to take advantage of its unique abilities. The health management paradigm envisaged here incorporates a heterogeneous set of system components monitored by a varied suite of sensors and a particle filtering (PF) framework that has the power and the flexibility to adapt to the different diagnostic and prognostic needs. Both the diagnostic and prognostic tasks are formulated as a particle filtering problem in order to explicitly represent and manage uncertainties; however, typically the complexity of the prognostic routine is higher than the computational power of one computational element ( CE). Individual CEs run diagnostic routines until the system variable being monitored crosses beyond a nominal threshold, upon which it coordinates with other networked CEs to run the prognostic routine in a distributed fashion. Implementation results from a network of distributed embedded devices monitoring a prototypical aircraft electrical power system are presented, where the CEs are Sun Microsystems Small Programmable Object Technology (SPOT) devices.
Parallel Clustering Algorithm for Large-Scale Biological Data Sets

PubMed Central

Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang

2014-01-01

Backgrounds Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Result A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
Task scheduling in dataflow computer architectures

NASA Technical Reports Server (NTRS)

Katsinis, Constantine

1994-01-01

Dataflow computers provide a platform for the solution of a large class of computational problems, which includes digital signal processing and image processing. Many typical applications are represented by a set of tasks which can be repetitively executed in parallel as specified by an associated dataflow graph. Research in this area aims to model these architectures, develop scheduling procedures, and predict the transient and steady state performance. Researchers at NASA have created a model and developed associated software tools which are capable of analyzing a dataflow graph and predicting its runtime performance under various resource and timing constraints. These models and tools were extended and used in this work. Experiments using these tools revealed certain properties of such graphs that require further study. Specifically, the transient behavior at the beginning of the execution of a graph can have a significant effect on the steady state performance. Transformation and retiming of the application algorithm and its initial conditions can produce a different transient behavior and consequently different steady state performance. The effect of such transformations on the resource requirements or under resource constraints requires extensive study. Task scheduling to obtain maximum performance (based on user-defined criteria), or to satisfy a set of resource constraints, can also be significantly affected by a transformation of the application algorithm. Since task scheduling is performed by heuristic algorithms, further research is needed to determine if new scheduling heuristics can be developed that can exploit such transformations. This work has provided the initial development for further long-term research efforts. A simulation tool was completed to provide insight into the transient and steady state execution of a dataflow graph. A set of scheduling algorithms was completed which can operate in conjunction with the modeling and performance tools previously developed. Initial studies on the performance of these algorithms were done to examine the effects of application algorithm transformations as measured by such quantities as number of processors, time between outputs, time between input and output, communication time, and memory size.
A real-time control system for the control of suspended interferometers based on hybrid computing techniques

NASA Astrophysics Data System (ADS)

Acernese, Fausto; Barone, Fabrizio; De Rosa, Rosario; Eleuteri, Antonio; Milano, Leopoldo; Pardi, Silvio; Ricciardi, Iolanda; Russo, Guido

2004-09-01

One of the main requirements of a digital system for the control of interferometric detectors of gravitational waves is the computing power, that is a direct consequence of the increasing complexity of the digital algorithms necessary for the control signals generation. For this specific task many specialized non standard real-time architectures have been developed, often very expensive and difficult to upgrade. On the other hand, such computing power is generally fully available for off-line applications on standard Pc based systems. Therefore, a possible and obvious solution may be provided by the integration of both the real-time and off-line architecture resulting in a hybrid control system architecture based on standards available components, trying to get both the advantages of the perfect data synchronization provided by the real-time systems and by the large computing power available on Pc based systems. Such integration may be provided by the implementation of the link between the two different architectures through the standard Ethernet network, whose data transfer speed is largely increasing in these years, using the TCP/IP, UDP and raw Ethernet protocols. In this paper we describe the architecture of an hybrid Ethernet based real-time control system prototype we implemented in Napoli, discussing its characteristics and performances. Finally we discuss a possible application to the real-time control of a suspended mass of the mode cleaner of the 3m prototype optical interferometer for gravitational wave detection (IDGW-3P) operational in Napoli.
Loops in hierarchical channel networks

NASA Astrophysics Data System (ADS)

Katifori, Eleni; Magnasco, Marcelo

2012-02-01

Nature provides us with many examples of planar distribution and structural networks having dense sets of closed loops. An archetype of this form of network organization is the vasculature of dicotyledonous leaves, which showcases a hierarchically-nested architecture. Although a number of methods have been proposed to measure aspects of the structure of such networks, a robust metric to quantify their hierarchical organization is still lacking. We present an algorithmic framework that allows mapping loopy networks to binary trees, preserving in the connectivity of the trees the architecture of the original graph. We apply this framework to investigate computer generated and natural graphs extracted from digitized images of dicotyledonous leaves and animal vasculature. We calculate various metrics on the corresponding trees and discuss the relationship of these quantities to the architectural organization of the original graphs. This algorithmic framework decouples the geometric information from the metric topology (connectivity and edge weight) and it ultimately allows us to perform a quantitative statistical comparison between predictions of theoretical models and naturally occurring loopy graphs.
Parallel VLSI architecture emulation and the organization of APSA/MPP

NASA Technical Reports Server (NTRS)

Odonnell, John T.

1987-01-01

The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.
Implementation of a fully-balanced periodic tridiagonal solver on a parallel distributed memory architecture

NASA Technical Reports Server (NTRS)

Eidson, T. M.; Erlebacher, G.

1994-01-01

While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solutions is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm allow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
A computational model for simulating text comprehension.

PubMed

Lemaire, Benoît; Denhière, Guy; Bellissens, Cédrick; Jhean-Larose, Sandra

2006-11-01

In the present article, we outline the architecture of a computer program for simulating the process by which humans comprehend texts. The program is based on psycholinguistic theories about human memory and text comprehension processes, such as the construction-integration model (Kintsch, 1998), the latent semantic analysis theory of knowledge representation (Landauer & Dumais, 1997), and the predication algorithms (Kintsch, 2001; Lemaire & Bianco, 2003), and it is intended to help psycholinguists investigate the way humans comprehend texts.
A parameter estimation algorithm for LFM/BPSK hybrid modulated signal intercepted by Nyquist folding receiver

NASA Astrophysics Data System (ADS)

Qiu, Zhaoyang; Wang, Pei; Zhu, Jun; Tang, Bin

2016-12-01

Nyquist folding receiver (NYFR) is a novel ultra-wideband receiver architecture which can realize wideband receiving with a small amount of equipment. Linear frequency modulated/binary phase shift keying (LFM/BPSK) hybrid modulated signal is a novel kind of low probability interception signal with wide bandwidth. The NYFR is an effective architecture to intercept the LFM/BPSK signal and the LFM/BPSK signal intercepted by the NYFR will add the local oscillator modulation. A parameter estimation algorithm for the NYFR output signal is proposed. According to the NYFR prior information, the chirp singular value ratio spectrum is proposed to estimate the chirp rate. Then, based on the output self-characteristic, matching component function is designed to estimate Nyquist zone (NZ) index. Finally, matching code and subspace method are employed to estimate the phase change points and code length. Compared with the existing methods, the proposed algorithm has a better performance. It also has no need to construct a multi-channel structure, which means the computational complexity for the NZ index estimation is small. The simulation results demonstrate the efficacy of the proposed algorithm.
Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

NASA Astrophysics Data System (ADS)

Qin, Cheng-Zhi; Zhan, Lijun

2012-06-01

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
Direct migration motion estimation and mode decision to decoder for a low-complexity decoder Wyner-Ziv video coding

NASA Astrophysics Data System (ADS)

Lei, Ted Chih-Wei; Tseng, Fan-Shuo

2017-07-01

This paper addresses the problem of high-computational complexity decoding in traditional Wyner-Ziv video coding (WZVC). The key focus is the migration of two traditionally high-computationally complex encoder algorithms, namely motion estimation and mode decision. In order to reduce the computational burden in this process, the proposed architecture adopts the partial boundary matching algorithm and four flexible types of block mode decision at the decoder. This approach does away with the need for motion estimation and mode decision at the encoder. The experimental results show that the proposed padding block-based WZVC not only decreases decoder complexity to approximately one hundredth that of the state-of-the-art DISCOVER decoding but also outperforms DISCOVER codec by up to 3 to 4 dB.
Critical Problems in Very Large Scale Computer Systems

DTIC Science & Technology

1988-09-30

253-6043 Srinivas Devadas (617) 253-0454 Thomas F. Knight, Jr. (617) 253-7807 F. Thomson Leighton (617) 253-3662 Charles E. Leiserson (617) 253-5833...J. Keen, P. Nuth, J. Larivee, and B . Totty, "Message-Driven Processor Architecture," MIT VLSI Memo No. 88-468, August 1988. *W. J. Dally and A. A...losses and gains) which are the first polynomial-time combinatorial algorithms for this problem. One algorithm runs in O(n2m2 lg 2 n Ig B ) time and the
Evaluation of SuperLU on multicore architectures

NASA Astrophysics Data System (ADS)

Li, X. S.

2008-07-01

The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate performance of different high-performance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We included both Pthreads and MPI implementations in this study and found that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior.
A software simulation study of a (255,223) Reed-Solomon encoder-decoder

NASA Technical Reports Server (NTRS)

Pollara, F.

1985-01-01

A set of software programs which simulates a (255,223) Reed-Solomon encoder/decoder pair is described. The transform decoder algorithm uses a modified Euclid algorithm, and closely follows the pipeline architecture proposed for the hardware decoder. Uncorrectable error patterns are detected by a simple test, and the inverse transform is computed by a finite field FFT. Numerical examples of the decoder operation are given for some test codewords, with and without errors. The use of the software package is briefly described.
Hypercube matrix computation task

NASA Technical Reports Server (NTRS)

Calalo, Ruel H.; Imbriale, William A.; Jacobi, Nathan; Liewer, Paulett C.; Lockhart, Thomas G.; Lyzenga, Gregory A.; Lyons, James R.; Manshadi, Farzin; Patterson, Jean E.

1988-01-01

A major objective of the Hypercube Matrix Computation effort at the Jet Propulsion Laboratory (JPL) is to investigate the applicability of a parallel computing architecture to the solution of large-scale electromagnetic scattering problems. Three scattering analysis codes are being implemented and assessed on a JPL/California Institute of Technology (Caltech) Mark 3 Hypercube. The codes, which utilize different underlying algorithms, give a means of evaluating the general applicability of this parallel architecture. The three analysis codes being implemented are a frequency domain method of moments code, a time domain finite difference code, and a frequency domain finite elements code. These analysis capabilities are being integrated into an electromagnetics interactive analysis workstation which can serve as a design tool for the construction of antennas and other radiating or scattering structures. The first two years of work on the Hypercube Matrix Computation effort is summarized. It includes both new developments and results as well as work previously reported in the Hypercube Matrix Computation Task: Final Report for 1986 to 1987 (JPL Publication 87-18).
Real-time skeleton tracking for embedded systems

NASA Astrophysics Data System (ADS)

Coleca, Foti; Klement, Sascha; Martinetz, Thomas; Barth, Erhardt

2013-03-01

Touch-free gesture technology is beginning to become more popular with consumers and may have a significant future impact on interfaces for digital photography. However, almost every commercial software framework for gesture and pose detection is aimed at either desktop PCs or high-powered GPUs, making mobile implementations for gesture recognition an attractive area for research and development. In this paper we present an algorithm for hand skeleton tracking and gesture recognition that runs on an ARM-based platform (Pandaboard ES, OMAP 4460 architecture). The algorithm uses self-organizing maps to fit a given topology (skeleton) into a 3D point cloud. This is a novel way of approaching the problem of pose recognition as it does not employ complex optimization techniques or data-based learning. After an initial background segmentation step, the algorithm is ran in parallel with heuristics, which detect and correct artifacts arising from insufficient or erroneous input data. We then optimize the algorithm for the ARM platform using fixed-point computation and the NEON SIMD architecture the OMAP4460 provides. We tested the algorithm with two different depth-sensing devices (Microsoft Kinect, PMD Camboard). For both input devices we were able to accurately track the skeleton at the native framerate of the cameras.
Parallelization of a blind deconvolution algorithm

NASA Astrophysics Data System (ADS)

Matson, Charles L.; Borelli, Kathy J.

2006-09-01

Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.

Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

2017-10-20

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP tomore » exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.« less
Efficient Geometric Sound Propagation Using Visibility Culling

NASA Astrophysics Data System (ADS)

Chandak, Anish

2011-07-01

Simulating propagation of sound can improve the sense of realism in interactive applications such as video games and can lead to better designs in engineering applications such as architectural acoustics. In this thesis, we present geometric sound propagation techniques which are faster than prior methods and map well to upcoming parallel multi-core CPUs. We model specular reflections by using the image-source method and model finite-edge diffraction by using the well-known Biot-Tolstoy-Medwin (BTM) model. We accelerate the computation of specular reflections by applying novel visibility algorithms, FastV and AD-Frustum, which compute visibility from a point. We accelerate finite-edge diffraction modeling by applying a novel visibility algorithm which computes visibility from a region. Our visibility algorithms are based on frustum tracing and exploit recent advances in fast ray-hierarchy intersections, data-parallel computations, and scalable, multi-core algorithms. The AD-Frustum algorithm adapts its computation to the scene complexity and allows small errors in computing specular reflection paths for higher computational efficiency. FastV and our visibility algorithm from a region are general, object-space, conservative visibility algorithms that together significantly reduce the number of image sources compared to other techniques while preserving the same accuracy. Our geometric propagation algorithms are an order of magnitude faster than prior approaches for modeling specular reflections and two to ten times faster for modeling finite-edge diffraction. Our algorithms are interactive, scale almost linearly on multi-core CPUs, and can handle large, complex, and dynamic scenes. We also compare the accuracy of our sound propagation algorithms with other methods. Once sound propagation is performed, it is desirable to listen to the propagated sound in interactive and engineering applications. We can generate smooth, artifact-free output audio signals by applying efficient audio-processing algorithms. We also present the first efficient audio-processing algorithm for scenarios with simultaneously moving source and moving receiver (MS-MR) which incurs less than 25% overhead compared to static source and moving receiver (SS-MR) or moving source and static receiver (MS-SR) scenario.
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

The real-time processing is getting more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool. The watershed transform is a very data intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on the survey of the approaches for parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processor with shared memory. To achieve an efficient parallel implementation, it's necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelization of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare the OpenMP (an application programming interface for multi-Processing) with Ptheads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Method and system for training dynamic nonlinear adaptive filters which have embedded memory

NASA Technical Reports Server (NTRS)

Rabinowitz, Matthew (Inventor)

2002-01-01

Described herein is a method and system for training nonlinear adaptive filters (or neural networks) which have embedded memory. Such memory can arise in a multi-layer finite impulse response (FIR) architecture, or an infinite impulse response (IIR) architecture. We focus on filter architectures with separate linear dynamic components and static nonlinear components. Such filters can be structured so as to restrict their degrees of computational freedom based on a priori knowledge about the dynamic operation to be emulated. The method is detailed for an FIR architecture which consists of linear FIR filters together with nonlinear generalized single layer subnets. For the IIR case, we extend the methodology to a general nonlinear architecture which uses feedback. For these dynamic architectures, we describe how one can apply optimization techniques which make updates closer to the Newton direction than those of a steepest descent method, such as backpropagation. We detail a novel adaptive modified Gauss-Newton optimization technique, which uses an adaptive learning rate to determine both the magnitude and direction of update steps. For a wide range of adaptive filtering applications, the new training algorithm converges faster and to a smaller value of cost than both steepest-descent methods such as backpropagation-through-time, and standard quasi-Newton methods. We apply the algorithm to modeling the inverse of a nonlinear dynamic tracking system 5, as well as a nonlinear amplifier 6.
Wasatch: An architecture-proof multiphysics development environment using a Domain Specific Language and graph theory

DOE Office of Scientific and Technical Information (OSTI.GOV)

Saad, Tony; Sutherland, James C.

To address the coding and software challenges of modern hybrid architectures, we propose an approach to multiphysics code development for high-performance computing. This approach is based on using a Domain Specific Language (DSL) in tandem with a directed acyclic graph (DAG) representation of the problem to be solved that allows runtime algorithm generation. When coupled with a large-scale parallel framework, the result is a portable development framework capable of executing on hybrid platforms and handling the challenges of multiphysics applications. In addition, we share our experience developing a code in such an environment – an effort that spans an interdisciplinarymore » team of engineers and computer scientists.« less
Wasatch: An architecture-proof multiphysics development environment using a Domain Specific Language and graph theory

DOE PAGES

Saad, Tony; Sutherland, James C.

2016-05-04

To address the coding and software challenges of modern hybrid architectures, we propose an approach to multiphysics code development for high-performance computing. This approach is based on using a Domain Specific Language (DSL) in tandem with a directed acyclic graph (DAG) representation of the problem to be solved that allows runtime algorithm generation. When coupled with a large-scale parallel framework, the result is a portable development framework capable of executing on hybrid platforms and handling the challenges of multiphysics applications. In addition, we share our experience developing a code in such an environment – an effort that spans an interdisciplinarymore » team of engineers and computer scientists.« less
Vehicle Maneuver Detection with Accelerometer-Based Classification.

PubMed

Cervantes-Villanueva, Javier; Carrillo-Zapata, Daniel; Terroso-Saenz, Fernando; Valdes-Vela, Mercedes; Skarmeta, Antonio F

2016-09-29

In the mobile computing era, smartphones have become instrumental tools to develop innovative mobile context-aware systems. In that sense, their usage in the vehicular domain eases the development of novel and personal transportation solutions. In this frame, the present work introduces an innovative mechanism to perceive the current kinematic state of a vehicle on the basis of the accelerometer data from a smartphone mounted in the vehicle. Unlike previous proposals, the introduced architecture targets the computational limitations of such devices to carry out the detection process following an incremental approach. For its realization, we have evaluated different classification algorithms to act as agents within the architecture. Finally, our approach has been tested with a real-world dataset collected by means of the ad hoc mobile application developed.
An MPI-based MoSST core dynamics model

NASA Astrophysics Data System (ADS)

Jiang, Weiyuan; Kuang, Weijia

2008-09-01

Distributed systems are among the main cost-effective and expandable platforms for high-end scientific computing. Therefore scalable numerical models are important for effective use of such systems. In this paper, we present an MPI-based numerical core dynamics model for simulation of geodynamo and planetary dynamos, and for simulation of core-mantle interactions. The model is developed based on MPI libraries. Two algorithms are used for node-node communication: a "master-slave" architecture and a "divide-and-conquer" architecture. The former is easy to implement but not scalable in communication. The latter is scalable in both computation and communication. The model scalability is tested on Linux PC clusters with up to 128 nodes. This model is also benchmarked with a published numerical dynamo model solution.
Parallel-hierarchical processing and classification of laser beam profile images based on the GPU-oriented architecture

NASA Astrophysics Data System (ADS)

Yarovyi, Andrii A.; Timchenko, Leonid I.; Kozhemiako, Volodymyr P.; Kokriatskaia, Nataliya I.; Hamdi, Rami R.; Savchuk, Tamara O.; Kulyk, Oleksandr O.; Surtel, Wojciech; Amirgaliyev, Yedilkhan; Kashaganova, Gulzhan

2017-08-01

The paper deals with a problem of insufficient productivity of existing computer means for large image processing, which do not meet modern requirements posed by resource-intensive computing tasks of laser beam profiling. The research concentrated on one of the profiling problems, namely, real-time processing of spot images of the laser beam profile. Development of a theory of parallel-hierarchic transformation allowed to produce models for high-performance parallel-hierarchical processes, as well as algorithms and software for their implementation based on the GPU-oriented architecture using GPGPU technologies. The analyzed performance of suggested computerized tools for processing and classification of laser beam profile images allows to perform real-time processing of dynamic images of various sizes.
Building a robust vehicle detection and classification module

NASA Astrophysics Data System (ADS)

Grigoryev, Anton; Khanipov, Timur; Koptelov, Ivan; Bocharov, Dmitry; Postnikov, Vassily; Nikolaev, Dmitry

2015-12-01

The growing adoption of intelligent transportation systems (ITS) and autonomous driving requires robust real-time solutions for various event and object detection problems. Most of real-world systems still cannot rely on computer vision algorithms and employ a wide range of costly additional hardware like LIDARs. In this paper we explore engineering challenges encountered in building a highly robust visual vehicle detection and classification module that works under broad range of environmental and road conditions. The resulting technology is competitive to traditional non-visual means of traffic monitoring. The main focus of the paper is on software and hardware architecture, algorithm selection and domain-specific heuristics that help the computer vision system avoid implausible answers.
Prevention of Malicious Nodes Communication in MANETs by Using Authorized Tokens

NASA Astrophysics Data System (ADS)

Chandrakant, N.; Shenoy, P. Deepa; Venugopal, K. R.; Patnaik, L. M.

A rapid increase of wireless networks and mobile computing applications has changed the landscape of network security. A MANET is more susceptible to the attacks than wired network. As a result, attacks with malicious intent have been and will be devised to take advantage of these vulnerabilities and to cripple the MANET operation. Hence we need to search for new architecture and mechanisms to protect the wireless networks and mobile computing applications. In this paper, we examine the nodes that come under the vicinity of base node and members of the network and communication is provided to genuine nodes only. It is found that the proposed algorithm is a effective algorithm for security in MANETs.
The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service Using GPGPU

NASA Astrophysics Data System (ADS)

Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.

2013-08-01

Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIAs Compute Unified Device Architecture (CUDA).
An effective detection algorithm for region duplication forgery in digital images

NASA Astrophysics Data System (ADS)

Yavuz, Fatih; Bal, Abdullah; Cukur, Huseyin

2016-04-01

Powerful image editing tools are very common and easy to use these days. This situation may cause some forgeries by adding or removing some information on the digital images. In order to detect these types of forgeries such as region duplication, we present an effective algorithm based on fixed-size block computation and discrete wavelet transform (DWT). In this approach, the original image is divided into fixed-size blocks, and then wavelet transform is applied for dimension reduction. Each block is processed by Fourier Transform and represented by circle regions. Four features are extracted from each block. Finally, the feature vectors are lexicographically sorted, and duplicated image blocks are detected according to comparison metric results. The experimental results show that the proposed algorithm presents computational efficiency due to fixed-size circle block architecture.
Neural architecture design based on extreme learning machine.

PubMed

Bueno-Crespo, Andrés; García-Laencina, Pedro J; Sancho-Gómez, José-Luis

2013-12-01

Selection of the optimal neural architecture to solve a pattern classification problem entails to choose the relevant input units, the number of hidden neurons and its corresponding interconnection weights. This problem has been widely studied in many research works but their solutions usually involve excessive computational cost in most of the problems and they do not provide a unique solution. This paper proposes a new technique to efficiently design the MultiLayer Perceptron (MLP) architecture for classification using the Extreme Learning Machine (ELM) algorithm. The proposed method provides a high generalization capability and a unique solution for the architecture design. Moreover, the selected final network only retains those input connections that are relevant for the classification task. Experimental results show these advantages. Copyright © 2013 Elsevier Ltd. All rights reserved.
Advanced biologically plausible algorithms for low-level image processing

NASA Astrophysics Data System (ADS)

Gusakova, Valentina I.; Podladchikova, Lubov N.; Shaposhnikov, Dmitry G.; Markin, Sergey N.; Golovan, Alexander V.; Lee, Seong-Whan

1999-08-01

At present, in computer vision, the approach based on modeling the biological vision mechanisms is extensively developed. However, up to now, real world image processing has no effective solution in frameworks of both biologically inspired and conventional approaches. Evidently, new algorithms and system architectures based on advanced biological motivation should be developed for solution of computational problems related to this visual task. Basic problems that should be solved for creation of effective artificial visual system to process real world imags are a search for new algorithms of low-level image processing that, in a great extent, determine system performance. In the present paper, the result of psychophysical experiments and several advanced biologically motivated algorithms for low-level processing are presented. These algorithms are based on local space-variant filter, context encoding visual information presented in the center of input window, and automatic detection of perceptually important image fragments. The core of latter algorithm are using local feature conjunctions such as noncolinear oriented segment and composite feature map formation. Developed algorithms were integrated into foveal active vision model, the MARR. It is supposed that proposed algorithms may significantly improve model performance while real world image processing during memorizing, search, and recognition.
Exploring quantum computing application to satellite data assimilation

NASA Astrophysics Data System (ADS)

Cheung, S.; Zhang, S. Q.

2015-12-01

This is an exploring work on potential application of quantum computing to a scientific data optimization problem. On classical computational platforms, the physical domain of a satellite data assimilation problem is represented by a discrete variable transform, and classical minimization algorithms are employed to find optimal solution of the analysis cost function. The computation becomes intensive and time-consuming when the problem involves large number of variables and data. The new quantum computer opens a very different approach both in conceptual programming and in hardware architecture for solving optimization problem. In order to explore if we can utilize the quantum computing machine architecture, we formulate a satellite data assimilation experimental case in the form of quadratic programming optimization problem. We find a transformation of the problem to map it into Quadratic Unconstrained Binary Optimization (QUBO) framework. Binary Wavelet Transform (BWT) will be applied to the data assimilation variables for its invertible decomposition and all calculations in BWT are performed by Boolean operations. The transformed problem will be experimented as to solve for a solution of QUBO instances defined on Chimera graphs of the quantum computer.
Open architecture CMM motion controller

NASA Astrophysics Data System (ADS)

Chang, David; Spence, Allan D.; Bigg, Steve; Heslip, Joe; Peterson, John

2001-12-01

Although initially the only Coordinate Measuring Machine (CMM) sensor available was a touch trigger probe, technological advances in sensors and computing have greatly increased the variety of available inspection sensors. Non-contact laser digitizers and analog scanning touch probes require very well tuned CMM motion control, as well as an extensible, open architecture interface. This paper describes the implementation of a retrofit CMM motion controller designed for open architecture interface to a variety of sensors. The controller is based on an Intel Pentium microcomputer and a Servo To Go motion interface electronics card. Motor amplifiers, safety, and additional interface electronics are housed in a separate enclosure. Host Signal Processing (HSP) is used for the motion control algorithm. Compared to the usual host plus DSP architecture, single CPU HSP simplifies integration with the various sensors, and implementation of software geometric error compensation. Motion control tuning is accomplished using a remote computer via 100BaseTX Ethernet. A Graphical User Interface (GUI) is used to enter geometric error compensation data, and to optimize the motion control tuning parameters. It is shown that this architecture achieves the required real time motion control response, yet is much easier to extend to additional sensors.
Bioinspired architecture approach for a one-billion transistor smart CMOS camera chip

NASA Astrophysics Data System (ADS)

Fey, Dietmar; Komann, Marcus

2007-05-01

In the paper we present a massively parallel VLSI architecture for future smart CMOS camera chips with up to one billion transistors. To exploit efficiently the potential offered by future micro- or nanoelectronic devices traditional on central structures oriented parallel architectures based on MIMD or SIMD approaches will fail. They require too long and too many global interconnects for the distribution of code or the access to common memory. On the other hand nature developed self-organising and emergent principles to manage successfully complex structures based on lots of interacting simple elements. Therefore we developed a new as Marching Pixels denoted emergent computing paradigm based on a mixture of bio-inspired computing models like cellular automaton and artificial ants. In the paper we present different Marching Pixels algorithms and the corresponding VLSI array architecture. A detailed synthesis result for a 0.18 μm CMOS process shows that a 256×256 pixel image is processed in less than 10 ms assuming a moderate 100 MHz clock rate for the processor array. Future higher integration densities and a 3D chip stacking technology will allow the integration and processing of Mega pixels within the same time since our architecture is fully scalable.
Associative architecture for image processing

NASA Astrophysics Data System (ADS)

Adar, Rutie; Akerib, Avidan

1997-09-01

This article presents a new generation in parallel processing architecture for real-time image processing. The approach is implemented in a real time image processor chip, called the XiumTM-2, based on combining a fully associative array which provides the parallel engine with a serial RISC core on the same die. The architecture is fully programmable and can be programmed to implement a wide range of color image processing, computer vision and media processing functions in real time. The associative part of the chip is based on patented pending methodology of Associative Computing Ltd. (ACL), which condenses 2048 associative processors, each of 128 'intelligent' bits. Each bit can be a processing bit or a memory bit. At only 33 MHz and 0.6 micron manufacturing technology process, the chip has a computational power of 3 billion ALU operations per second and 66 billion string search operations per second. The fully programmable nature of the XiumTM-2 chip enables developers to use ACL tools to write their own proprietary algorithms combined with existing image processing and analysis functions from ACL's extended set of libraries.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Haoqiang; anMey, Dieter; Hatay, Ferhat F.

2003-01-01

With the advent of parallel hardware and software technologies users are faced with the challenge to choose a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. In this study we will compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.

The density matrix renormalization group algorithm on kilo-processor architectures: Implementation and trade-offs

NASA Astrophysics Data System (ADS)

Nemes, Csaba; Barcza, Gergely; Nagy, Zoltán; Legeza, Örs; Szolgay, Péter

2014-06-01

In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures. In the paper a smart hybrid CPU-GPU implementation is presented, which exploits the power of both CPU and GPU and tolerates problems exceeding the GPU memory size. Furthermore, a new CUDA kernel has been designed for asymmetric matrix-vector multiplication to accelerate the rest of the diagonalization. Besides the evaluation of the GPU implementation, the practical limits of an FPGA implementation are also discussed.
Dependency graph for code analysis on emerging architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shashkov, Mikhail Jurievich; Lipnikov, Konstantin

Direct acyclic dependency (DAG) graph is becoming the standard for modern multi-physics codes.The ideal DAG is the true block-scheme of a multi-physics code. Therefore, it is the convenient object for insitu analysis of the cost of computations and algorithmic bottlenecks related to statistical frequent data motion and dymanical machine state.
Requirements for fault-tolerant factoring on an atom-optics quantum computer.

PubMed

Devitt, Simon J; Stephens, Ashley M; Munro, William J; Nemoto, Kae

2013-01-01

Quantum information processing and its associated technologies have reached a pivotal stage in their development, with many experiments having established the basic building blocks. Moving forward, the challenge is to scale up to larger machines capable of performing computational tasks not possible today. This raises questions that need to be urgently addressed, such as what resources these machines will consume and how large will they be. Here we estimate the resources required to execute Shor's factoring algorithm on an atom-optics quantum computer architecture. We determine the runtime and size of the computer as a function of the problem size and physical error rate. Our results suggest that once the physical error rate is low enough to allow quantum error correction, optimization to reduce resources and increase performance will come mostly from integrating algorithms and circuits within the error correction environment, rather than from improving the physical hardware.
Architectural Techniques For Managing Non-volatile Caches

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh

As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors and hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM poses serious obstacles in its use for designing large on-chip caches and for this reason, researchers are exploring non-volatile memory (NVM) devices, such as spin torque transfer RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making themmore » a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help the reader gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.« less
CUDA-based real time surgery simulation.

PubMed

Liu, Youquan; De, Suvranu

2008-01-01

In this paper we present a general software platform that enables real time surgery simulation on the newly available compute unified device architecture (CUDA)from NVIDIA. CUDA-enabled GPUs harness the power of 128 processors which allow data parallel computations. Compared to the previous GPGPU, it is significantly more flexible with a C language interface. We report implementation of both collision detection and consequent deformation computation algorithms. Our test results indicate that the CUDA enables a twenty times speedup for collision detection and about fifteen times speedup for deformation computation on an Intel Core 2 Quad 2.66 GHz machine with GeForce 8800 GTX.
Proceedings of the 2013 International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering - M and C 2013

DOE Office of Scientific and Technical Information (OSTI.GOV)

NONE

2013-07-01

The Mathematics and Computation Division of the American Nuclear (ANS) and the Idaho Section of the ANS hosted the 2013 International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering (M and C 2013). This proceedings contains over 250 full papers with topics ranging from reactor physics; radiation transport; materials science; nuclear fuels; core performance and optimization; reactor systems and safety; fluid dynamics; medical applications; analytical and numerical methods; algorithms for advanced architectures; and validation verification, and uncertainty quantification.
Practical Application of Model-based Programming and State-based Architecture to Space Missions

NASA Technical Reports Server (NTRS)

Horvath, Gregory; Ingham, Michel; Chung, Seung; Martin, Oliver; Williams, Brian

2006-01-01

A viewgraph presentation to develop models from systems engineers that accomplish mission objectives and manage the health of the system is shown. The topics include: 1) Overview; 2) Motivation; 3) Objective/Vision; 4) Approach; 5) Background: The Mission Data System; 6) Background: State-based Control Architecture System; 7) Background: State Analysis; 8) Overview of State Analysis; 9) Background: MDS Software Frameworks; 10) Background: Model-based Programming; 10) Background: Titan Model-based Executive; 11) Model-based Execution Architecture; 12) Compatibility Analysis of MDS and Titan Architectures; 13) Integrating Model-based Programming and Execution into the Architecture; 14) State Analysis and Modeling; 15) IMU Subsystem State Effects Diagram; 16) Titan Subsystem Model: IMU Health; 17) Integrating Model-based Programming and Execution into the Software IMU; 18) Testing Program; 19) Computationally Tractable State Estimation & Fault Diagnosis; 20) Diagnostic Algorithm Performance; 21) Integration and Test Issues; 22) Demonstrated Benefits; and 23) Next Steps
Fog computing job scheduling optimization based on bees swarm

NASA Astrophysics Data System (ADS)

Bitam, Salim; Zeadally, Sherali; Mellouk, Abdelhamid

2018-04-01

Fog computing is a new computing architecture, composed of a set of near-user edge devices called fog nodes, which collaborate together in order to perform computational services such as running applications, storing an important amount of data, and transmitting messages. Fog computing extends cloud computing by deploying digital resources at the premise of mobile users. In this new paradigm, management and operating functions, such as job scheduling aim at providing high-performance, cost-effective services requested by mobile users and executed by fog nodes. We propose a new bio-inspired optimization approach called Bees Life Algorithm (BLA) aimed at addressing the job scheduling problem in the fog computing environment. Our proposed approach is based on the optimized distribution of a set of tasks among all the fog computing nodes. The objective is to find an optimal tradeoff between CPU execution time and allocated memory required by fog computing services established by mobile users. Our empirical performance evaluation results demonstrate that the proposal outperforms the traditional particle swarm optimization and genetic algorithm in terms of CPU execution time and allocated memory.
Accelerating scientific computations with mixed precision algorithms

NASA Astrophysics Data System (ADS)

Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack; Kurzak, Jakub; Langou, Julie; Langou, Julien; Luszczek, Piotr; Tomov, Stanimire

2009-12-01

On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. Program summaryProgram title: ITER-REF Catalogue identifier: AECO_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 7211 No. of bytes in distributed program, including test data, etc.: 41 862 Distribution format: tar.gz Programming language: FORTRAN 77 Computer: desktop, server Operating system: Unix/Linux RAM: 512 Mbytes Classification: 4.8 External routines: BLAS (optional) Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability resulting in a factorization PA=LU, where P is a permutation matrix. The solution for the system is achieved by first solving Ly=Pb (forward substitution) and then solving Ux=y (backward substitution). Due to round-off errors, the computed solution, x, carries a numerical error magnified by the condition number of the coefficient matrix A. In order to improve the computed solution, an iterative process can be applied, which produces a correction to the computed solution at each iteration, which then yields the method that is commonly known as the iterative refinement algorithm. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision. Running time: seconds/minutes
Scan line graphics generation on the massively parallel processor

NASA Technical Reports Server (NTRS)

Dorband, John E.

1988-01-01

Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.
Digital Waveguide Architectures for Virtual Musical Instruments

NASA Astrophysics Data System (ADS)

Smith, Julius O.

Digital sound synthesis has become a standard staple of modern music studios, videogames, personal computers, and hand-held devices. As processing power has increased over the years, sound synthesis implementations have evolved from dedicated chip sets, to single-chip solutions, and ultimately to software implementations within processors used primarily for other tasks (such as for graphics or general purpose computing). With the cost of implementation dropping closer and closer to zero, there is increasing room for higher quality algorithms.
An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

2015-07-01

This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for current class of multicore architecture. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism however, when it comes to 3D data migration, as the data size increases the resource requirement of the algorithm also increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of Kirchhoff depth migration algorithm is the calculation of traveltime tables due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
Matrix-vector multiplication using digital partitioning for more accurate optical computing

NASA Technical Reports Server (NTRS)

Gary, C. K.

1992-01-01

Digital partitioning offers a flexible means of increasing the accuracy of an optical matrix-vector processor. This algorithm can be implemented with the same architecture required for a purely analog processor, which gives optical matrix-vector processors the ability to perform high-accuracy calculations at speeds comparable with or greater than electronic computers as well as the ability to perform analog operations at a much greater speed. Digital partitioning is compared with digital multiplication by analog convolution, residue number systems, and redundant number representation in terms of the size and the speed required for an equivalent throughput as well as in terms of the hardware requirements. Digital partitioning and digital multiplication by analog convolution are found to be the most efficient alogrithms if coding time and hardware are considered, and the architecture for digital partitioning permits the use of analog computations to provide the greatest throughput for a single processor.
Robotic disaster recovery efforts with ad-hoc deployable cloud computing

NASA Astrophysics Data System (ADS)

Straub, Jeremy; Marsh, Ronald; Mohammad, Atif F.

2013-06-01

Autonomous operations of search and rescue (SaR) robots is an ill posed problem, which is complexified by the dynamic disaster recovery environment. In a typical SaR response scenario, responder robots will require different levels of processing capabilities during various parts of the response effort and will need to utilize multiple algorithms. Placing these capabilities onboard the robot is a mediocre solution that precludes algorithm specific performance optimization and results in mediocre performance. Architecture for an ad-hoc, deployable cloud environment suitable for use in a disaster response scenario is presented. Under this model, each service provider is optimized for the task and maintains a database of situation-relevant information. This service-oriented architecture (SOA 3.0) compliant framework also serves as an example of the efficient use of SOA 3.0 in an actual cloud application.
Parallel mutual information estimation for inferring gene regulatory networks on GPUs

PubMed Central

2011-01-01

Background Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced B-spline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity. Results We present a new approach to accelerate the B-spline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multi-threaded implementation on a quad-core CPU for large microarray datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time. Conclusions CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multi-threaded implementation by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:21672264
Dynamic virtual machine allocation policy in cloud computing complying with service level agreement using CloudSim

NASA Astrophysics Data System (ADS)

Aneri, Parikh; Sumathy, S.

2017-11-01

Cloud computing provides services over the internet and provides application resources and data to the users based on their demand. Base of the Cloud Computing is consumer provider model. Cloud provider provides resources which consumer can access using cloud computing model in order to build their application based on their demand. Cloud data center is a bulk of resources on shared pool architecture for cloud user to access. Virtualization is the heart of the Cloud computing model, it provides virtual machine as per application specific configuration and those applications are free to choose their own configuration. On one hand, there is huge number of resources and on other hand it has to serve huge number of requests effectively. Therefore, resource allocation policy and scheduling policy play very important role in allocation and managing resources in this cloud computing model. This paper proposes the load balancing policy using Hungarian algorithm. Hungarian Algorithm provides dynamic load balancing policy with a monitor component. Monitor component helps to increase cloud resource utilization by managing the Hungarian algorithm by monitoring its state and altering its state based on artificial intelligent. CloudSim used in this proposal is an extensible toolkit and it simulates cloud computing environment.
Reconfigurable vision system for real-time applications

NASA Astrophysics Data System (ADS)

Torres-Huitzil, Cesar; Arias-Estrada, Miguel

2002-03-01

Recently, a growing community of researchers has used reconfigurable systems to solve computationally intensive problems. Reconfigurability provides optimized processors for systems on chip designs, and makes easy to import technology to a new system through reusable modules. The main objective of this work is the investigation of a reconfigurable computer system targeted for computer vision and real-time applications. The system is intended to circumvent the inherent computational load of most window-based computer vision algorithms. It aims to build a system for such tasks by providing an FPGA-based hardware architecture for task specific vision applications with enough processing power, using the minimum amount of hardware resources as possible, and a mechanism for building systems using this architecture. Regarding the software part of the system, a library of pre-designed and general-purpose modules that implement common window-based computer vision operations is being investigated. A common generic interface is established for these modules in order to define hardware/software components. These components can be interconnected to develop more complex applications, providing an efficient mechanism for transferring image and result data among modules. Some preliminary results are presented and discussed.
Multiscale high-order/low-order (HOLO) algorithms and applications

NASA Astrophysics Data System (ADS)

Chacón, L.; Chen, G.; Knoll, D. A.; Newman, C.; Park, H.; Taitano, W.; Willert, J. A.; Womeldorff, G.

2017-02-01

We review the state of the art in the formulation, implementation, and performance of so-called high-order/low-order (HOLO) algorithms for challenging multiscale problems. HOLO algorithms attempt to couple one or several high-complexity physical models (the high-order model, HO) with low-complexity ones (the low-order model, LO). The primary goal of HOLO algorithms is to achieve nonlinear convergence between HO and LO components while minimizing memory footprint and managing the computational complexity in a practical manner. Key to the HOLO approach is the use of the LO representations to address temporal stiffness, effectively accelerating the convergence of the HO/LO coupled system. The HOLO approach is broadly underpinned by the concept of nonlinear elimination, which enables segregation of the HO and LO components in ways that can effectively use heterogeneous architectures. The accuracy and efficiency benefits of HOLO algorithms are demonstrated with specific applications to radiation transport, gas dynamics, plasmas (both Eulerian and Lagrangian formulations), and ocean modeling. Across this broad application spectrum, HOLO algorithms achieve significant accuracy improvements at a fraction of the cost compared to conventional approaches. It follows that HOLO algorithms hold significant potential for high-fidelity system scale multiscale simulations leveraging exascale computing.
Special-purpose computer for holography HORN-4 with recurrence algorithm

NASA Astrophysics Data System (ADS)

Shimobaba, Tomoyoshi; Hishinuma, Sinsuke; Ito, Tomoyoshi

2002-10-01

We designed and built a special-purpose computer for holography, HORN-4 (HOlographic ReconstructioN) using PLD (Programmable Logic Device) technology. HORN computers have a pipeline architecture. We use HORN-4 as an attached processor to enhance the performance of a general-purpose computer when it is used to generate holograms using a "recurrence formulas" algorithm developed by our previous paper. In the HORN-4 system, we designed the pipeline by adopting our "recurrence formulas" algorithm which can calculate the phase on a hologram. As the result, we could integrate the pipeline composed of 21 units into one PLD chip. The units in the pipeline consists of one BPU (Basic Phase Unit) unit and twenty CU (Cascade Unit) units. These CU units can compute twenty light intensities on a hologram plane at one time. By mounting two of the PLD chips on a PCI (Peripheral Component Interconnect) universal board, HORN-4 can calculate holograms at high speed of about 42 Gflops equivalent. The cost of HORN-4 board is about 1700 US dollar. We could obtain 800×600 grids hologram from a 3D-image composed of 415 points in about 0.45 sec with the HORN-4 system.
Geocomputation over Hybrid Computer Architecture and Systems: Prior Works and On-going Initiatives at UARK

NASA Astrophysics Data System (ADS)

Shi, X.

2015-12-01

As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced infrastructure remains unexplored in this domain. Within this presentation, our prior and on-going initiatives will be summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs and MICs, to accelerate geocomputation in different applications.

A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological applications

NASA Astrophysics Data System (ADS)

Rueda, Antonio J.; Noguera, José M.; Luque, Adrián

2016-02-01

In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
ABINIT: Plane-Wave-Based Density-Functional Theory on High Performance Computers

NASA Astrophysics Data System (ADS)

Torrent, Marc

2014-03-01

For several years, a continuous effort has been produced to adapt electronic structure codes based on Density-Functional Theory to the future computing architectures. Among these codes, ABINIT is based on a plane-wave description of the wave functions which allows to treat systems of any kind. Porting such a code on petascale architectures pose difficulties related to the many-body nature of the DFT equations. To improve the performances of ABINIT - especially for what concerns standard LDA/GGA ground-state and response-function calculations - several strategies have been followed: A full multi-level parallelisation MPI scheme has been implemented, exploiting all possible levels and distributing both computation and memory. It allows to increase the number of distributed processes and could not be achieved without a strong restructuring of the code. The core algorithm used to solve the eigen problem (``Locally Optimal Blocked Congugate Gradient''), a Blocked-Davidson-like algorithm, is based on a distribution of processes combining plane-waves and bands. In addition to the distributed memory parallelization, a full hybrid scheme has been implemented, using standard shared-memory directives (openMP/openACC) or porting some comsuming code sections to Graphics Processing Units (GPU). As no simple performance model exists, the complexity of use has been increased; the code efficiency strongly depends on the distribution of processes among the numerous levels. ABINIT is able to predict the performances of several process distributions and automatically choose the most favourable one. On the other hand, a big effort has been carried out to analyse the performances of the code on petascale architectures, showing which sections of codes have to be improved; they all are related to Matrix Algebra (diagonalisation, orthogonalisation). The different strategies employed to improve the code scalability will be described. They are based on an exploration of new diagonalization algorithm, as well as the use of external optimized librairies. Part of this work has been supported by the european Prace project (PaRtnership for Advanced Computing in Europe) in the framework of its workpackage 8.
Large-Scale Image Analytics Using Deep Learning

NASA Astrophysics Data System (ADS)

Ganguly, S.; Nemani, R. R.; Basu, S.; Mukhopadhyay, S.; Michaelis, A.; Votava, P.

2014-12-01

High resolution land cover classification maps are needed to increase the accuracy of current Land ecosystem and climate model outputs. Limited studies are in place that demonstrates the state-of-the-art in deriving very high resolution (VHR) land cover products. In addition, most methods heavily rely on commercial softwares that are difficult to scale given the region of study (e.g. continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agricultural Imaging Program (NAIP) dataset for the Continental United States at a spatial resolution of 1-m. This data comes as image tiles (a total of quarter million image scenes with ~60 million pixels) and has a total size of ~100 terabytes for a single acquisition. Features extracted from the entire dataset would amount to ~8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted on the principles of "deep learning" to delineate the percentage of tree cover. In order to perform image analytics in such a granular system, it is mandatory to devise an intelligent archiving and query system for image retrieval, file structuring, metadata processing and filtering of all available image scenes. Using the Open NASA Earth Exchange (NEX) initiative, which is a partnership with Amazon Web Services (AWS), we have developed an end-to-end architecture for designing the database and the deep belief network (following the distbelief computing model) to solve a grand challenge of scaling this process across quarter million NAIP tiles that cover the entire Continental United States. The AWS core components that we use to solve this problem are DynamoDB along with S3 for database query and storage, ElastiCache shared memory architecture for image segmentation, Elastic Map Reduce (EMR) for image feature extraction, and the memory optimized Elastic Cloud Compute (EC2) for the learning algorithm.
Integration of Libration Point Orbit Dynamics into a Universal 3-D Autonomous Formation Flying Algorithm

NASA Technical Reports Server (NTRS)

Folta, David; Bauer, Frank H. (Technical Monitor)

2001-01-01

The autonomous formation flying control algorithm developed by the Goddard Space Flight Center (GSFC) for the New Millennium Program (NMP) Earth Observing-1 (EO-1) mission is investigated for applicability to libration point orbit formations. In the EO-1 formation-flying algorithm, control is accomplished via linearization about a reference transfer orbit with a state transition matrix (STM) computed from state inputs. The effect of libration point orbit dynamics on this algorithm architecture is explored via computation of STMs using the flight proven code, a monodromy matrix developed from a N-body model of a libration orbit, and a standard STM developed from the gravitational and coriolis effects as measured at the libration point. A comparison of formation flying Delta-Vs calculated from these methods is made to a standard linear quadratic regulator (LQR) method. The universal 3-D approach is optimal in the sense that it can be accommodated as an open-loop or closed-loop control using only state information.
Acceleration of the Particle Swarm Optimization for Peierls–Nabarro modeling of dislocations in conventional and high-entropy alloys

DOE PAGES

Pei, Zongrui; Max-Planck-Inst. fur Eisenforschung, Duseldorf; Eisenbach, Markus

2017-02-06

Dislocations are among the most important defects in determining the mechanical properties of both conventional alloys and high-entropy alloys. The Peierls-Nabarro model supplies an efficient pathway to their geometries and mobility. The difficulty in solving the integro-differential Peierls-Nabarro equation is how to effectively avoid the local minima in the energy landscape of a dislocation core. Among the other methods to optimize the dislocation core structures, we choose the algorithm of Particle Swarm Optimization, an algorithm that simulates the social behaviors of organisms. By employing more particles (bigger swarm) and more iterative steps (allowing them to explore for longer time), themore » local minima can be effectively avoided. But this would require more computational cost. The advantage of this algorithm is that it is readily parallelized in modern high computing architecture. We demonstrate the performance of our parallelized algorithm scales linearly with the number of employed cores.« less
Cloud Computing and Its Applications in GIS

NASA Astrophysics Data System (ADS)

Kang, Cao

2011-12-01

Cloud computing is a novel computing paradigm that offers highly scalable and highly available distributed computing services. The objectives of this research are to: 1. analyze and understand cloud computing and its potential for GIS; 2. discover the feasibilities of migrating truly spatial GIS algorithms to distributed computing infrastructures; 3. explore a solution to host and serve large volumes of raster GIS data efficiently and speedily. These objectives thus form the basis for three professional articles. The first article is entitled "Cloud Computing and Its Applications in GIS". This paper introduces the concept, structure, and features of cloud computing. Features of cloud computing such as scalability, parallelization, and high availability make it a very capable computing paradigm. Unlike High Performance Computing (HPC), cloud computing uses inexpensive commodity computers. The uniform administration systems in cloud computing make it easier to use than GRID computing. Potential advantages of cloud-based GIS systems such as lower barrier to entry are consequently presented. Three cloud-based GIS system architectures are proposed: public cloud- based GIS systems, private cloud-based GIS systems and hybrid cloud-based GIS systems. Public cloud-based GIS systems provide the lowest entry barriers for users among these three architectures, but their advantages are offset by data security and privacy related issues. Private cloud-based GIS systems provide the best data protection, though they have the highest entry barriers. Hybrid cloud-based GIS systems provide a compromise between these extremes. The second article is entitled "A cloud computing algorithm for the calculation of Euclidian distance for raster GIS". Euclidean distance is a truly spatial GIS algorithm. Classical algorithms such as the pushbroom and growth ring techniques require computational propagation through the entire raster image, which makes it incompatible with the distributed nature of cloud computing. This paper presents a parallel Euclidean distance algorithm that works seamlessly with the distributed nature of cloud computing infrastructures. The mechanism of this algorithm is to subdivide a raster image into sub-images and wrap them with a one pixel deep edge layer of individually computed distance information. Each sub-image is then processed by a separate node, after which the resulting sub-images are reassembled into the final output. It is shown that while any rectangular sub-image shape can be used, those approximating squares are computationally optimal. This study also serves as a demonstration of this subdivide and layer-wrap strategy, which would enable the migration of many truly spatial GIS algorithms to cloud computing infrastructures. However, this research also indicates that certain spatial GIS algorithms such as cost distance cannot be migrated by adopting this mechanism, which presents significant challenges for the development of cloud-based GIS systems. The third article is entitled "A Distributed Storage Schema for Cloud Computing based Raster GIS Systems". This paper proposes a NoSQL Database Management System (NDDBMS) based raster GIS data storage schema. NDDBMS has good scalability and is able to use distributed commodity computers, which make it superior to Relational Database Management Systems (RDBMS) in a cloud computing environment. In order to provide optimized data service performance, the proposed storage schema analyzes the nature of commonly used raster GIS data sets. It discriminates two categories of commonly used data sets, and then designs corresponding data storage models for both categories. As a result, the proposed storage schema is capable of hosting and serving enormous volumes of raster GIS data speedily and efficiently on cloud computing infrastructures. In addition, the scheme also takes advantage of the data compression characteristics of Quadtrees, thus promoting efficient data storage. Through this assessment of cloud computing technology, the exploration of the challenges and solutions to the migration of GIS algorithms to cloud computing infrastructures, and the examination of strategies for serving large amounts of GIS data in a cloud computing infrastructure, this dissertation lends support to the feasibility of building a cloud-based GIS system. However, there are still challenges that need to be addressed before a full-scale functional cloud-based GIS system can be successfully implemented. (Abstract shortened by UMI.)
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)

NASA Astrophysics Data System (ADS)

Vincenti, Henri

2016-03-01

The advent of exascale computers will enable 3D simulations of a new laser-plasma interaction regimes that were previously out of reach of current Petasale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potentialities of these new computing architectures. Indeed, achieving Exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments directly impacting our way of implementing PIC codes. As data movement (from die to network) is by far the most energy consuming part of an algorithm future computers will tend to increase memory locality at the hardware level and reduce energy consumption related to data movement by using more and more cores on each compute nodes (''fat nodes'') that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, CPU machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPU's also have a reduced clock speed per core and can process Multiple Instructions on Multiple Datas (MIMD). At the software level Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for Multicore/Manycore CPU) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high performance skeleton PIC code PICSAR to both achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the Pseudo-sepctral quasi-cylindrical code FBPIC on GPUs using the Numba python compiler.
Convolutional networks for fast, energy-efficient neuromorphic computing

PubMed Central

Esser, Steven K.; Merolla, Paul A.; Arthur, John V.; Cassidy, Andrew S.; Appuswamy, Rathinakumar; Andreopoulos, Alexander; Berg, David J.; McKinstry, Jeffrey L.; Melano, Timothy; Barch, Davis R.; di Nolfo, Carmelo; Datta, Pallab; Amir, Arnon; Taba, Brian; Flickner, Myron D.; Modha, Dharmendra S.

2016-01-01

Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware’s underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer. PMID:27651489
Convolutional networks for fast, energy-efficient neuromorphic computing.

PubMed

Esser, Steven K; Merolla, Paul A; Arthur, John V; Cassidy, Andrew S; Appuswamy, Rathinakumar; Andreopoulos, Alexander; Berg, David J; McKinstry, Jeffrey L; Melano, Timothy; Barch, Davis R; di Nolfo, Carmelo; Datta, Pallab; Amir, Arnon; Taba, Brian; Flickner, Myron D; Modha, Dharmendra S

2016-10-11

Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.
DualTrust: A Trust Management Model for Swarm-Based Autonomic Computing Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Maiden, Wendy M.

Trust management techniques must be adapted to the unique needs of the application architectures and problem domains to which they are applied. For autonomic computing systems that utilize mobile agents and ant colony algorithms for their sensor layer, certain characteristics of the mobile agent ant swarm -- their lightweight, ephemeral nature and indirect communication -- make this adaptation especially challenging. This thesis looks at the trust issues and opportunities in swarm-based autonomic computing systems and finds that by monitoring the trustworthiness of the autonomic managers rather than the swarming sensors, the trust management problem becomes much more scalable and stillmore » serves to protect the swarm. After analyzing the applicability of trust management research as it has been applied to architectures with similar characteristics, this thesis specifies the required characteristics for trust management mechanisms used to monitor the trustworthiness of entities in a swarm-based autonomic computing system and describes a trust model that meets these requirements.« less
A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Ediger, David; Jiang, Karl

2009-05-29

We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor.more » For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less
The Research and Implementation of MUSER CLEAN Algorithm Based on OpenCL

NASA Astrophysics Data System (ADS)

Feng, Y.; Chen, K.; Deng, H.; Wang, F.; Mei, Y.; Wei, S. L.; Dai, W.; Yang, Q. P.; Liu, Y. B.; Wu, J. P.

2017-03-01

It's urgent to carry out high-performance data processing with a single machine in the development of astronomical software. However, due to the different configuration of the machine, traditional programming techniques such as multi-threading, and CUDA (Compute Unified Device Architecture)+GPU (Graphic Processing Unit) have obvious limitations in portability and seamlessness between different operation systems. The OpenCL (Open Computing Language) used in the development of MUSER (MingantU SpEctral Radioheliograph) data processing system is introduced. And the Högbom CLEAN algorithm is re-implemented into parallel CLEAN algorithm by the Python language and PyOpenCL extended package. The experimental results show that the CLEAN algorithm based on OpenCL has approximately equally operating efficiency compared with the former CLEAN algorithm based on CUDA. More important, the data processing in merely CPU (Central Processing Unit) environment of this system can also achieve high performance, which has solved the problem of environmental dependence of CUDA+GPU. Overall, the research improves the adaptability of the system with emphasis on performance of MUSER image clean computing. In the meanwhile, the realization of OpenCL in MUSER proves its availability in scientific data processing. In view of the high-performance computing features of OpenCL in heterogeneous environment, it will probably become the preferred technology in the future high-performance astronomical software development.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.

PubMed

Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu

2015-07-01

Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
Pre-Hardware Optimization and Implementation Of Fast Optics Closed Control Loop Algorithms

NASA Technical Reports Server (NTRS)

Kizhner, Semion; Lyon, Richard G.; Herman, Jay R.; Abuhassan, Nader

2004-01-01

One of the main heritage tools used in scientific and engineering data spectrum analysis is the Fourier Integral Transform and its high performance digital equivalent - the Fast Fourier Transform (FFT). The FFT is particularly useful in two-dimensional (2-D) image processing (FFT2) within optical systems control. However, timing constraints of a fast optics closed control loop would require a supercomputer to run the software implementation of the FFT2 and its inverse, as well as other image processing representative algorithm, such as numerical image folding and fringe feature extraction. A laboratory supercomputer is not always available even for ground operations and is not feasible for a night project. However, the computationally intensive algorithms still warrant alternative implementation using reconfigurable computing technologies (RC) such as Digital Signal Processors (DSP) and Field Programmable Gate Arrays (FPGA), which provide low cost compact super-computing capabilities. We present a new RC hardware implementation and utilization architecture that significantly reduces the computational complexity of a few basic image-processing algorithm, such as FFT2, image folding and phase diversity for the NASA Solar Viewing Interferometer Prototype (SVIP) using a cluster of DSPs and FPGAs. The DSP cluster utilization architecture also assures avoidance of a single point of failure, while using commercially available hardware. This, combined with the control algorithms pre-hardware optimization, or the first time allows construction of image-based 800 Hertz (Hz) optics closed control loops on-board a spacecraft, based on the SVIP ground instrument. That spacecraft is the proposed Earth Atmosphere Solar Occultation Imager (EASI) to study greenhouse gases CO2, C2H, H2O, O3, O2, N2O from Lagrange-2 point in space. This paper provides an advanced insight into a new type of science capabilities for future space exploration missions based on on-board image processing for control and for robotics missions using vision sensors. It presents a top-level description of technologies required for the design and construction of SVIP and EASI and to advance the spatial-spectral imaging and large-scale space interferometry science and engineering.
Scheduling based on a dynamic resource connection

NASA Astrophysics Data System (ADS)

Nagiyev, A. E.; Botygin, I. A.; Shersntneva, A. I.; Konyaev, P. A.

2017-02-01

The practical using of distributed computing systems associated with many problems, including troubles with the organization of an effective interaction between the agents located at the nodes of the system, with the specific configuration of each node of the system to perform a certain task, with the effective distribution of the available information and computational resources of the system, with the control of multithreading which implements the logic of solving research problems and so on. The article describes the method of computing load balancing in distributed automatic systems, focused on the multi-agency and multi-threaded data processing. The scheme of the control of processing requests from the terminal devices, providing the effective dynamic scaling of computing power under peak load is offered. The results of the model experiments research of the developed load scheduling algorithm are set out. These results show the effectiveness of the algorithm even with a significant expansion in the number of connected nodes and zoom in the architecture distributed computing system.
Π4U: A high performance computing framework for Bayesian uncertainty quantification of complex models

NASA Astrophysics Data System (ADS)

Hadjidoukas, P. E.; Angelikopoulos, P.; Papadimitriou, C.; Koumoutsakos, P.

2015-03-01

We present Π4U, an extensible framework, for non-intrusive Bayesian Uncertainty Quantification and Propagation (UQ+P) of complex and computationally demanding physical models, that can exploit massively parallel computer architectures. The framework incorporates Laplace asymptotic approximations as well as stochastic algorithms, along with distributed numerical differentiation and task-based parallelism for heterogeneous clusters. Sampling is based on the Transitional Markov Chain Monte Carlo (TMCMC) algorithm and its variants. The optimization tasks associated with the asymptotic approximations are treated via the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). A modified subset simulation method is used for posterior reliability measurements of rare events. The framework accommodates scheduling of multiple physical model evaluations based on an adaptive load balancing library and shows excellent scalability. In addition to the software framework, we also provide guidelines as to the applicability and efficiency of Bayesian tools when applied to computationally demanding physical models. Theoretical and computational developments are demonstrated with applications drawn from molecular dynamics, structural dynamics and granular flow.
The Ettention software package.

PubMed

Dahmen, Tim; Marsalek, Lukas; Marniok, Nico; Turoňová, Beata; Bogachev, Sviatoslav; Trampert, Patrick; Nickels, Stefan; Slusallek, Philipp

2016-02-01

We present a novel software package for the problem "reconstruction from projections" in electron microscopy. The Ettention framework consists of a set of modular building-blocks for tomographic reconstruction algorithms. The well-known block iterative reconstruction method based on Kaczmarz algorithm is implemented using these building-blocks, including adaptations specific to electron tomography. Ettention simultaneously features (1) a modular, object-oriented software design, (2) optimized access to high-performance computing (HPC) platforms such as graphic processing units (GPU) or many-core architectures like Xeon Phi, and (3) accessibility to microscopy end-users via integration in the IMOD package and eTomo user interface. We also provide developers with a clean and well-structured application programming interface (API) that allows for extending the software easily and thus makes it an ideal platform for algorithmic research while hiding most of the technical details of high-performance computing. Copyright © 2015 Elsevier B.V. All rights reserved.
Factoring symmetric indefinite matrices on high-performance architectures

NASA Technical Reports Server (NTRS)

Jones, Mark T.; Patrick, Merrell L.

1990-01-01

The Bunch-Kaufman algorithm is the method of choice for factoring symmetric indefinite matrices in many applications. However, the Bunch-Kaufman algorithm does not take advantage of high-performance architectures such as the Cray Y-MP. Three new algorithms, based on Bunch-Kaufman factorization, that take advantage of such architectures are described. Results from an implementation of the third algorithm are presented.
A Sparse Self-Consistent Field Algorithm and Its Parallel Implementation: Application to Density-Functional-Based Tight Binding.

PubMed

Scemama, Anthony; Renon, Nicolas; Rapacioli, Mathias

2014-06-10

We present an algorithm and its parallel implementation for solving a self-consistent problem as encountered in Hartree-Fock or density functional theory. The algorithm takes advantage of the sparsity of matrices through the use of local molecular orbitals. The implementation allows one to exploit efficiently modern symmetric multiprocessing (SMP) computer architectures. As a first application, the algorithm is used within the density-functional-based tight binding method, for which most of the computational time is spent in the linear algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that with this algorithm (i) single point calculations on very large systems (millions of atoms) can be performed on large SMP machines, (ii) calculations involving intermediate size systems (1000-100 000 atoms) are also strongly accelerated and can run efficiently on standard servers, and (iii) the error on the total energy due to the use of a cutoff in the molecular orbital coefficients can be controlled such that it remains smaller than the SCF convergence criterion.
Using Hybrid Algorithm to Improve Intrusion Detection in Multi Layer Feed Forward Neural Networks

ERIC Educational Resources Information Center

Ray, Loye Lynn

2014-01-01

The need for detecting malicious behavior on a computer networks continued to be important to maintaining a safe and secure environment. The purpose of this study was to determine the relationship of multilayer feed forward neural network architecture to the ability of detecting abnormal behavior in networks. This involved building, training, and…

Space Debris Detection on the HPDP, a Coarse-Grained Reconfigurable Array Architecture for Space

NASA Astrophysics Data System (ADS)

Suarez, Diego Andres; Bretz, Daniel; Helfers, Tim; Weidendorfer, Josef; Utzmann, Jens

2016-08-01

Stream processing, widely used in communications and digital signal processing applications, requires high- throughput data processing that is achieved in most cases using Application-Specific Integrated Circuit (ASIC) designs. Lack of programmability is an issue especially in space applications, which use on-board components with long life-cycles requiring applications updates. To this end, the High Performance Data Processor (HPDP) architecture integrates an array of coarse-grained reconfigurable elements to provide both flexible and efficient computational power suitable for stream-based data processing applications in space. In this work the capabilities of the HPDP architecture are demonstrated with the implementation of a real-time image processing algorithm for space debris detection in a space-based space surveillance system. The implementation challenges and alternatives are described making trade-offs to improve performance at the expense of negligible degradation of detection accuracy. The proposed implementation uses over 99% of the available computational resources. Performance estimations based on simulations show that the HPDP can amply match the application requirements.
ASIC-based architecture for the real-time computation of 2D convolution with large kernel size

NASA Astrophysics Data System (ADS)

Shao, Rui; Zhong, Sheng; Yan, Luxin

2015-12-01

Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the ASIC-based implementation of 2-D convolution with medium-large kernels. Aiming to improve the efficiency of storage resources on-chip, reducing off-chip bandwidth of these two issues, proposed construction of a data cache reuse. Multi-block SPRAM to cross cached images and the on-chip ping-pong operation takes full advantage of the data convolution calculation reuse, design a new ASIC data scheduling scheme and overall architecture. Experimental results show that the structure can achieve 40× 32 size of template real-time convolution operations, and improve the utilization of on-chip memory bandwidth and on-chip memory resources, the experimental results show that the structure satisfies the conditions to maximize data throughput output , reducing the need for off-chip memory bandwidth.
Implementation and evaluation of various demons deformable image registration algorithms on a GPU.

PubMed

Gu, Xuejun; Pan, Hubert; Liang, Yun; Castillo, Richard; Yang, Deshan; Choi, Dongju; Castillo, Edward; Majumdar, Amitava; Guerrero, Thomas; Jiang, Steve B

2010-01-07

Online adaptive radiation therapy (ART) promises the ability to deliver an optimal treatment in response to daily patient anatomic variation. A major technical barrier for the clinical implementation of online ART is the requirement of rapid image segmentation. Deformable image registration (DIR) has been used as an automated segmentation method to transfer tumor/organ contours from the planning image to daily images. However, the current computational time of DIR is insufficient for online ART. In this work, this issue is addressed by using computer graphics processing units (GPUs). A gray-scale-based DIR algorithm called demons and five of its variants were implemented on GPUs using the compute unified device architecture (CUDA) programming environment. The spatial accuracy of these algorithms was evaluated over five sets of pulmonary 4D CT images with an average size of 256 x 256 x 100 and more than 1100 expert-determined landmark point pairs each. For all the testing scenarios presented in this paper, the GPU-based DIR computation required around 7 to 11 s to yield an average 3D error ranging from 1.5 to 1.8 mm. It is interesting to find out that the original passive force demons algorithms outperform subsequently proposed variants based on the combination of accuracy, efficiency and ease of implementation.
Execution of parallel algorithms on a heterogeneous multicomputer

NASA Astrophysics Data System (ADS)

Isenstein, Barry S.; Greene, Jonathon

1995-04-01

Many aerospace/defense sensing and dual-use applications require high-performance computing, extensive high-bandwidth interconnect and realtime deterministic operation. This paper will describe the architecture of a scalable multicomputer that includes DSP and RISC processors. A single chassis implementation is capable of delivering in excess of 10 GFLOPS of DSP processing power with 2 Gbytes/s of realtime sensor I/O. A software approach to implementing parallel algorithms called the Parallel Application System (PAS) is also presented. An example of applying PAS to a DSP application is shown.
Speed challenge: a case for hardware implementation in soft-computing

NASA Technical Reports Server (NTRS)

Daud, T.; Stoica, A.; Duong, T.; Keymeulen, D.; Zebulum, R.; Thomas, T.; Thakoor, A.

2000-01-01

For over a decade, JPL has been actively involved in soft computing research on theory, architecture, applications, and electronics hardware. The driving force in all our research activities, in addition to the potential enabling technology promise, has been creation of a niche that imparts orders of magnitude speed advantage by implementation in parallel processing hardware with algorithms made especially suitable for hardware implementation. We review our work on neural networks, fuzzy logic, and evolvable hardware with selected application examples requiring real time response capabilities.
System Level Applications of Adaptive Computing (SLAAC)

DTIC Science & Technology

2003-11-01

saved clock cycles, as the computation cycle time was directly proportional to the number of bitplanes in the image. The simulation was undertaken in...S-1][D -1] SK E W E R [k+K S-1][0] SK E W E R [k+K S-1][1] MinMax MinMax MinMax Min - IdxMin Max - IdxMax 0 Figure 3: PPI algorithm architeture ...parallel processing of data. The total throughput in these extended architectures is directly proportional to the amount of resources (CLB slices
Spatial operator factorization and inversion of the manipulator mass matrix

NASA Technical Reports Server (NTRS)

Rodriguez, Guillermo; Kreutz-Delgado, Kenneth

1992-01-01

This paper advances two linear operator factorizations of the manipulator mass matrix. Embedded in the factorizations are many of the techniques that are regarded as very efficient computational solutions to inverse and forward dynamics problems. The operator factorizations provide a high-level architectural understanding of the mass matrix and its inverse, which is not visible in the detailed algorithms. They also lead to a new approach to the development of computer programs or organize complexity in robot dynamics.
A high level language for a high performance computer

NASA Technical Reports Server (NTRS)

Perrott, R. H.

1978-01-01

The proposed computational aerodynamic facility will join the ranks of the supercomputers due to its architecture and increased execution speed. At present, the languages used to program these supercomputers have been modifications of programming languages which were designed many years ago for sequential machines. A new programming language should be developed based on the techniques which have proved valuable for sequential programming languages and incorporating the algorithmic techniques required for these supercomputers. The design objectives for such a language are outlined.
East-West paths to unconventional computing.

PubMed

Adamatzky, Andrew; Akl, Selim; Burgin, Mark; Calude, Cristian S; Costa, José Félix; Dehshibi, Mohammad Mahdi; Gunji, Yukio-Pegio; Konkoli, Zoran; MacLennan, Bruce; Marchal, Bruno; Margenstern, Maurice; Martínez, Genaro J; Mayne, Richard; Morita, Kenichi; Schumann, Andrew; Sergeyev, Yaroslav D; Sirakoulis, Georgios Ch; Stepney, Susan; Svozil, Karl; Zenil, Hector

2017-12-01

Unconventional computing is about breaking boundaries in thinking, acting and computing. Typical topics of this non-typical field include, but are not limited to physics of computation, non-classical logics, new complexity measures, novel hardware, mechanical, chemical and quantum computing. Unconventional computing encourages a new style of thinking while practical applications are obtained from uncovering and exploiting principles and mechanisms of information processing in and functional properties of, physical, chemical and living systems; in particular, efficient algorithms are developed, (almost) optimal architectures are designed and working prototypes of future computing devices are manufactured. This article includes idiosyncratic accounts of 'unconventional computing' scientists reflecting on their personal experiences, what attracted them to the field, their inspirations and discoveries. Copyright © 2017 Elsevier Ltd. All rights reserved.
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database

DOE PAGES

Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...

2013-01-01

Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm of creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in amore » 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n number of ghost layers up to a point where the whole partitioned mesh can be ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.« less
Graphics Processing Unit–Enhanced Genetic Algorithms for Solving the Temporal Dynamics of Gene Regulatory Networks

PubMed Central

García-Calvo, Raúl; Guisado, JL; Diaz-del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco

2018-01-01

Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge amount of rule combinations and the nonlinear inherent nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes—master-slave, island, cellular, and hybrid models, and various individual selection methods (roulette, elitist)—is carried out for this problem. Several procedures that optimize the use of the GPU’s resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousands of individuals grouped in a few islands using elitist selection. This model comprises 2 mighty factors for discovering the best solutions: finding good individuals in a short number of generations, and introducing genetic diversity via a relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on medium class GPU over an equivalent sequential single-core implementation running on a recent Intel i7 CPU. This work can provide useful guidance to researchers in biology, medicine, or bioinformatics in how to take advantage of the parallelization on massively parallel devices and GPUs to apply novel metaheuristic algorithms powered by nature for real-world applications (like the method to solve the temporal dynamics of GRNs). PMID:29662297
Excavator Design Validation

NASA Technical Reports Server (NTRS)

Pholsiri, Chalongrath; English, James; Seberino, Charles; Lim, Yi-Je

2010-01-01

The Excavator Design Validation tool verifies excavator designs by automatically generating control systems and modeling their performance in an accurate simulation of their expected environment. Part of this software design includes interfacing with human operations that can be included in simulation-based studies and validation. This is essential for assessing productivity, versatility, and reliability. This software combines automatic control system generation from CAD (computer-aided design) models, rapid validation of complex mechanism designs, and detailed models of the environment including soil, dust, temperature, remote supervision, and communication latency to create a system of high value. Unique algorithms have been created for controlling and simulating complex robotic mechanisms automatically from just a CAD description. These algorithms are implemented as a commercial cross-platform C++ software toolkit that is configurable using the Extensible Markup Language (XML). The algorithms work with virtually any mobile robotic mechanisms using module descriptions that adhere to the XML standard. In addition, high-fidelity, real-time physics-based simulation algorithms have also been developed that include models of internal forces and the forces produced when a mechanism interacts with the outside world. This capability is combined with an innovative organization for simulation algorithms, new regolith simulation methods, and a unique control and study architecture to make powerful tools with the potential to transform the way NASA verifies and compares excavator designs. Energid's Actin software has been leveraged for this design validation. The architecture includes parametric and Monte Carlo studies tailored for validation of excavator designs and their control by remote human operators. It also includes the ability to interface with third-party software and human-input devices. Two types of simulation models have been adapted: high-fidelity discrete element models and fast analytical models. By using the first to establish parameters for the second, a system has been created that can be executed in real time, or faster than real time, on a desktop PC. This allows Monte Carlo simulations to be performed on a computer platform available to all researchers, and it allows human interaction to be included in a real-time simulation process. Metrics on excavator performance are established that work with the simulation architecture. Both static and dynamic metrics are included.
Graphics Processing Unit-Enhanced Genetic Algorithms for Solving the Temporal Dynamics of Gene Regulatory Networks.

PubMed

García-Calvo, Raúl; Guisado, J L; Diaz-Del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco

2018-01-01

Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge amount of rule combinations and the nonlinear inherent nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes-master-slave, island, cellular, and hybrid models, and various individual selection methods (roulette, elitist)-is carried out for this problem. Several procedures that optimize the use of the GPU's resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousands of individuals grouped in a few islands using elitist selection. This model comprises 2 mighty factors for discovering the best solutions: finding good individuals in a short number of generations, and introducing genetic diversity via a relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on medium class GPU over an equivalent sequential single-core implementation running on a recent Intel i7 CPU. This work can provide useful guidance to researchers in biology, medicine, or bioinformatics in how to take advantage of the parallelization on massively parallel devices and GPUs to apply novel metaheuristic algorithms powered by nature for real-world applications (like the method to solve the temporal dynamics of GRNs).
A Grey Wolf Optimizer for Modular Granular Neural Networks for Human Recognition

PubMed Central

Sánchez, Daniela; Melin, Patricia

2017-01-01

A grey wolf optimizer for modular neural network (MNN) with a granular approach is proposed. The proposed method performs optimal granulation of data and design of modular neural networks architectures to perform human recognition, and to prove its effectiveness benchmark databases of ear, iris, and face biometric measures are used to perform tests and comparisons against other works. The design of a modular granular neural network (MGNN) consists in finding optimal parameters of its architecture; these parameters are the number of subgranules, percentage of data for the training phase, learning algorithm, goal error, number of hidden layers, and their number of neurons. Nowadays, there is a great variety of approaches and new techniques within the evolutionary computing area, and these approaches and techniques have emerged to help find optimal solutions to problems or models and bioinspired algorithms are part of this area. In this work a grey wolf optimizer is proposed for the design of modular granular neural networks, and the results are compared against a genetic algorithm and a firefly algorithm in order to know which of these techniques provides better results when applied to human recognition. PMID:28894461
A Grey Wolf Optimizer for Modular Granular Neural Networks for Human Recognition.

PubMed

Sánchez, Daniela; Melin, Patricia; Castillo, Oscar

2017-01-01

A grey wolf optimizer for modular neural network (MNN) with a granular approach is proposed. The proposed method performs optimal granulation of data and design of modular neural networks architectures to perform human recognition, and to prove its effectiveness benchmark databases of ear, iris, and face biometric measures are used to perform tests and comparisons against other works. The design of a modular granular neural network (MGNN) consists in finding optimal parameters of its architecture; these parameters are the number of subgranules, percentage of data for the training phase, learning algorithm, goal error, number of hidden layers, and their number of neurons. Nowadays, there is a great variety of approaches and new techniques within the evolutionary computing area, and these approaches and techniques have emerged to help find optimal solutions to problems or models and bioinspired algorithms are part of this area. In this work a grey wolf optimizer is proposed for the design of modular granular neural networks, and the results are compared against a genetic algorithm and a firefly algorithm in order to know which of these techniques provides better results when applied to human recognition.
Feasibility of using the Massively Parallel Processor for large eddy simulations and other Computational Fluid Dynamics applications

NASA Technical Reports Server (NTRS)

Bruno, John

1984-01-01

The results of an investigation into the feasibility of using the MPP for direct and large eddy simulations of the Navier-Stokes equations is presented. A major part of this study was devoted to the implementation of two of the standard numerical algorithms for CFD. These implementations were not run on the Massively Parallel Processor (MPP) since the machine delivered to NASA Goddard does not have sufficient capacity. Instead, a detailed implementation plan was designed and from these were derived estimates of the time and space requirements of the algorithms on a suitably configured MPP. In addition, other issues related to the practical implementation of these algorithms on an MPP-like architecture were considered; namely, adaptive grid generation, zonal boundary conditions, the table lookup problem, and the software interface. Performance estimates show that the architectural components of the MPP, the Staging Memory and the Array Unit, appear to be well suited to the numerical algorithms of CFD. This combined with the prospect of building a faster and larger MMP-like machine holds the promise of achieving sustained gigaflop rates that are required for the numerical simulations in CFD.
Hybrid techniques for the digital control of mechanical and optical systems

NASA Astrophysics Data System (ADS)

Acernese, Fausto; Barone, Fabrizio; De Rosa, Rosario; Eleuteri, Antonio; Milano, Leopoldo; Pardi, Silvio; Ricciardi, Iolanda; Russo, Guido

2004-07-01

One of the main requirements of a digital system for the control of interferometric detectors of gravitational waves is the computing power, that is a direct consequence of the increasing complexity of the digital algorithms necessary for the control signals generation. For this specific task many specialised non standard real-time architectures have been developed, often very expensive and difficult to upgrade. On the other hand, such computing power is generally fully available for off-line applications on standard Pc based systems. Therefore, a possible and obvious solution may be provided by the integration of both the the real-time and off-line architecture resulting in a hybrid control system architecture based on standards available components, trying to get both the advantages of the perfect data synchronization provided by the real-time systems and by the large computing power available on Pc based systems. Such integration may be provided by the implementation of the link between the two different architectures through the standard Ethernet network, whose data transfer speed is largely increasing in these years, using the TCP/IP and UDP protocols. In this paper we describe the architecture of an hybrid Ethernet based real-time control system protoype we implemented in Napoli, discussing its characteristics and performances. Finally we discuss a possible application to the real-time control of a suspended mass of the mode cleaner of the 3m prototype optical interferometer for gravitational wave detection (IDGW-3P) operational in Napoli.
An efficient interpolation filter VLSI architecture for HEVC standard

NASA Astrophysics Data System (ADS)

Zhou, Wei; Zhou, Xin; Lian, Xiaocong; Liu, Zhenyu; Liu, Xiaoxiang

2015-12-01

The next-generation video coding standard of High-Efficiency Video Coding (HEVC) is especially efficient for coding high-resolution video such as 8K-ultra-high-definition (UHD) video. Fractional motion estimation in HEVC presents a significant challenge in clock latency and area cost as it consumes more than 40 % of the total encoding time and thus results in high computational complexity. With aims at supporting 8K-UHD video applications, an efficient interpolation filter VLSI architecture for HEVC is proposed in this paper. Firstly, a new interpolation filter algorithm based on the 8-pixel interpolation unit is proposed in this paper. It can save 19.7 % processing time on average with acceptable coding quality degradation. Based on the proposed algorithm, an efficient interpolation filter VLSI architecture, composed of a reused data path of interpolation, an efficient memory organization, and a reconfigurable pipeline interpolation filter engine, is presented to reduce the implement hardware area and achieve high throughput. The final VLSI implementation only requires 37.2k gates in a standard 90-nm CMOS technology at an operating frequency of 240 MHz. The proposed architecture can be reused for either half-pixel interpolation or quarter-pixel interpolation, which can reduce the area cost for about 131,040 bits RAM. The processing latency of our proposed VLSI architecture can support the real-time processing of 4:2:0 format 7680 × 4320@78fps video sequences.
A Semi-Automated Machine Learning Algorithm for Tree Cover Delineation from 1-m Naip Imagery Using a High Performance Computing Architecture

NASA Astrophysics Data System (ADS)

Basu, S.; Ganguly, S.; Nemani, R. R.; Mukhopadhyay, S.; Milesi, C.; Votava, P.; Michaelis, A.; Zhang, G.; Cook, B. D.; Saatchi, S. S.; Boyda, E.

2014-12-01

Accurate tree cover delineation is a useful instrument in the derivation of Above Ground Biomass (AGB) density estimates from Very High Resolution (VHR) satellite imagery data. Numerous algorithms have been designed to perform tree cover delineation in high to coarse resolution satellite imagery, but most of them do not scale to terabytes of data, typical in these VHR datasets. In this paper, we present an automated probabilistic framework for the segmentation and classification of 1-m VHR data as obtained from the National Agriculture Imagery Program (NAIP) for deriving tree cover estimates for the whole of Continental United States, using a High Performance Computing Architecture. The results from the classification and segmentation algorithms are then consolidated into a structured prediction framework using a discriminative undirected probabilistic graphical model based on Conditional Random Field (CRF), which helps in capturing the higher order contextual dependencies between neighboring pixels. Once the final probability maps are generated, the framework is updated and re-trained by incorporating expert knowledge through the relabeling of misclassified image patches. This leads to a significant improvement in the true positive rates and reduction in false positive rates. The tree cover maps were generated for the state of California, which covers a total of 11,095 NAIP tiles and spans a total geographical area of 163,696 sq. miles. Our framework produced correct detection rates of around 85% for fragmented forests and 70% for urban tree cover areas, with false positive rates lower than 3% for both regions. Comparative studies with the National Land Cover Data (NLCD) algorithm and the LiDAR high-resolution canopy height model shows the effectiveness of our algorithm in generating accurate high-resolution tree cover maps.
A simple and fast heuristic for protein structure comparison.

PubMed

Pelta, David A; González, Juan R; Moreno Vega, Marcos

2008-03-25

Protein structure comparison is a key problem in bioinformatics. There exist several methods for doing protein comparison, being the solution of the Maximum Contact Map Overlap problem (MAX-CMO) one of the alternatives available. Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good quality solutions using less computational resources than the formers. We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view the strategy is tested on two different datasets, obtaining an error of 3.5%(over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values; thus leading to high accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. We designed, implemented and tested.a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution's quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.