asynchronous parallel algorithms: Topics by Science.gov

Sample records for asynchronous parallel algorithms

A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations

NASA Technical Reports Server (NTRS)

Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw

2005-01-01

A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel e ciency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
A Binary Array Asynchronous Sorting Algorithm with Using Petri Nets

NASA Astrophysics Data System (ADS)

Voevoda, A. A.; Romannikov, D. O.

2017-01-01

Nowadays the tasks of computations speed-up and/or their optimization are actual. Among the approaches on how to solve these tasks, a method applying approaches of parallelization and asynchronization to a sorting algorithm is considered in the paper. The sorting methods are ones of elementary methods and they are used in a huge amount of different applications. In the paper, we offer a method of an array sorting that based on a division into a set of independent adjacent pairs of numbers and their parallel and asynchronous comparison. And this one distinguishes the offered method from the traditional sorting algorithms (like quick sorting, merge sorting, insertion sorting and others). The algorithm is implemented with the use of Petri nets, like the most suitable tool for an asynchronous systems description.
Applications of New Surrogate Global Optimization Algorithms including Efficient Synchronous and Asynchronous Parallelism for Calibration of Expensive Nonlinear Geophysical Simulation Models.

NASA Astrophysics Data System (ADS)

Shoemaker, C. A.; Pang, M.; Akhtar, T.; Bindel, D.

2016-12-01

New parallel surrogate global optimization algorithms are developed and applied to objective functions that are expensive simulations (possibly with multiple local minima). The algorithms can be applied to most geophysical simulations, including those with nonlinear partial differential equations. The optimization does not require simulations be parallelized. Asynchronous (and synchronous) parallel execution is available in the optimization toolbox "pySOT". The parallel algorithms are modified from serial to eliminate fine grained parallelism. The optimization is computed with open source software pySOT, a Surrogate Global Optimization Toolbox that allows user to pick the type of surrogate (or ensembles), the search procedure on surrogate, and the type of parallelism (synchronous or asynchronous). pySOT also allows the user to develop new algorithms by modifying parts of the code. In the applications here, the objective function takes up to 30 minutes for one simulation, and serial optimization can take over 200 hours. Results from Yellowstone (NSF) and NCSS (Singapore) supercomputers are given for groundwater contaminant hydrology simulations with applications to model parameter estimation and decontamination management. All results are compared with alternatives. The first results are for optimization of pumping at many wells to reduce cost for decontamination of groundwater at a superfund site. The optimization runs with up to 128 processors. Superlinear speed up is obtained for up to 16 processors, and efficiency with 64 processors is over 80%. Each evaluation of the objective function requires the solution of nonlinear partial differential equations to describe the impact of spatially distributed pumping and model parameters on model predictions for the spatial and temporal distribution of groundwater contaminants. The second application uses an asynchronous parallel global optimization for groundwater quality model calibration. The time for a single objective function evaluation varies unpredictably, so efficiency is improved with asynchronous parallel calculations to improve load balancing. The third application (done at NCSS) incorporates new global surrogate multi-objective parallel search algorithms into pySOT and applies it to a large watershed calibration problem.
Parallel and Preemptable Dynamically Dimensioned Search Algorithms for Single and Multi-objective Optimization in Water Resources

NASA Astrophysics Data System (ADS)

Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.

2015-12-01

We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption functions to terminate simulation model runs early, prior to completely simulating the model calibration period for example, when intermediate results indicate the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model preemption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations as these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich as well as MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration and in many cases linear or near linear speedups are observed.
Global interrupt and barrier networks

DOEpatents

Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E; Heidelberger, Philip; Kopcsay, Gerard V.; Steinmacher-Burow, Burkhard D.; Takken, Todd E.

2008-10-28

A system and method for generating global asynchronous signals in a computing structure. Particularly, a global interrupt and barrier network is implemented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes of a computing structure in accordance with a processing algorithm; and includes the physical interconnecting of the processing nodes for communicating the global interrupt and barrier signals to the elements via low-latency paths. The global asynchronous signals respectively initiate interrupt and barrier operations at the processing nodes at times selected for optimizing performance of the processing algorithms. In one embodiment, the global interrupt and barrier network is implemented in a scalable, massively parallel supercomputing device structure comprising a plurality of processing nodes interconnected by multiple independent networks, with each node including one or more processing elements for performing computation or communication activity as required when performing parallel algorithm operations. One multiple independent network includes a global tree network for enabling high-speed global tree communications among global tree network nodes or sub-trees thereof. The global interrupt and barrier network may operate in parallel with the global tree network for providing global asynchronous sideband signals.
Reverse engineering a gene network using an asynchronous parallel evolution strategy

PubMed Central

2010-01-01

Background The use of reverse engineering methods to infer gene regulatory networks by fitting mathematical models to gene expression data is becoming increasingly popular and successful. However, increasing model complexity means that more powerful global optimisation techniques are required for model fitting. The parallel Lam Simulated Annealing (pLSA) algorithm has been used in such approaches, but recent research has shown that island Evolutionary Strategies can produce faster, more reliable results. However, no parallel island Evolutionary Strategy (piES) has yet been demonstrated to be effective for this task. Results Here, we present synchronous and asynchronous versions of the piES algorithm, and apply them to a real reverse engineering problem: inferring parameters in the gap gene network. We find that the asynchronous piES exhibits very little communication overhead, and shows significant speed-up for up to 50 nodes: the piES running on 50 nodes is nearly 10 times faster than the best serial algorithm. We compare the asynchronous piES to pLSA on the same test problem, measuring the time required to reach particular levels of residual error, and show that it shows much faster convergence than pLSA across all optimisation conditions tested. Conclusions Our results demonstrate that the piES is consistently faster and more reliable than the pLSA algorithm on this problem, and scales better with increasing numbers of nodes. In addition, the piES is especially well suited to further improvements and adaptations: Firstly, the algorithm's fast initial descent speed and high reliability make it a good candidate for being used as part of a global/local search hybrid algorithm. Secondly, it has the potential to be used as part of a hierarchical evolutionary algorithm, which takes advantage of modern multi-core computing architectures. PMID:20196855
Dynamic grid refinement for partial differential equations on parallel computers

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shi, Xuanhua; Luo, Xuan; Liang, Junling

GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weightmore » asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of realworld graph coloring cases. We find that a majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP and CC). On all the tested applications and datasets, Frog is able to significantly outperform existing GPU-based graph processing systems except Gunrock and MapGraph. MapGraph gets better performance than Frog when running BFS on RoadNet-CA. The comparison between Gunrock and Frog is inconclusive. Frog can outperform Gunrock more than 1.04X when running PageRank and SSSP, while the advantage of Frog is not obvious when running BFS and CC on some datasets especially for RoadNet-CA.« less
Multi-Objective and Multidisciplinary Design Optimisation (MDO) of UAV Systems using Hierarchical Asynchronous Parallel Evolutionary Algorithms

DTIC Science & Technology

2007-09-17

been proposed; these include a combination of variable fidelity models, parallelisation strategies and hybridisation techniques (Coello, Veldhuizen et...Coello et al (Coello, Veldhuizen et al. 2002). 4.4.2 HIERARCHICAL POPULATION TOPOLOGY A hierarchical population topology, when integrated into...to hybrid parallel Multi-Objective Evolutionary Algorithms (pMOEA) (Cantu-Paz 2000; Veldhuizen , Zydallis et al. 2003); it uses a master slave
Parallel asynchronous systems and image processing algorithms

NASA Technical Reports Server (NTRS)

Coon, D. D.; Perera, A. G. U.

1989-01-01

A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to evey pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.
SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

1990-01-01

Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
Optimization by nonhierarchical asynchronous decomposition

NASA Technical Reports Server (NTRS)

Shankar, Jayashree; Ribbens, Calvin J.; Haftka, Raphael T.; Watson, Layne T.

1992-01-01

Large scale optimization problems are tractable only if they are somehow decomposed. Hierarchical decompositions are inappropriate for some types of problems and do not parallelize well. Sobieszczanski-Sobieski has proposed a nonhierarchical decomposition strategy for nonlinear constrained optimization that is naturally parallel. Despite some successes on engineering problems, the algorithm as originally proposed fails on simple two dimensional quadratic programs. The algorithm is carefully analyzed for quadratic programs, and a number of modifications are suggested to improve its robustness.
A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Highly Asynchronous VisitOr Queue Graph Toolkit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pearce, R.

2012-10-01

HAVOQGT is a C++ framework that can be used to create highly parallel graph traversal algorithms. The framework stores the graph and algorithmic data structures on external memory that is typically mapped to high performance locally attached NAND FLASH arrays. The framework supports a vertex-centered visitor programming model. The frameworkd has been used to implement breadth first search, connected components, and single source shortest path.
Ultrascalable petaflop parallel supercomputer

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY

2010-07-20

A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.
Cascaded VLSI neural network architecture for on-line learning

NASA Technical Reports Server (NTRS)

Thakoor, Anilkumar P. (Inventor); Duong, Tuan A. (Inventor); Daud, Taher (Inventor)

1992-01-01

High-speed, analog, fully-parallel, and asynchronous building blocks are cascaded for larger sizes and enhanced resolution. A hardware compatible algorithm permits hardware-in-the-loop learning despite limited weight resolution. A computation intensive feature classification application was demonstrated with this flexible hardware and new algorithm at high speed. This result indicates that these building block chips can be embedded as an application specific coprocessor for solving real world problems at extremely high data rates.
Cascaded VLSI neural network architecture for on-line learning

NASA Technical Reports Server (NTRS)

Duong, Tuan A. (Inventor); Daud, Taher (Inventor); Thakoor, Anilkumar P. (Inventor)

1995-01-01

High-speed, analog, fully-parallel and asynchronous building blocks are cascaded for larger sizes and enhanced resolution. A hardware-compatible algorithm permits hardware-in-the-loop learning despite limited weight resolution. A comparison-intensive feature classification application has been demonstrated with this flexible hardware and new algorithm at high speed. This result indicates that these building block chips can be embedded as application-specific-coprocessors for solving real-world problems at extremely high data rates.
Collective network for computer structures

DOEpatents

Blumrich, Matthias A; Coteus, Paul W; Chen, Dong; Gara, Alan; Giampapa, Mark E; Heidelberger, Philip; Hoenicke, Dirk; Takken, Todd E; Steinmacher-Burow, Burkhard D; Vranas, Pavlos M

2014-01-07

A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to the needs of a processing algorithm.
Collective network for computer structures

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Coteus, Paul W [Yorktown Heights, NY; Chen, Dong [Croton On Hudson, NY; Gara, Alan [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Takken, Todd E [Brewster, NY; Steinmacher-Burow, Burkhard D [Wernau, DE; Vranas, Pavlos M [Bedford Hills, NY

2011-08-16

A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices ate included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.
Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors - Performance results

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids (global and local) to provide adaptive resolution and fast solution of PDEs. Like all such methods, it offers parallelism by using possibly many disconnected patches per level, but is hindered by the need to handle these levels sequentially. The finest levels must therefore wait for processing to be essentially completed on all the coarser ones. A recently developed asynchronous version of FAC, called AFAC, completely eliminates this bottleneck to parallelism. This paper describes timing results for AFAC, coupled with a simple load balancing scheme, applied to the solution of elliptic PDEs on an Intel iPSC hypercube. These tests include performance of certain processes necessary in adaptive methods, including moving grids and changing refinement. A companion paper reports on numerical and analytical results for estimating convergence factors of AFAC applied to very large scale examples.

More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

PubMed Central

Ho, Qirong; Cipar, James; Cui, Henggang; Kim, Jin Kyu; Lee, Seunghak; Gibbons, Phillip B.; Gibson, Garth A.; Ganger, Gregory R.; Xing, Eric P.

2014-01-01

We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes. PMID:25400488
DOE Office of Scientific and Technical Information (OSTI.GOV)

D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning.
Online measurement for geometrical parameters of wheel set based on structure light and CUDA parallel processing

NASA Astrophysics Data System (ADS)

Wu, Kaihua; Shao, Zhencheng; Chen, Nian; Wang, Wenjie

2018-01-01

The wearing degree of the wheel set tread is one of the main factors that influence the safety and stability of running train. Geometrical parameters mainly include flange thickness and flange height. Line structure laser light was projected on the wheel tread surface. The geometrical parameters can be deduced from the profile image. An online image acquisition system was designed based on asynchronous reset of CCD and CUDA parallel processing unit. The image acquisition was fulfilled by hardware interrupt mode. A high efficiency parallel segmentation algorithm based on CUDA was proposed. The algorithm firstly divides the image into smaller squares, and extracts the squares of the target by fusion of k_means and STING clustering image segmentation algorithm. Segmentation time is less than 0.97ms. A considerable acceleration ratio compared with the CPU serial calculation was obtained, which greatly improved the real-time image processing capacity. When wheel set was running in a limited speed, the system placed alone railway line can measure the geometrical parameters automatically. The maximum measuring speed is 120km/h.
Highly Scalable Asynchronous Computing Method for Partial Differential Equations: A Path Towards Exascale

NASA Astrophysics Data System (ADS)

Konduri, Aditya

Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which result in a multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand for extreme levels of parallelism. At realistic conditions, simulations are being carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs as well as their synchronization at these extreme scales take up a significant portion of the total simulation time and result in poor scalability of codes. This issue is likely to pose a bottleneck in scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is conserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors drop always to first-order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend, not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extend this method in solving complex multi-scale problems on Exascale machines.
Distributed Data-aggregation Consensus for Sensor Networks: Relaxation of Consensus Concept and Convergence Property

DTIC Science & Technology

2014-08-01

consensus algorithm called randomized gossip is more suitable [7, 8]. In asynchronous randomized gossip algorithms, pairs of neighboring nodes exchange...messages and perform updates in an asynchronous and unattended manner, and they also 1 The class of broadcast gossip algorithms [9, 10, 11, 12] are...dynamics [2] and asynchronous pairwise randomized gossip [7, 8], broadcast gossip algorithms do not require that nodes know the identities of their
Proxy-equation paradigm: A strategy for massively parallel asynchronous computations

NASA Astrophysics Data System (ADS)

Mittal, Ankita; Girimaji, Sharath

2017-09-01

Massively parallel simulations of transport equation systems call for a paradigm change in algorithm development to achieve efficient scalability. Traditional approaches require time synchronization of processing elements (PEs), which severely restricts scalability. Relaxing synchronization requirement introduces error and slows down convergence. In this paper, we propose and develop a novel "proxy equation" concept for a general transport equation that (i) tolerates asynchrony with minimal added error, (ii) preserves convergence order and thus, (iii) expected to scale efficiently on massively parallel machines. The central idea is to modify a priori the transport equation at the PE boundaries to offset asynchrony errors. Proof-of-concept computations are performed using a one-dimensional advection (convection) diffusion equation. The results demonstrate the promise and advantages of the present strategy.
ASC-ATDM Performance Portability Requirements for 2015-2019

DOE Office of Scientific and Technical Information (OSTI.GOV)

Edwards, Harold C.; Trott, Christian Robert

This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC ) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a., Kokkos) project for 2015 - 2019 . The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread - parallel programming model a nd standard C++ library - based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel pat terns including parallel - for, parallelmore » - reduce, and parallel - scan. Current R&D is focusing on hierarchical parallel patterns such as a directed acyclic graph (DAG) of asynchronous tasks where each task contain s nested data parallel algorithms. This five y ear plan includes R&D required to f ully and performance portably exploit thread parallelism across current and anticipated next generation platforms (NGP). The Kokkos library is being evaluated by many projects exploring algorithm s and code design for NGP. Some production libraries and applications such as Trilinos and LAMMPS have already committed to Kokkos as their foundation for manycore parallelism an d performance portability. These five year requirements includes support required for current and antic ipated ASC projects to be effective and productive in their use of Kokkos on NGP. The greatest risk to the success of Kokkos and ASC projects relying upon Kokkos is a lack of staffing resources to support Kokkos to the degree needed by these ASC projects. This support includes up - to - date tutorials, documentation, multi - platform (hardware and software stack) testing, minor feature enhancements, thread - scalable algorithm consulting, and managing collaborative R&D.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chow, Edmond

Solving sparse problems is at the core of many DOE computational science applications. We focus on the challenge of developing sparse algorithms that can fully exploit the parallelism in extreme-scale computing systems, in particular systems with massive numbers of cores per node. Our approach is to express a sparse matrix factorization as a large number of bilinear constraint equations, and then solving these equations via an asynchronous iterative method. The unknowns in these equations are the matrix entries of the factorization that is desired.
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arampatzis, Giorgos, E-mail: garab@math.uoc.gr; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Plechac, Petr, E-mail: plechac@math.udel.edu

2012-10-01

We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decompositionmore » corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel Fractional step kinetic Monte Carlo algorithms by employing the Trotter Theorem and its randomized variants; these schemes, (a) are partially asynchronous on each fractional step time-window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example, in Ising-type systems and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss work load balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.« less
Scenario Decomposition for 0-1 Stochastic Programs: Improvements and Asynchronous Implementation

DOE PAGES

Ryan, Kevin; Rajan, Deepak; Ahmed, Shabbir

2016-05-01

We recently proposed scenario decomposition algorithm for stochastic 0-1 programs finds an optimal solution by evaluating and removing individual solutions that are discovered by solving scenario subproblems. In our work, we develop an asynchronous, distributed implementation of the algorithm which has computational advantages over existing synchronous implementations of the algorithm. Improvements to both the synchronous and asynchronous algorithm are proposed. We also test the results on well known stochastic 0-1 programs from the SIPLIB test library and is able to solve one previously unsolved instance from the test set.
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

PubMed Central

De Sa, Christopher; Feldman, Matthew; Ré, Christopher; Olukotun, Kunle

2018-01-01

Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems. PMID:29391770
Parallel, Asynchronous Executive (PAX): System concepts, facilities, and architecture

NASA Technical Reports Server (NTRS)

Jones, W. H.

1983-01-01

The Parallel, Asynchronous Executive (PAX) is a software operating system simulation that allows many computers to work on a single problem at the same time. PAX is currently implemented on a UNIVAC 1100/42 computer system. Independent UNIVAC runstreams are used to simulate independent computers. Data are shared among independent UNIVAC runstreams through shared mass-storage files. PAX has achieved the following: (1) applied several computing processes simultaneously to a single, logically unified problem; (2) resolved most parallel processor conflicts by careful work assignment; (3) resolved by means of worker requests to PAX all conflicts not resolved by work assignment; (4) provided fault isolation and recovery mechanisms to meet the problems of an actual parallel, asynchronous processing machine. Additionally, one real-life problem has been constructed for the PAX environment. This is CASPER, a collection of aerodynamic and structural dynamic problem simulation routines. CASPER is not discussed in this report except to provide examples of parallel-processing techniques.
Interpolation algorithm for asynchronous ADC-data

NASA Astrophysics Data System (ADS)

Bramburger, Stefan; Zinke, Benny; Killat, Dirk

2017-09-01

This paper presents a modified interpolation algorithm for signals with variable data rate from asynchronous ADCs. The Adaptive weights Conjugate gradient Toeplitz matrix (ACT) algorithm is extended to operate with a continuous data stream. An additional preprocessing of data with constant and linear sections and a weighted overlap of step-by-step into spectral domain transformed signals improve the reconstruction of the asycnhronous ADC signal. The interpolation method can be used if asynchronous ADC data is fed into synchronous digital signal processing.
GraphReduce: Processing Large-Scale Graphs on Accelerator-Based Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Song, Shuaiwen; Agarwal, Kapil

2015-11-15

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host andmore » device.« less
Quinoa - Adaptive Computational Fluid Dynamics, 0.2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bakosi, Jozsef; Gonzalez, Francisco; Rogers, Brandon

Quinoa is a set of computational tools that enables research and numerical analysis in fluid dynamics. At this time it remains a test-bed to experiment with various algorithms using fully asynchronous runtime systems. Currently, Quinoa consists of the following tools: (1) Walker, a numerical integrator for systems of stochastic differential equations in time. It is a mathematical tool to analyze and design the behavior of stochastic differential equations. It allows the estimation of arbitrary coupled statistics and probability density functions and is currently used for the design of statistical moment approximations for multiple mixing materials in variable-density turbulence. (2) Inciter,more » an overdecomposition-aware finite element field solver for partial differential equations using 3D unstructured grids. Inciter is used to research asynchronous mesh-based algorithms and to experiment with coupling asynchronous to bulk-synchronous parallel code. Two planned new features of Inciter, compared to the previous release (LA-CC-16-015), to be implemented in 2017, are (a) a simple Navier-Stokes solver for ideal single-material compressible gases, and (b) solution-adaptive mesh refinement (AMR), which enables dynamically concentrating compute resources to regions with interesting physics. Using the NS-AMR problem we plan to explore how to scale such high-load-imbalance simulations, representative of large production multiphysics codes, to very large problems on very large computers using an asynchronous runtime system. (3) RNGTest, a test harness to subject random number generators to stringent statistical tests enabling quantitative ranking with respect to their quality and computational cost. (4) UnitTest, a unit test harness, running hundreds of tests per second, capable of testing serial, synchronous, and asynchronous functions. (5) MeshConv, a mesh file converter that can be used to convert 3D tetrahedron meshes from and to either of the following formats: Gmsh, (http://www.geuz.org/gmsh), Netgen, (http://sourceforge.net/apps/mediawiki/netgen-mesher), ExodusII, (http://sourceforge.net/projects/exodusii), HyperMesh, (http://www.altairhyperworks.com/product/HyperMesh).« less
A Stream Tilling Approach to Surface Area Estimation for Large Scale Spatial Data in a Shared Memory System

NASA Astrophysics Data System (ADS)

Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua

2017-12-01

Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tilling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework towards the scheduling of the I/O processes and computing units. Herein, each computing unit encapsulated a same copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the performed experiment demonstrated that our stream tilling estimation can efficiently alleviate the heavy pressures from the I/O-bound work, and the measured speedup after being optimized have greatly outperformed the directly parallel versions in shared memory systems with multi-core processors.
Privacy Preservation in Distributed Subgradient Optimization Algorithms.

PubMed

Lou, Youcheng; Yu, Lean; Wang, Shouyang; Yi, Peng

2017-07-31

In this paper, some privacy-preserving features for distributed subgradient optimization algorithms are considered. Most of the existing distributed algorithms focus mainly on the algorithm design and convergence analysis, but not the protection of agents' privacy. Privacy is becoming an increasingly important issue in applications involving sensitive information. In this paper, we first show that the distributed subgradient synchronous homogeneous-stepsize algorithm is not privacy preserving in the sense that the malicious agent can asymptotically discover other agents' subgradients by transmitting untrue estimates to its neighbors. Then a distributed subgradient asynchronous heterogeneous-stepsize projection algorithm is proposed and accordingly its convergence and optimality is established. In contrast to the synchronous homogeneous-stepsize algorithm, in the new algorithm agents make their optimization updates asynchronously with heterogeneous stepsizes. The introduced two mechanisms of projection operation and asynchronous heterogeneous-stepsize optimization can guarantee that agents' privacy can be effectively protected.
A massively asynchronous, parallel brain.

PubMed

Zeki, Semir

2015-05-19

Whether the visual brain uses a parallel or a serial, hierarchical, strategy to process visual signals, the end result appears to be that different attributes of the visual scene are perceived asynchronously--with colour leading form (orientation) by 40 ms and direction of motion by about 80 ms. Whatever the neural root of this asynchrony, it creates a problem that has not been properly addressed, namely how visual attributes that are perceived asynchronously over brief time windows after stimulus onset are bound together in the longer term to give us a unified experience of the visual world, in which all attributes are apparently seen in perfect registration. In this review, I suggest that there is no central neural clock in the (visual) brain that synchronizes the activity of different processing systems. More likely, activity in each of the parallel processing-perceptual systems of the visual brain is reset independently, making of the brain a massively asynchronous organ, just like the new generation of more efficient computers promise to be. Given the asynchronous operations of the brain, it is likely that the results of activities in the different processing-perceptual systems are not bound by physiological interactions between cells in the specialized visual areas, but post-perceptually, outside the visual brain.
GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Agarwal, Kapil; Song, Shuaiwen

2015-09-30

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the hostmore » and the device.« less
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms.

PubMed

De Sa, Christopher; Zhang, Ce; Olukotun, Kunle; Ré, Christopher

2015-12-01

Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

Neighbor Discovery Algorithm in Wireless Local Area Networks Using Multi-beam Directional Antennas

NASA Astrophysics Data System (ADS)

Wang, Jin; Peng, Wei; Liu, Song

2017-10-01

Neighbor discovery is an important step for Wireless Local Area Networks (WLAN) and the use of multi-beam directional antennas can greatly improve the network performance. However, most neighbor discovery algorithms in WLAN, based on multi-beam directional antennas, can only work effectively in synchronous system but not in asynchro-nous system. And collisions at AP remain a bottleneck for neighbor discovery. In this paper, we propose two asynchrono-us neighbor discovery algorithms: asynchronous hierarchical scanning (AHS) and asynchronous directional scanning (ADS) algorithm. Both of them are based on three-way handshaking mechanism. AHS and ADS reduce collisions at AP to have a good performance in a hierarchical way and directional way respectively. In the end, the performance of the AHS and ADS are tested on OMNeT++. Moreover, it is analyzed that different application scenarios and the factors how to affect the performance of these algorithms. The simulation results show that AHS is suitable for the densely populated scenes around AP while ADS is suitable for that most of the neighborhood nodes are far from AP.
Tempest: Accelerated MS/MS database search software for heterogeneous computing platforms

PubMed Central

Adamo, Mark E.; Gerber, Scott A.

2017-01-01

MS/MS database search algorithms derive a set of candidate peptide sequences from in-silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU generates peptide candidates that are asynchronously sent to a discrete GPU to be scored against experimental spectra in parallel (Milloy et al., 2012). The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. PMID:27603022
Pteros 2.0: Evolution of the fast parallel molecular analysis library for C++ and python.

PubMed

Yesylevskyy, Semen O

2015-07-15

Pteros is the high-performance open-source library for molecular modeling and analysis of molecular dynamics trajectories. Starting from version 2.0 Pteros is available for C++ and Python programming languages with very similar interfaces. This makes it suitable for writing complex reusable programs in C++ and simple interactive scripts in Python alike. New version improves the facilities for asynchronous trajectory reading and parallel execution of analysis tasks by introducing analysis plugins which could be written in either C++ or Python in completely uniform way. The high level of abstraction provided by analysis plugins greatly simplifies prototyping and implementation of complex analysis algorithms. Pteros is available for free under Artistic License from http://sourceforge.net/projects/pteros/. © 2015 Wiley Periodicals, Inc.
Multiple asynchronous stimulus- and task-dependent hierarchies (STDH) within the visual brain's parallel processing systems.

PubMed

Zeki, Semir

2016-10-01

Results from a variety of sources, some many years old, lead ineluctably to a re-appraisal of the twin strategies of hierarchical and parallel processing used by the brain to construct an image of the visual world. Contrary to common supposition, there are at least three 'feed-forward' anatomical hierarchies that reach the primary visual cortex (V1) and the specialized visual areas outside it, in parallel. These anatomical hierarchies do not conform to the temporal order with which visual signals reach the specialized visual areas through V1. Furthermore, neither the anatomical hierarchies nor the temporal order of activation through V1 predict the perceptual hierarchies. The latter shows that we see (and become aware of) different visual attributes at different times, with colour leading form (orientation) and directional visual motion, even though signals from fast-moving, high-contrast stimuli are among the earliest to reach the visual cortex (of area V5). Parallel processing, on the other hand, is much more ubiquitous than commonly supposed but is subject to a barely noticed but fundamental aspect of brain operations, namely that different parallel systems operate asynchronously with respect to each other and reach perceptual endpoints at different times. This re-assessment leads to the conclusion that the visual brain is constituted of multiple, parallel and asynchronously operating task- and stimulus-dependent hierarchies (STDH); which of these parallel anatomical hierarchies have temporal and perceptual precedence at any given moment is stimulus and task related, and dependent on the visual brain's ability to undertake multiple operations asynchronously. © 2016 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Software and hardware complex for research and management of the separation process

NASA Astrophysics Data System (ADS)

Borisov, A. P.

2018-01-01

The article is devoted to the development of a program for studying the operation of an asynchronous electric drive using vector-algorithmic switching of windings, as well as the development of a hardware-software complex for controlling parameters and controlling the speed of rotation of an asynchronous electric drive for investigating the operation of a cyclone. To study the operation of an asynchronous electric drive, a method was used in which the average value of flux linkage is found and a method for vector-algorithmic calculation of the power and electromagnetic moment of an asynchronous electric drive feeding from a single-phase network is developed, with vector-algorithmic commutation, and software for calculating parameters. The software part of the complex allows to regulate the speed of rotation of the motor by vector-algorithmic switching of transistors or, using pulse-width modulation (PWM), set any engine speed. Also sensors are connected to the hardware-software complex at the inlet and outlet of the cyclone. The developed cyclone with an inserted complex allows to receive high efficiency of product separation at various entrance speeds. At an inlet air speed of 18 m / s, the cyclone’s maximum efficiency is achieved. For this, it is necessary to provide the rotational speed of an asynchronous electric drive with a frequency of 45 Hz.
Asynchronous Gossip for Averaging and Spectral Ranking

NASA Astrophysics Data System (ADS)

Borkar, Vivek S.; Makhijani, Rahul; Sundaresan, Rajesh

2014-08-01

We consider two variants of the classical gossip algorithm. The first variant is a version of asynchronous stochastic approximation. We highlight a fundamental difficulty associated with the classical asynchronous gossip scheme, viz., that it may not converge to a desired average, and suggest an alternative scheme based on reinforcement learning that has guaranteed convergence to the desired average. We then discuss a potential application to a wireless network setting with simultaneous link activation constraints. The second variant is a gossip algorithm for distributed computation of the Perron-Frobenius eigenvector of a nonnegative matrix. While the first variant draws upon a reinforcement learning algorithm for an average cost controlled Markov decision problem, the second variant draws upon a reinforcement learning algorithm for risk-sensitive control. We then discuss potential applications of the second variant to ranking schemes, reputation networks, and principal component analysis.
Asynchronous Replica Exchange Software for Grid and Heterogeneous Computing.

PubMed

Gallicchio, Emilio; Xia, Junchao; Flynn, William F; Zhang, Baofeng; Samlalsingh, Sade; Mentes, Ahmet; Levy, Ronald M

2015-11-01

Parallel replica exchange sampling is an extended ensemble technique often used to accelerate the exploration of the conformational ensemble of atomistic molecular simulations of chemical systems. Inter-process communication and coordination requirements have historically discouraged the deployment of replica exchange on distributed and heterogeneous resources. Here we describe the architecture of a software (named ASyncRE) for performing asynchronous replica exchange molecular simulations on volunteered computing grids and heterogeneous high performance clusters. The asynchronous replica exchange algorithm on which the software is based avoids centralized synchronization steps and the need for direct communication between remote processes. It allows molecular dynamics threads to progress at different rates and enables parameter exchanges among arbitrary sets of replicas independently from other replicas. ASyncRE is written in Python following a modular design conducive to extensions to various replica exchange schemes and molecular dynamics engines. Applications of the software for the modeling of association equilibria of supramolecular and macromolecular complexes on BOINC campus computational grids and on the CPU/MIC heterogeneous hardware of the XSEDE Stampede supercomputer are illustrated. They show the ability of ASyncRE to utilize large grids of desktop computers running the Windows, MacOS, and/or Linux operating systems as well as collections of high performance heterogeneous hardware devices.
The fusion code XGC: Enabling kinetic study of multi-scale edge turbulent transport in ITER [Book Chapter

DOE Office of Scientific and Technical Information (OSTI.GOV)

D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning for balancing computational work in pushing particlesmore » and in grid related work, scalable and accurate discretization algorithms for non-linear Coulomb collisions, and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership class computer.« less
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS

PubMed Central

Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.

2014-01-01

Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
A massively asynchronous, parallel brain

PubMed Central

Zeki, Semir

2015-01-01

Whether the visual brain uses a parallel or a serial, hierarchical, strategy to process visual signals, the end result appears to be that different attributes of the visual scene are perceived asynchronously—with colour leading form (orientation) by 40 ms and direction of motion by about 80 ms. Whatever the neural root of this asynchrony, it creates a problem that has not been properly addressed, namely how visual attributes that are perceived asynchronously over brief time windows after stimulus onset are bound together in the longer term to give us a unified experience of the visual world, in which all attributes are apparently seen in perfect registration. In this review, I suggest that there is no central neural clock in the (visual) brain that synchronizes the activity of different processing systems. More likely, activity in each of the parallel processing-perceptual systems of the visual brain is reset independently, making of the brain a massively asynchronous organ, just like the new generation of more efficient computers promise to be. Given the asynchronous operations of the brain, it is likely that the results of activities in the different processing-perceptual systems are not bound by physiological interactions between cells in the specialized visual areas, but post-perceptually, outside the visual brain. PMID:25823871
Tempest: Accelerated MS/MS Database Search Software for Heterogeneous Computing Platforms.

PubMed

Adamo, Mark E; Gerber, Scott A

2016-09-07

MS/MS database search algorithms derive a set of candidate peptide sequences from in silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU (central processing unit) generates peptide candidates that are asynchronously sent to a discrete GPU (graphics processing unit) to be scored against experimental spectra in parallel. The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. © 2016 by John Wiley & Sons, Inc. Copyright © 2016 John Wiley & Sons, Inc.
An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing

NASA Astrophysics Data System (ADS)

Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng

2018-02-01

De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
Barrier-breaking performance for industrial problems on the CRAY C916

DOE Office of Scientific and Technical Information (OSTI.GOV)

Graffunder, S.K.

1993-12-31

Nine applications, including third-party codes, were submitted to the Gordon Bell Prize committee showing the CRAY C916 supercomputer providing record-breaking time to solution for industrial problems in several disciplines. Performance was obtained by balancing raw hardware speed; effective use of large, real, shared memory; compiler vectorization and autotasking; hand optimization; asynchronous I/O techniques; and new algorithms. The highest GFLOPS performance for the submissions was 11.1 GFLOPS out of a peak advertised performance of 16 GFLOPS for the CRAY C916 system. One program achieved a 15.45 speedup from the compiler with just two hand-inserted directives to scope variables properly for themore » mathematical library. New I/O techniques hide tens of gigabytes of I/O behind parallel computations. Finally, new iterative solver algorithms have demonstrated times to solution on 1 CPU as high as 70 times faster than the best direct solvers.« less
On the utility of threads for data parallel programming

NASA Technical Reports Server (NTRS)

Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush

1995-01-01

Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily through the use of overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.
Pteros: fast and easy to use open-source C++ library for molecular analysis.

PubMed

Yesylevskyy, Semen O

2012-07-15

An open-source Pteros library for molecular modeling and analysis of molecular dynamics trajectories for C++ programming language is introduced. Pteros provides a number of routine analysis operations ranging from reading and writing trajectory files and geometry transformations to structural alignment and computation of nonbonded interaction energies. The library features asynchronous trajectory reading and parallel execution of several analysis routines, which greatly simplifies development of computationally intensive trajectory analysis algorithms. Pteros programming interface is very simple and intuitive while the source code is well documented and easily extendible. Pteros is available for free under open-source Artistic License from http://sourceforge.net/projects/pteros/. Copyright © 2012 Wiley Periodicals, Inc.
Sensor placement on Canton Tower for health monitoring using asynchronous-climb monkey algorithm

NASA Astrophysics Data System (ADS)

Yi, Ting-Hua; Li, Hong-Nan; Zhang, Xu-Dong

2012-12-01

Heuristic optimization algorithms have become a popular choice for solving complex and intricate sensor placement problems which are difficult to solve by traditional methods. This paper proposes a novel and interesting methodology called the asynchronous-climb monkey algorithm (AMA) for the optimum design of sensor arrays for a structural health monitoring system. Different from the existing algorithms, the dual-structure coding method is designed and adopted for the representation of the design variables. The asynchronous-climb process is incorporated in the proposed AMA that can adjust the trajectory of each individual dynamically in the search space according to its own experience and other monkeys. The concept of ‘monkey king’ is introduced in the AMA, which reflects the Darwinian principle of natural selection and can create an interaction network to correctly guide the movement of other monkeys. Numerical experiments are carried out using two different objective functions by considering the Canton Tower in China with or without the antenna mast to evaluate the performance of the proposed algorithm. Investigations have indicated that the proposed AMA exhibits faster convergence characteristics and can generate sensor configurations superior in all instances when compared to the conventional monkey algorithm. For structures with stiffness mutation such as the Canton Tower, the sensor placement needs to be considered for each part separately.
Digital Self-Interference Cancellation for Asynchronous In-Band Full-Duplex Underwater Acoustic Communication.

PubMed

Qiao, Gang; Gan, Shuwei; Liu, Songzuo; Ma, Lu; Sun, Zongxin

2018-05-24

To improve the throughput of underwater acoustic (UWA) networking, the In-band full-duplex (IBFD) communication is one of the most vital pieces of research. The major drawback of IBFD-UWA communication is Self-Interference (SI). This paper presents a digital SI cancellation algorithm for asynchronous IBFD-UWA communication system. We focus on two issues: one is asynchronous communication dissimilar to IBFD radio communication, the other is nonlinear distortion caused by power amplifier (PA). First, we discuss asynchronous IBFD-UWA signal model with the nonlinear distortion of PA. Then, we design a scheme for asynchronous IBFD-UWA communication utilizing the non-overlapping region between SI and intended signal to estimate the nonlinear SI channel. To cancel the nonlinear distortion caused by PA, we propose an Over-Parameterization based Recursive Least Squares (RLS) algorithm (OPRLS) to estimate the nonlinear SI channel. Furthermore, we present the OPRLS with a sparse constraint to estimate the SI channel, which reduces the requirement of the length of the non-overlapping region. Finally, we verify our concept through simulation and the pool experiment. Results demonstrate that the proposed digital SI cancellation scheme can cancel SI efficiently.
Identification of unknown spatial load distributions in a vibrating Euler-Bernoulli beam from limited measured data

NASA Astrophysics Data System (ADS)

Hasanov, Alemdar; Kawano, Alexandre

2016-05-01

Two types of inverse source problems of identifying asynchronously distributed spatial loads governed by the Euler-Bernoulli beam equation ρ (x){w}{tt}+μ (x){w}t+{({EI}(x){w}{xx})}{xx}-{T}r{u}{xx}={\\sum }m=1M{g}m(t){f}m(x), (x,t)\\in {{{Ω }}}T := (0,l)× (0,T), with hinged-clamped ends (w(0,t)={w}{xx}(0,t)=0,w(l,t) = {w}x(l,t)=0,t\\in (0,T)), are studied. Here {g}m(t) are linearly independent functions, describing an asynchronous temporal loading, and {f}m(x) are the spatial load distributions. In the first identification problem the values {ν }k(t),k=\\bar{1,K}, of the deflection w(x,t), are assumed to be known, as measured output data, in a neighbourhood of the finite set of points P:= \\{{x}k\\in (0,l),k=\\bar{1,K}\\}\\subset (0,l), corresponding to the internal points of a continuous beam, for all t\\in ]0,T[. In the second identification problem the values {θ }k(t),k=\\bar{1,K}, of the slope {w}x(x,t), are assumed to be known, as measured output data in a neighbourhood of the same set of points P for all t\\in ]0,T[. These inverse source problems will be defined subsequently as the problems ISP1 and ISP2. The general purpose of this study is to develop mathematical concepts and tools that are capable of providing effective numerical algorithms for the numerical solution of the considered class of inverse problems. Note that both measured output data {ν }k(t) and {θ }k(t) contain random noise. In the first part of the study we prove that each measured output data {ν }k(t) and {θ }k(t),k=\\bar{1,K} can uniquely determine the unknown functions {f}m\\in {H}-1(]0,l[),m=\\bar{1,M}. In the second part of the study we will introduce the input-output operators {{ K }}d :{L}2(0,T)\\mapsto {L}2(0,T),({{ K }}df)(t):= w(x,t;f),x\\in P, f(x) := ({f}1(x),\\ldots ,{f}M(x)), and {{ K }}s :{L}2(0,T)\\mapsto {L}2(0,T), ({{ K }}sf)(t):= {w}x(x,t;f), x\\in P , corresponding to the problems ISP1 and ISP2, and then reformulate these problems as the operator equations: {{ K }}df=ν and {{ K }}sf=θ , where ν (t):= ({ν }1(t),\\ldots ,{ν }K(t)) and {θ }k(t):= ({θ }1(t),\\ldots ,{θ }K(t)). Since both measured output data contain random noise, we use the most prominent regularisation method, Tikhonov regularisation, introducing the regularised cost functionals {J}1α (f):= (1/2)\\parallel {{ K }}df-ν {\\parallel }{L2(0,T)}2+(1/2)α \\parallel f{\\parallel }{L2(0,T)}2 and {J}2α (f):= (1/2)\\parallel {{ K }}sf-θ {\\parallel }{L2(0,T)}2+(1/2)α \\parallel f{\\parallel }{L2(0,T)}2. Using a priori estimates for the weak solution of the direct problem and the Tikhonov regularisation method combined with the adjoint problem approach, we prove that the Fréchet gradients {J}1\\prime (f) and {J}2\\prime (f) of both cost functionals can explicitly be derived via the corresponding weak solutions of adjoint problems and the known temporal loads {g}m(t). Moreover, we show that these gradients are Lipschitz continuous, which allows the use of gradient type iteration convergent algorithms. Two applications of the proposed theory are presented. It is shown that solvability results for inverse source problems related to the synchronous loading case, with a single interior measured data, are special cases of the obtained results for asynchronously distributed spatial load cases.
Asynchronous Incremental Stochastic Dual Descent Algorithm for Network Resource Allocation

NASA Astrophysics Data System (ADS)

Bedi, Amrit Singh; Rajawat, Ketan

2018-05-01

Stochastic network optimization problems entail finding resource allocation policies that are optimum on an average but must be designed in an online fashion. Such problems are ubiquitous in communication networks, where resources such as energy and bandwidth are divided among nodes to satisfy certain long-term objectives. This paper proposes an asynchronous incremental dual decent resource allocation algorithm that utilizes delayed stochastic {gradients} for carrying out its updates. The proposed algorithm is well-suited to heterogeneous networks as it allows the computationally-challenged or energy-starved nodes to, at times, postpone the updates. The asymptotic analysis of the proposed algorithm is carried out, establishing dual convergence under both, constant and diminishing step sizes. It is also shown that with constant step size, the proposed resource allocation policy is asymptotically near-optimal. An application involving multi-cell coordinated beamforming is detailed, demonstrating the usefulness of the proposed algorithm.
What can neuromorphic event-driven precise timing add to spike-based pattern recognition?

PubMed

Akolkar, Himanshu; Meyer, Cedric; Clady, Zavier; Marre, Olivier; Bartolozzi, Chiara; Panzeri, Stefano; Benosman, Ryad

2015-03-01

This letter introduces a study to precisely measure what an increase in spike timing precision can add to spike-driven pattern recognition algorithms. The concept of generating spikes from images by converting gray levels into spike timings is currently at the basis of almost every spike-based modeling of biological visual systems. The use of images naturally leads to generating incorrect artificial and redundant spike timings and, more important, also contradicts biological findings indicating that visual processing is massively parallel, asynchronous with high temporal resolution. A new concept for acquiring visual information through pixel-individual asynchronous level-crossing sampling has been proposed in a recent generation of asynchronous neuromorphic visual sensors. Unlike conventional cameras, these sensors acquire data not at fixed points in time for the entire array but at fixed amplitude changes of their input, resulting optimally sparse in space and time-pixel individually and precisely timed only if new, (previously unknown) information is available (event based). This letter uses the high temporal resolution spiking output of neuromorphic event-based visual sensors to show that lowering time precision degrades performance on several recognition tasks specifically when reaching the conventional range of machine vision acquisition frequencies (30-60 Hz). The use of information theory to characterize separability between classes for each temporal resolution shows that high temporal acquisition provides up to 70% more information that conventional spikes generated from frame-based acquisition as used in standard artificial vision, thus drastically increasing the separability between classes of objects. Experiments on real data show that the amount of information loss is correlated with temporal precision. Our information-theoretic study highlights the potentials of neuromorphic asynchronous visual sensors for both practical applications and theoretical investigations. Moreover, it suggests that representing visual information as a precise sequence of spike times as reported in the retina offers considerable advantages for neuro-inspired visual computations.

Parallel tempering simulation of the three-dimensional Edwards-Anderson model with compact asynchronous multispin coding on GPU

NASA Astrophysics Data System (ADS)

Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark

2014-10-01

Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
Material Implementation of Hyperincursive Field on Slime Mold Computer

NASA Astrophysics Data System (ADS)

Aono, Masashi; Gunji, Yukio-Pegio

2004-08-01

"Elementary Conflictable Cellular Automaton (ECCA)" was introduced by Aono and Gunji as a problematic computational syntax embracing the non-deterministic/non-algorithmic property due to its hyperincursivity and nonlocality. Although ECCA's hyperincursive evolution equation indicates the occurrence of the deadlock/infinite-loop, we do not consider that this problem declares the fundamental impossibility of implementing ECCA materially. Dubois proposed to call a computing system where uncertainty/contradiction occurs "the hyperincursive field". In this paper we introduce a material implementation of the hyperincursive field by using plasmodia of the true slime mold Physarum polycephalum. The amoeboid organism is adopted as a computing media of ECCA slime mold computer (ECCA-SMC) mainly because; it is a parallel non-distributed system whose locally branched tips (components) can act in parallel with asynchronism and nonlocal correlation. A notable characteristic of ECCA-SMC is that a cell representing a spatio-temporal segment of computation is occupied (overlapped) redundantly by multiple spatially adjacent computing operations and by temporally successive computing events. The overlapped time representation may contribute to the progression of discussions on unconventional notions of the time.
An improved parallel fuzzy connected image segmentation method based on CUDA.

PubMed

Wang, Liansheng; Li, Dong; Huang, Shaohui

2016-05-12

Fuzzy connectedness method (FC) is an effective method for extracting fuzzy objects from medical images. However, when FC is applied to large medical image datasets, its running time will be greatly expensive. Therefore, a parallel CUDA version of FC (CUDA-kFOE) was proposed by Ying et al. to accelerate the original FC. Unfortunately, CUDA-kFOE does not consider the edges between GPU blocks, which causes miscalculation of edge points. In this paper, an improved algorithm is proposed by adding a correction step on the edge points. The improved algorithm can greatly enhance the calculation accuracy. In the improved method, an iterative manner is applied. In the first iteration, the affinity computation strategy is changed and a look up table is employed for memory reduction. In the second iteration, the error voxels because of asynchronism are updated again. Three different CT sequences of hepatic vascular with different sizes were used in the experiments with three different seeds. NVIDIA Tesla C2075 is used to evaluate our improved method over these three data sets. Experimental results show that the improved algorithm can achieve a faster segmentation compared to the CPU version and higher accuracy than CUDA-kFOE. The calculation results were consistent with the CPU version, which demonstrates that it corrects the edge point calculation error of the original CUDA-kFOE. The proposed method has a comparable time cost and has less errors compared to the original CUDA-kFOE as demonstrated in the experimental results. In the future, we will focus on automatic acquisition method and automatic processing.
An asynchronous traversal engine for graph-based rich metadata management

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dai, Dong; Carns, Philip; Ross, Robert B.

Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenancemore » data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.« less
An asynchronous traversal engine for graph-based rich metadata management

DOE PAGES

Dai, Dong; Carns, Philip; Ross, Robert B.; ...

2016-06-23

Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenancemore » data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.« less
Multiple grid problems on concurrent-processing computers

NASA Technical Reports Server (NTRS)

Eberhardt, D. S.; Baganoff, D.

1986-01-01

Three computer codes were studied which make use of concurrent processing computer architectures in computational fluid dynamics (CFD). The three parallel codes were tested on a two processor multiple-instruction/multiple-data (MIMD) facility at NASA Ames Research Center, and are suggested for efficient parallel computations. The first code is a well-known program which makes use of the Beam and Warming, implicit, approximate factored algorithm. This study demonstrates the parallelism found in a well-known scheme and it achieved speedups exceeding 1.9 on the two processor MIMD test facility. The second code studied made use of an embedded grid scheme which is used to solve problems having complex geometries. The particular application for this study considered an airfoil/flap geometry in an incompressible flow. The scheme eliminates some of the inherent difficulties found in adapting approximate factorization techniques onto MIMD machines and allows the use of chaotic relaxation and asynchronous iteration techniques. The third code studied is an application of overset grids to a supersonic blunt body problem. The code addresses the difficulties encountered when using embedded grids on a compressible, and therefore nonlinear, problem. The complex numerical boundary system associated with overset grids is discussed and several boundary schemes are suggested. A boundary scheme based on the method of characteristics achieved the best results.
Distributed asynchronous microprocessor architectures in fault tolerant integrated flight systems

NASA Technical Reports Server (NTRS)

Dunn, W. R.

1983-01-01

The paper discusses the implementation of fault tolerant digital flight control and navigation systems for rotorcraft application. It is shown that in implementing fault tolerance at the systems level using advanced LSI/VLSI technology, aircraft physical layout and flight systems requirements tend to define a system architecture of distributed, asynchronous microprocessors in which fault tolerance can be achieved locally through hardware redundancy and/or globally through application of analytical redundancy. The effects of asynchronism on the execution of dynamic flight software is discussed. It is shown that if the asynchronous microprocessors have knowledge of time, these errors can be significantly reduced through appropiate modifications of the flight software. Finally, the papear extends previous work to show that through the combined use of time referencing and stable flight algorithms, individual microprocessors can be configured to autonomously tolerate intermittent faults.
Decentralized Quasi-Newton Methods

NASA Astrophysics Data System (ADS)

Eisen, Mark; Mokhtari, Aryan; Ribeiro, Alejandro

2017-05-01

We introduce the decentralized Broyden-Fletcher-Goldfarb-Shanno (D-BFGS) method as a variation of the BFGS quasi-Newton method for solving decentralized optimization problems. The D-BFGS method is of interest in problems that are not well conditioned, making first order decentralized methods ineffective, and in which second order information is not readily available, making second order decentralized methods impossible. D-BFGS is a fully distributed algorithm in which nodes approximate curvature information of themselves and their neighbors through the satisfaction of a secant condition. We additionally provide a formulation of the algorithm in asynchronous settings. Convergence of D-BFGS is established formally in both the synchronous and asynchronous settings and strong performance advantages relative to first order methods are shown numerically.
An Asynchronous Many-Task Implementation of In-Situ Statistical Analysis using Legion.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pebay, Philippe Pierre; Bennett, Janine Camille

2015-11-01

In this report, we propose a framework for the design and implementation of in-situ analy- ses using an asynchronous many-task (AMT) model, using the Legion programming model together with the MiniAero mini-application as a surrogate for full-scale parallel scientific computing applications. The bulk of this work consists of converting the Learn/Derive/Assess model which we had initially developed for parallel statistical analysis using MPI [PTBM11], from a SPMD to an AMT model. In this goal, we propose an original use of the concept of Legion logical regions as a replacement for the parallel communication schemes used for the only operation ofmore » the statistics engines that require explicit communication. We then evaluate this proposed scheme in a shared memory environment, using the Legion port of MiniAero as a proxy for a full-scale scientific application, as a means to provide input data sets of variable size for the in-situ statistical analyses in an AMT context. We demonstrate in particular that the approach has merit, and warrants further investigation, in collaboration with ongoing efforts to improve the overall parallel performance of the Legion system.« less
AP-IO: asynchronous pipeline I/O for hiding periodic output cost in CFD simulation.

PubMed

Xiaoguang, Ren; Xinhai, Xu

2014-01-01

Computational fluid dynamics (CFD) simulation often needs to periodically output intermediate results to files in the form of snapshots for visualization or restart, which seriously impacts the performance. In this paper, we present asynchronous pipeline I/O (AP-IO) optimization scheme for the periodically snapshot output on the basis of asynchronous I/O and CFD application characteristics. In AP-IO, dedicated background I/O processes or threads are in charge of handling the file write in pipeline mode, therefore the write overhead can be hidden with more calculation than classic asynchronous I/O. We design the framework of AP-IO and implement it in OpenFOAM, providing CFD users with a user-friendly interface. Experimental results on the Tianhe-2 supercomputer demonstrate that AP-IO can achieve a good optimization effect for the periodical snapshot output in CFD application, and the effect is especially better for massively parallel CFD simulations, which can reduce the total execution time up to about 40%.
AP-IO: Asynchronous Pipeline I/O for Hiding Periodic Output Cost in CFD Simulation

PubMed Central

Xiaoguang, Ren; Xinhai, Xu

2014-01-01

Computational fluid dynamics (CFD) simulation often needs to periodically output intermediate results to files in the form of snapshots for visualization or restart, which seriously impacts the performance. In this paper, we present asynchronous pipeline I/O (AP-IO) optimization scheme for the periodically snapshot output on the basis of asynchronous I/O and CFD application characteristics. In AP-IO, dedicated background I/O processes or threads are in charge of handling the file write in pipeline mode, therefore the write overhead can be hidden with more calculation than classic asynchronous I/O. We design the framework of AP-IO and implement it in OpenFOAM, providing CFD users with a user-friendly interface. Experimental results on the Tianhe-2 supercomputer demonstrate that AP-IO can achieve a good optimization effect for the periodical snapshot output in CFD application, and the effect is especially better for massively parallel CFD simulations, which can reduce the total execution time up to about 40%. PMID:24955390
Reliability models for dataflow computer systems

NASA Technical Reports Server (NTRS)

Kavi, K. M.; Buckles, B. P.

1985-01-01

The demands for concurrent operation within a computer system and the representation of parallelism in programming languages have yielded a new form of program representation known as data flow (DENN 74, DENN 75, TREL 82a). A new model based on data flow principles for parallel computations and parallel computer systems is presented. Necessary conditions for liveness and deadlock freeness in data flow graphs are derived. The data flow graph is used as a model to represent asynchronous concurrent computer architectures including data flow computers.
Adaptive parallel logic networks

NASA Technical Reports Server (NTRS)

Martinez, Tony R.; Vidal, Jacques J.

1988-01-01

Adaptive, self-organizing concurrent systems (ASOCS) that combine self-organization with massive parallelism for such applications as adaptive logic devices, robotics, process control, and system malfunction management, are presently discussed. In ASOCS, an adaptive network composed of many simple computing elements operating in combinational and asynchronous fashion is used and problems are specified by presenting if-then rules to the system in the form of Boolean conjunctions. During data processing, which is a different operational phase from adaptation, the network acts as a parallel hardware circuit.
UPC++ Programmer’s Guide (v1.0 2017.9)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bachan, J.; Baden, S.; Bonachea, D.

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, allmore » operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.« less
RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

NASA Technical Reports Server (NTRS)

Dunn, W. R.

1980-01-01

RAMP consists of distributed sets of parallel computers partioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.
An Environment for Incremental Development of Distributed Extensible Asynchronous Real-time Systems

NASA Technical Reports Server (NTRS)

Ames, Charles K.; Burleigh, Scott; Briggs, Hugh C.; Auernheimer, Brent

1996-01-01

Incremental parallel development of distributed real-time systems is difficult. Architectural techniques and software tools developed at the Jet Propulsion Laboratory's (JPL's) Flight System Testbed make feasible the integration of complex systems in various stages of development.
Programs as Polypeptides.

PubMed

Williams, Lance R

2016-01-01

Object-oriented combinator chemistry (OOCC) is an artificial chemistry with composition devices borrowed from object-oriented and functional programming languages. Actors in OOCC are embedded in space and subject to diffusion; since they are neither created nor destroyed, their mass is conserved. Actors use programs constructed from combinators to asynchronously update their own states and the states of other actors in their neighborhoods. The fact that programs and combinators are themselves reified as actors makes it possible to build programs that build programs from combinators of a few primitive types using asynchronous spatial processes that resemble chemistry as much as computation. To demonstrate this, OOCC is used to define a parallel, asynchronous, spatially distributed self-replicating system modeled in part on the living cell. Since interactions among its parts result in the construction of more of these same parts, the system is strongly constructive. The system's high normalized complexity is contrasted with that of a simple composome.
Design, development and use of the finite element machine

NASA Technical Reports Server (NTRS)

Adams, L. M.; Voigt, R. C.

1983-01-01

Some of the considerations that went into the design of the Finite Element Machine, a research asynchronous parallel computer are described. The present status of the system is also discussed along with some indication of the type of results that were obtained.
Myria: Scalable Analytics as a Service

NASA Astrophysics Data System (ADS)

Howe, B.; Halperin, D.; Whitaker, A.

2014-12-01

At the UW eScience Institute, we're working to empower non-experts, especially in the sciences, to write and use data-parallel algorithms. To this end, we are building Myria, a web-based platform for scalable analytics and data-parallel programming. Myria's internal model of computation is the relational algebra extended with iteration, such that every program is inherently data-parallel, just as every query in a database is inherently data-parallel. But unlike databases, iteration is a first class concept, allowing us to express machine learning tasks, graph traversal tasks, and more. Programs can be expressed in a number of languages and can be executed on a number of execution environments, but we emphasize a particular language called MyriaL that supports both imperative and declarative styles and a particular execution engine called MyriaX that uses an in-memory column-oriented representation and asynchronous iteration. We deliver Myria over the web as a service, providing an editor, performance analysis tools, and catalog browsing features in a single environment. We find that this web-based "delivery vector" is critical in reaching non-experts: they are insulated from irrelevant effort technical work associated with installation, configuration, and resource management. The MyriaX backend, one of several execution runtimes we support, is a main-memory, column-oriented, RDBMS-on-the-worker system that supports cyclic data flows as a first-class citizen and has been shown to outperform competitive systems on 100-machine cluster sizes. I will describe the Myria system, give a demo, and present some new results in large-scale oceanographic microbiology.
Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

DOE Office of Scientific and Technical Information (OSTI.GOV)

Daily, Jeffrey A.

2015-05-01

The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore’s law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of alreadymore » annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.« less

A Verification System for Distributed Objects with Asynchronous Method Calls

NASA Astrophysics Data System (ADS)

Ahrendt, Wolfgang; Dylla, Maximilian

We present a verification system for Creol, an object-oriented modeling language for concurrent distributed applications. The system is an instance of KeY, a framework for object-oriented software verification, which has so far been applied foremost to sequential Java. Building on KeY characteristic concepts, like dynamic logic, sequent calculus, explicit substitutions, and the taclet rule language, the system presented in this paper addresses functional correctness of Creol models featuring local cooperative thread parallelism and global communication via asynchronous method calls. The calculus heavily operates on communication histories which describe the interfaces of Creol units. Two example scenarios demonstrate the usage of the system.
A wavelet approach to binary blackholes with asynchronous multitasking

NASA Astrophysics Data System (ADS)

Lim, Hyun; Hirschmann, Eric; Neilsen, David; Anderson, Matthew; Debuhr, Jackson; Zhang, Bo

2016-03-01

Highly accurate simulations of binary black holes and neutron stars are needed to address a variety of interesting problems in relativistic astrophysics. We present a new method for the solving the Einstein equations (BSSN formulation) using iterated interpolating wavelets. Wavelet coefficients provide a direct measure of the local approximation error for the solution and place collocation points that naturally adapt to features of the solution. Further, they exhibit exponential convergence on unevenly spaced collection points. The parallel implementation of the wavelet simulation framework presented here deviates from conventional practice in combining multi-threading with a form of message-driven computation sometimes referred to as asynchronous multitasking.
CUDA-based high-performance computing of the S-BPF algorithm with no-waiting pipelining

NASA Astrophysics Data System (ADS)

Deng, Lin; Yan, Bin; Chang, Qingmei; Han, Yu; Zhang, Xiang; Xi, Xiaoqi; Li, Lei

2015-10-01

The backprojection-filtration (BPF) algorithm has become a good solution for local reconstruction in cone-beam computed tomography (CBCT). However, the reconstruction speed of BPF is a severe limitation for clinical applications. The selective-backprojection filtration (S-BPF) algorithm is developed to improve the parallel performance of BPF by selective backprojection. Furthermore, the general-purpose graphics processing unit (GP-GPU) is a popular tool for accelerating the reconstruction. Much work has been performed aiming for the optimization of the cone-beam back-projection. As the cone-beam back-projection process becomes faster, the data transportation holds a much bigger time proportion in the reconstruction than before. This paper focuses on minimizing the total time in the reconstruction with the S-BPF algorithm by hiding the data transportation among hard disk, CPU and GPU. And based on the analysis of the S-BPF algorithm, some strategies are implemented: (1) the asynchronous calls are used to overlap the implemention of CPU and GPU, (2) an innovative strategy is applied to obtain the DBP image to hide the transport time effectively, (3) two streams for data transportation and calculation are synchronized by the cudaEvent in the inverse of finite Hilbert transform on GPU. Our main contribution is a smart reconstruction of the S-BPF algorithm with GPU's continuous calculation and no data transportation time cost. a 5123 volume is reconstructed in less than 0.7 second on a single Tesla-based K20 GPU from 182 views projection with 5122 pixel per projection. The time cost of our implementation is about a half of that without the overlap behavior.
High Performance Asynchronous Limited Sensing Algorithms for CSMA and CSMA-CD Channels.

DTIC Science & Technology

1985-03-01

for CSM and CSMA-CD Channels" M. Geargiopoulos L. Merakos P. Papantoni-Kazakos- Technical Report UCT/ DEECS /TR-85-2 Mrh1985 iDISTRIBUTION STAEmEN 2j C...Perforance Asynchronous Limited Sensing Algoit~5 forCSMAandCSMA-CD Channels" M1. Georgiop0 ul 0 5 L. Merakos P. PaPantoni-.Kazakos Technical Report UCT/ DEECS ...SCHEDULE unlimited. 4 PERFORMING ORGANiZATION REPORT NUMBERIS) 5. MONITORING ORGANIZATION REPORT NUMBERIS" UCT/ DEECS /TR-85-2 SR-TR. 85-0 398 6& NAME OF
Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer

Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leafs of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the rootmore » processor to each processor in a communicator but instead is replicated on several intermediate nodes which then replicated the message to nodes in lower leafs. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.« less
Exploring Asynchronous Many-Task Runtime Systems toward Extreme Scales

DOE Office of Scientific and Technical Information (OSTI.GOV)

Knight, Samuel; Baker, Gavin Matthew; Gamell, Marc

2015-10-01

Major exascale computing reports indicate a number of software challenges to meet the dramatic change of system architectures in near future. While several-orders-of-magnitude increase in parallelism is the most commonly cited of those, hurdles also include performance heterogeneity of compute nodes across the system, increased imbalance between computational capacity and I/O capabilities, frequent system interrupts, and complex hardware architectures. Asynchronous task-parallel programming models show a great promise in addressing these issues, but are not yet fully understood nor developed su ciently for computational science and engineering application codes. We address these knowledge gaps through quantitative and qualitative exploration of leadingmore » candidate solutions in the context of engineering applications at Sandia. In this poster, we evaluate MiniAero code ported to three leading candidate programming models (Charm++, Legion and UINTAH) to examine the feasibility of these models that permits insertion of new programming model elements into an existing code base.« less
Data-Driven Based Asynchronous Motor Control for Printing Servo Systems

NASA Astrophysics Data System (ADS)

Bian, Min; Guo, Qingyun

Modern digital printing equipment aims to the environmental-friendly industry with high dynamic performances and control precision and low vibration and abrasion. High performance motion control system of printing servo systems was required. Control system of asynchronous motor based on data acquisition was proposed. Iterative learning control (ILC) algorithm was studied. PID control was widely used in the motion control. However, it was sensitive to the disturbances and model parameters variation. The ILC applied the history error data and present control signals to approximate the control signal directly in order to fully track the expect trajectory without the system models and structures. The motor control algorithm based on the ILC and PID was constructed and simulation results were given. The results show that data-driven control method is effective dealing with bounded disturbances for the motion control of printing servo systems.
Portable parallel portfolio optimization in the Aurora Financial Management System

NASA Astrophysics Data System (ADS)

Laure, Erwin; Moritsch, Hans

2001-07-01

Financial planning problems are formulated as large scale, stochastic, multiperiod, tree structured optimization problems. An efficient technique for solving this kind of problems is the nested Benders decomposition method. In this paper we present a parallel, portable, asynchronous implementation of this technique. To achieve our portability goals we elected the programming language Java for our implementation and used a high level Java based framework, called OpusJava, for expressing the parallelism potential as well as synchronization constraints. Our implementation is embedded within a modular decision support tool for portfolio and asset liability management, the Aurora Financial Management System.
Bayesian operational modal analysis with asynchronous data, part I: Most probable value

NASA Astrophysics Data System (ADS)

Zhu, Yi-Chen; Au, Siu-Kui

2018-01-01

In vibration tests, multiple sensors are used to obtain detailed mode shape information about the tested structure. Time synchronisation among data channels is required in conventional modal identification approaches. Modal identification can be more flexibly conducted if this is not required. Motivated by the potential gain in feasibility and economy, this work proposes a Bayesian frequency domain method for modal identification using asynchronous 'output-only' ambient data, i.e. 'operational modal analysis'. It provides a rigorous means for identifying the global mode shape taking into account the quality of the measured data and their asynchronous nature. This paper (Part I) proposes an efficient algorithm for determining the most probable values of modal properties. The method is validated using synthetic and laboratory data. The companion paper (Part II) investigates identification uncertainty and challenges in applications to field vibration data.
Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation.

DTIC Science & Technology

1986-03-01

the proposed approaches 16, 16, 40 . 451. The conclusion most often reached is that the best scheme to use in a particular design depends highly upon...76. 40 . Siegel, H. J., McMillen. R. J., and Mueller. P. T.. Jr. A survey of interconnection methods for reconligurable parallel processing systems...addressing meehaanm distributed in the network area rimonication% tit reach gigabit./second speeds je g.. PoCoS83 .’ i.V--i the lirO! lk i nitronment is
Global tree network for computing structures enabling global processing operations

DOEpatents

Blumrich; Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Hoenicke, Dirk; Steinmacher-Burow, Burkhard D.; Takken, Todd E.; Vranas, Pavlos M.

2010-01-19

A system and method for enabling high-speed, low-latency global tree network communications among processing nodes interconnected according to a tree network structure. The global tree network enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations performed include one or more of: broadcast operations downstream from a root node to leaf nodes of a virtual tree, reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node. The global tree network is configurable to provide global barrier and interrupt functionality in asynchronous or synchronized manner, and, is physically and logically partitionable.
The Use of Efficient Broadcast Protocols in Asynchronous Distributed Systems. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Schmuck, Frank Bernhard

1988-01-01

Reliable broadcast protocols are important tools in distributed and fault-tolerant programming. They are useful for sharing information and for maintaining replicated data in a distributed system. However, a wide range of such protocols has been proposed. These protocols differ in their fault tolerance and delivery ordering characteristics. There is a tradeoff between the cost of a broadcast protocol and how much ordering it provides. It is, therefore, desirable to employ protocols that support only a low degree of ordering whenever possible. This dissertation presents techniques for deciding how strongly ordered a protocol is necessary to solve a given application problem. It is shown that there are two distinct classes of application problems: problems that can be solved with efficient, asynchronous protocols, and problems that require global ordering. The concept of a linearization function that maps partially ordered sets of events to totally ordered histories is introduced. How to construct an asynchronous implementation that solves a given problem if a linearization function for it can be found is shown. It is proved that in general the question of whether a problem has an asynchronous solution is undecidable. Hence there exists no general algorithm that would automatically construct a suitable linearization function for a given problem. Therefore, an important subclass of problems that have certain commutativity properties are considered. Techniques for constructing asynchronous implementations for this class are presented. These techniques are useful for constructing efficient asynchronous implementations for a broad range of practical problems.
Porting Gravitational Wave Signal Extraction to Parallel Virtual Machine (PVM)

NASA Technical Reports Server (NTRS)

Thirumalainambi, Rajkumar; Thompson, David E.; Redmon, Jeffery

2009-01-01

Laser Interferometer Space Antenna (LISA) is a planned NASA-ESA mission to be launched around 2012. The Gravitational Wave detection is fundamentally the determination of frequency, source parameters, and waveform amplitude derived in a specific order from the interferometric time-series of the rotating LISA spacecrafts. The LISA Science Team has developed a Mock LISA Data Challenge intended to promote the testing of complicated nested search algorithms to detect the 100-1 millihertz frequency signals at amplitudes of 10E-21. However, it has become clear that, sequential search of the parameters is very time consuming and ultra-sensitive; hence, a new strategy has been developed. Parallelization of existing sequential search algorithms of Gravitational Wave signal identification consists of decomposing sequential search loops, beginning with outermost loops and working inward. In this process, the main challenge is to detect interdependencies among loops and partitioning the loops so as to preserve concurrency. Existing parallel programs are based upon either shared memory or distributed memory paradigms. In PVM, master and node programs are used to execute parallelization and process spawning. The PVM can handle process management and process addressing schemes using a virtual machine configuration. The task scheduling and the messaging and signaling can be implemented efficiently for the LISA Gravitational Wave search process using a master and 6 nodes. This approach is accomplished using a server that is available at NASA Ames Research Center, and has been dedicated to the LISA Data Challenge Competition. Historically, gravitational wave and source identification parameters have taken around 7 days in this dedicated single thread Linux based server. Using PVM approach, the parameter extraction problem can be reduced to within a day. The low frequency computation and a proxy signal-to-noise ratio are calculated in separate nodes that are controlled by the master using message and vector of data passing. The message passing among nodes follows a pattern of synchronous and asynchronous send-and-receive protocols. The communication model and the message buffers are allocated dynamically to address rapid search of gravitational wave source information in the Mock LISA data sets.
Asynchronous timing and Doppler recovery in DSP based DPSK modems for fixed and mobile satellite applications

NASA Astrophysics Data System (ADS)

Koblents, B.; Belanger, M.; Woods, D.; McLane, P. J.

While conventional analog modems employ some kind of clock wave regenerator circuit for synchronous timing recovery, in sampled modem receivers the timing is recovered asynchronously to the incoming data stream, with no adjustment being made to the input sampling rate. All timing corrections are accomplished by digital operations on the sampled data stream, and timing recovery is asynchronous with the uncontrolled, input A/D system. A good timing error measurement algorithm is a zero crossing tracker proposed by Gardner. Digital, speech rate (2400 - 4800 bps) M-PSK modem receivers employing Gardner's zero crossing tracker were implemented and tested and found to achieve BER performance very close to theoretical values on the AWGN channel. Nyguist pulse shaped modem systems with excess bandwidth factors ranging from 100 to 60 percent were considered. We can show that for any symmetric M-PSK signal set Gardner's NDA algorithm is free of pattern jitter for any carrier phase offset for rectangular pulses and for Nyquist pulses having 100 percent excess bandwidth. Also, the Nyquist pulse shaped system is studied on the mobile satellite channel, where Doppler shifts and multipath fading degrade the pi/4-DQPSK signal. Two simple modifications to Gardner's zero crossing tracker enable it to remain useful in the presence of multipath fading.
Asynchronous timing and Doppler recovery in DSP based DPSK modems for fixed and mobile satellite applications

NASA Technical Reports Server (NTRS)

Koblents, B.; Belanger, M.; Woods, D.; Mclane, P. J.

1993-01-01

While conventional analog modems employ some kind of clock wave regenerator circuit for synchronous timing recovery, in sampled modem receivers the timing is recovered asynchronously to the incoming data stream, with no adjustment being made to the input sampling rate. All timing corrections are accomplished by digital operations on the sampled data stream, and timing recovery is asynchronous with the uncontrolled, input A/D system. A good timing error measurement algorithm is a zero crossing tracker proposed by Gardner. Digital, speech rate (2400 - 4800 bps) M-PSK modem receivers employing Gardner's zero crossing tracker were implemented and tested and found to achieve BER performance very close to theoretical values on the AWGN channel. Nyguist pulse shaped modem systems with excess bandwidth factors ranging from 100 to 60 percent were considered. We can show that for any symmetric M-PSK signal set Gardner's NDA algorithm is free of pattern jitter for any carrier phase offset for rectangular pulses and for Nyquist pulses having 100 percent excess bandwidth. Also, the Nyquist pulse shaped system is studied on the mobile satellite channel, where Doppler shifts and multipath fading degrade the pi/4-DQPSK signal. Two simple modifications to Gardner's zero crossing tracker enable it to remain useful in the presence of multipath fading.
The control algorithm of the system ‘frequency converter - asynchronous motor’ of the batcher

NASA Astrophysics Data System (ADS)

Lyapushkin, S. V.; Martyushev, N. V.; Shiryaev, S. Y.

2017-01-01

The paper is devoted to the solution of the problem of optimum batching of bulk mixtures according to the criterion of accuracy and maximally possible performance. This problem is solved for applied utilization when running the system ‘frequency converter - asynchronous motor’ having pulse-width modulation of a screw batcher of agricultural equipment. The developed control algorithm allows batching small components of a bulk mixture with the prescribed accuracy due to the weight consideration of the falling column of the material being in the air after the screw stoppage. The paper also shows that in order to reduce the influence of the mass of the ‘falling column’ on the accuracy of batching, it is necessary to specify the sequence of batching of components inside of the recipe beginning from the largest component ending with the least one. To exclude the variable error of batching, which arises owing to the mass of the material column, falling into the batcher-bunker, the algorithm of dynamic correction of the task is used in the control system.
Asynchronous collision integrators: Explicit treatment of unilateral contact with friction and nodal restraints

PubMed Central

Wolff, Sebastian; Bucher, Christian

2013-01-01

This article presents asynchronous collision integrators and a simple asynchronous method treating nodal restraints. Asynchronous discretizations allow individual time step sizes for each spatial region, improving the efficiency of explicit time stepping for finite element meshes with heterogeneous element sizes. The article first introduces asynchronous variational integration being expressed by drift and kick operators. Linear nodal restraint conditions are solved by a simple projection of the forces that is shown to be equivalent to RATTLE. Unilateral contact is solved by an asynchronous variant of decomposition contact response. Therein, velocities are modified avoiding penetrations. Although decomposition contact response is solving a large system of linear equations (being critical for the numerical efficiency of explicit time stepping schemes) and is needing special treatment regarding overconstraint and linear dependency of the contact constraints (for example from double-sided node-to-surface contact or self-contact), the asynchronous strategy handles these situations efficiently and robust. Only a single constraint involving a very small number of degrees of freedom is considered at once leading to a very efficient solution. The treatment of friction is exemplified for the Coulomb model. Special care needs the contact of nodes that are subject to restraints. Together with the aforementioned projection for restraints, a novel efficient solution scheme can be presented. The collision integrator does not influence the critical time step. Hence, the time step can be chosen independently from the underlying time-stepping scheme. The time step may be fixed or time-adaptive. New demands on global collision detection are discussed exemplified by position codes and node-to-segment integration. Numerical examples illustrate convergence and efficiency of the new contact algorithm. Copyright © 2013 The Authors. International Journal for Numerical Methods in Engineering published by John Wiley & Sons, Ltd. PMID:23970806
Wireless acoustic modules for real-time data fusion using asynchronous sniper localization algorithms

NASA Astrophysics Data System (ADS)

Hengy, S.; De Mezzo, S.; Duffner, P.; Naz, P.

2012-11-01

The presence of snipers in modern conflicts leads to high insecurity for the soldiers. In order to improve the soldier's protection against this threat, the French German Research Institute of Saint-Louis (ISL) has been conducting studies in the domain of acoustic localization of shots. Mobile antennas mounted on the soldier's helmet were initially used for real-time detection, classification and localization of sniper shots. It showed good performances in land scenarios, but also in urban scenarios if the array was in the shot corridor, meaning that the microphones first detect the direct wave and then the reflections of the Mach and muzzle waves (15% distance estimation error compared to the actual shooter array distance). Fusing data sent by multiple sensor nodes distributed on the field showed some of the limitations of the technologies that have been implemented in ISL's demonstrators. Among others, the determination of the arrays' orientation was not accurate enough, thereby degrading the performance of data fusion. Some new solutions have been developed in the past year in order to obtain better performance for data fusion. Asynchronous localization algorithms have been developed and post-processed on data measured in both free-field and urban environments with acoustic modules on the line of sight of the shooter. These results are presented in the first part of the paper. The impact of GPS position estimation error is also discussed in the article in order to evaluate the possible use of those algorithms for real-time processing using mobile acoustic nodes. In the frame of ISL's transverse project IMOTEP (IMprovement Of optical and acoustical TEchnologies for the Protection), some demonstrators are developed that will allow real-time asynchronous localization of sniper shots. An embedded detection and classification algorithm is implemented on wireless acoustic modules that send the relevant information to a central PC. Data fusion is then processed and the estimated position of the shooter is sent back to the users. A SWIR active imaging system is used for localization refinement. A built-in DSP is related to the detection/classification tasks for each acoustic module. A GPS module is used for time difference of arrival and module's position estimation. Wireless communication is supported using ZigBee technology. These acoustic modules are described in the article and first results of real-time asynchronous sniper localization using those modules are discussed.
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.
A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix

NASA Technical Reports Server (NTRS)

Shroff, Gautam

1989-01-01

A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this parallel typically display only linear convergence. Sequential norm-reducing algorithms also exit and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm and certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.

Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Brouwer, Randall Jay

1991-01-01

The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

NASA Astrophysics Data System (ADS)

Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.

2018-07-01

This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 81923 grid points for the scalar field computed with 8192 XK7 nodes.
Synchronous versus asynchronous modeling of gene regulatory networks.

PubMed

Garg, Abhishek; Di Cara, Alessandro; Xenarios, Ioannis; Mendoza, Luis; De Micheli, Giovanni

2008-09-01

In silico modeling of gene regulatory networks has gained some momentum recently due to increased interest in analyzing the dynamics of biological systems. This has been further facilitated by the increasing availability of experimental data on gene-gene, protein-protein and gene-protein interactions. The two dynamical properties that are often experimentally testable are perturbations and stable steady states. Although a lot of work has been done on the identification of steady states, not much work has been reported on in silico modeling of cellular differentiation processes. In this manuscript, we provide algorithms based on reduced ordered binary decision diagrams (ROBDDs) for Boolean modeling of gene regulatory networks. Algorithms for synchronous and asynchronous transition models have been proposed and their corresponding computational properties have been analyzed. These algorithms allow users to compute cyclic attractors of large networks that are currently not feasible using existing software. Hereby we provide a framework to analyze the effect of multiple gene perturbation protocols, and their effect on cell differentiation processes. These algorithms were validated on the T-helper model showing the correct steady state identification and Th1-Th2 cellular differentiation process. The software binaries for Windows and Linux platforms can be downloaded from http://si2.epfl.ch/~garg/genysis.html.
Iterative algorithms for large sparse linear systems on parallel computers

NASA Technical Reports Server (NTRS)

Adams, L. M.

1982-01-01

Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Simplified Parallel Domain Traversal

DOE Office of Scientific and Technical Information (OSTI.GOV)

Erickson III, David J

2011-01-01

Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep bymore » performing teleconnection analysis across ensemble runs of terascale atmospheric CO{sub 2} and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.« less
Ultrawideband asynchronous tracking system and method

NASA Technical Reports Server (NTRS)

Arndt, G. Dickey (Inventor); Ngo, Phong H. (Inventor); Phan, Chau T. (Inventor); Gross, Julia A. (Inventor); Ni, Jianjun (Inventor); Dusl, John (Inventor)

2012-01-01

A passive tracking system is provided with a plurality of ultrawideband (UWB) receivers that is asynchronous with respect to a UWB transmitter. A geometry of the tracking system may utilize a plurality of clusters with each cluster comprising a plurality of antennas. Time Difference of Arrival (TDOA) may be determined for the antennas in each cluster and utilized to determine Angle of Arrival (AOA) based on a far field assumption regarding the geometry. Parallel software communication sockets may be established with each of the plurality of UWB receivers. Transfer of waveform data may be processed by alternately receiving packets of waveform data from each UWB receiver. Cross Correlation Peak Detection (CCPD) is utilized to estimate TDOA information to reduce errors in a noisy, multipath environment.
A class of parallel algorithms for computation of the manipulator inertia matrix

NASA Technical Reports Server (NTRS)

Fijany, Amir; Bejczy, Antal K.

1989-01-01

Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Investigation of periodically driven systems by x-ray absorption spectroscopy using asynchronous data collection mode

NASA Astrophysics Data System (ADS)

Singh, H.; Donetsky, D.; Liu, J.; Attenkofer, K.; Cheng, B.; Trelewicz, J. R.; Lubomirsky, I.; Stavitski, E.; Frenkel, A. I.

2018-04-01

We report the development, testing, and demonstration of a setup for modulation excitation spectroscopy experiments at the Inner Shell Spectroscopy beamline of National Synchrotron Light Source - II. A computer algorithm and dedicated software were developed for asynchronous data processing and analysis. We demonstrate the reconstruction of X-ray absorption spectra for different time points within the modulation pulse using a model system. This setup and the software are intended for a broad range of functional materials which exhibit structural and/or electronic responses to the external stimulation, such as catalysts, energy and battery materials, and electromechanical devices.
Introducing parallelism to histogramming functions for GEM systems

NASA Astrophysics Data System (ADS)

Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech

2015-09-01

This article is an assessment of potential parallelization of histogramming algorithms in GEM detector system. Histogramming and preprocessing algorithms in MATLAB were analyzed with regard to adding parallelism. Preliminary implementation of parallel strip histogramming resulted in speedup. Analysis of algorithms parallelizability is presented. Overview of potential hardware and software support to implement parallel algorithm is discussed.
Multi-petascale highly efficient parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time andmore » supports DMA functionality allowing for parallel processing message-passing.« less
ATDM LANL FleCSI: Topology and Execution Framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bergen, Benjamin Karl

FleCSI is a compile-time configurable C++ framework designed to support multi-physics application development. As such, FleCSI attempts to provide a very general set of infrastructure design patterns that can be specialized and extended to suit the needs of a broad variety of solver and data requirements. This means that FleCSI is potentially useful to many different ECP projects. Current support includes multidimensional mesh topology, mesh geometry, and mesh adjacency information, n-dimensional hashed-tree data structures, graph partitioning interfaces, and dependency closures (to identify data dependencies between distributed-memory address spaces). FleCSI introduces a functional programming model with control, execution, and data abstractionsmore » that are consistent with state-of-the-art task-based runtimes such as Legion and Charm++. The model also provides support for fine-grained, data-parallel execution with backend support for runtimes such as OpenMP and C++17. The FleCSI abstraction layer provides the developer with insulation from the underlying runtimes, while allowing support for multiple runtime systems, including conventional models like asynchronous MPI. The intent is to give developers a concrete set of user-friendly programming tools that can be used now, while allowing flexibility in choosing runtime implementations and optimizations that can be applied to architectures and runtimes that arise in the future. This project is essential to the ECP Ristra Next-Generation Code project, part of ASC ATDM, because it provides a hierarchically parallel programming model that is consistent with the design of modern system architectures, but which allows for the straightforward expression of algorithmic parallelism in a portably performant manner.« less
Acceleration of discrete stochastic biochemical simulation using GPGPU.

PubMed

Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira

2015-01-01

For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130.
Acceleration of discrete stochastic biochemical simulation using GPGPU

PubMed Central

Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira

2015-01-01

For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130. PMID:25762936
Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

NASA Astrophysics Data System (ADS)

Qin, Cheng-Zhi; Zhan, Lijun

2012-06-01

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
The development of a scalable parallel 3-D CFD algorithm for turbomachinery. M.S. Thesis Final Report

NASA Technical Reports Server (NTRS)

Luke, Edward Allen

1993-01-01

Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Data Elevator

DOE Office of Scientific and Technical Information (OSTI.GOV)

BYNA, SUNRENDRA; DONG, BIN; WU, KESHENG

Data Elevator: Efficient Asynchronous Data Movement in Hierarchical Storage Systems Multi-layer storage subsystems, including SSD-based burst buffers and disk-based parallel file systems (PFS), are becoming part of HPC systems. However, software for this storage hierarchy is still in its infancy. Applications may have to explicitly move data among the storage layers. We propose Data Elevator for transparently and efficiently moving data between a burst buffer and a PFS. Users specify the final destination for their data, typically on PFS, Data Elevator intercepts the I/O calls, stages data on burst buffer, and then asynchronously transfers the data to their final destinationmore » in the background. This system allows extensive optimizations, such as overlapping read and write operations, choosing I/O modes, and aligning buffer boundaries. In tests with large-scale scientific applications, Data Elevator is as much as 4.2X faster than Cray DataWarp, the start-of-art software for burst buffer, and 4X faster than directly writing to PFS. The Data Elevator library uses HDF5's Virtual Object Layer (VOL) for intercepting parallel I/O calls that write data to PFS. The intercepted calls are redirected to the Data Elevator, which provides a handle to write the file in a faster and intermediate burst buffer system. Once the application finishes writing the data to the burst buffer, the Data Elevator job uses HDF5 to move the data to final destination in an asynchronous manner. Hence, using the Data Elevator library is currently useful for applications that call HDF5 for writing data files. Also, the Data Elevator depends on the HDF5 VOL functionality.« less
Superconducting magnetic energy storage for asynchronous electrical systems

DOEpatents

Boenig, Heinrich J.

1986-01-01

A superconducting magnetic energy storage coil connected in parallel between converters of two or more ac power systems provides load leveling and stability improvement to any or all of the ac systems. Control is provided to direct the charging and independently the discharging of the superconducting coil to at least a selected one of the ac power systems.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about factor 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
Parallel O(log n) algorithms for open- and closed-chain rigid multibody systems based on a new mass matrix factorization technique

NASA Technical Reports Server (NTRS)

Fijany, Amir

1993-01-01

In this paper, parallel O(log n) algorithms for computation of rigid multibody dynamics are developed. These parallel algorithms are derived by parallelization of new O(n) algorithms for the problem. The underlying feature of these O(n) algorithms is a drastically different strategy for decomposition of interbody force which leads to a new factorization of the mass matrix (M). Specifically, it is shown that a factorization of the inverse of the mass matrix in the form of the Schur Complement is derived as M(exp -1) = C - B(exp *)A(exp -1)B, wherein matrices C, A, and B are block tridiagonal matrices. The new O(n) algorithm is then derived as a recursive implementation of this factorization of M(exp -1). For the closed-chain systems, similar factorizations and O(n) algorithms for computation of Operational Space Mass Matrix lambda and its inverse lambda(exp -1) are also derived. It is shown that these O(n) algorithms are strictly parallel, that is, they are less efficient than other algorithms for serial computation of the problem. But, to our knowledge, they are the only known algorithms that can be parallelized and that lead to both time- and processor-optimal parallel algorithms for the problem, i.e., parallel O(log n) algorithms with O(n) processors. The developed parallel algorithms, in addition to their theoretical significance, are also practical from an implementation point of view due to their simple architectural requirements.
High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. Research Report. ETS RR-16-34

ERIC Educational Resources Information Center

von Davier, Matthias

2016-01-01

This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…

ProperCAD: A portable object-oriented parallel environment for VLSI CAD

NASA Technical Reports Server (NTRS)

Ramkumar, Balkrishna; Banerjee, Prithviraj

1993-01-01

Most parallel algorithms for VLSI CAD proposed to date have one important drawback: they work efficiently only on machines that they were designed for. As a result, algorithms designed to date are dependent on the architecture for which they are developed and do not port easily to other parallel architectures. A new project under way to address this problem is described. A Portable object-oriented parallel environment for CAD algorithms (ProperCAD) is being developed. The objectives of this research are (1) to develop new parallel algorithms that run in a portable object-oriented environment (CAD algorithms using a general purpose platform for portable parallel programming called CARM is being developed and a C++ environment that is truly object-oriented and specialized for CAD applications is also being developed); and (2) to design the parallel algorithms around a good sequential algorithm with a well-defined parallel-sequential interface (permitting the parallel algorithm to benefit from future developments in sequential algorithms). One CAD application that has been implemented as part of the ProperCAD project, flat VLSI circuit extraction, is described. The algorithm, its implementation, and its performance on a range of parallel machines are discussed in detail. It currently runs on an Encore Multimax, a Sequent Symmetry, Intel iPSC/2 and i860 hypercubes, a NCUBE 2 hypercube, and a network of Sun Sparc workstations. Performance data for other applications that were developed are provided: namely test pattern generation for sequential circuits, parallel logic synthesis, and standard cell placement.
Distributed Sleep Scheduling in Wireless Sensor Networks via Fractional Domatic Partitioning

NASA Astrophysics Data System (ADS)

Schumacher, André; Haanpää, Harri

We consider setting up sleep scheduling in sensor networks. We formulate the problem as an instance of the fractional domatic partition problem and obtain a distributed approximation algorithm by applying linear programming approximation techniques. Our algorithm is an application of the Garg-Könemann (GK) scheme that requires solving an instance of the minimum weight dominating set (MWDS) problem as a subroutine. Our two main contributions are a distributed implementation of the GK scheme for the sleep-scheduling problem and a novel asynchronous distributed algorithm for approximating MWDS based on a primal-dual analysis of Chvátal's set-cover algorithm. We evaluate our algorithm with ns2 simulations.
Distributed parameter estimation in unreliable sensor networks via broadcast gossip algorithms.

PubMed

Wang, Huiwei; Liao, Xiaofeng; Wang, Zidong; Huang, Tingwen; Chen, Guo

2016-01-01

In this paper, we present an asynchronous algorithm to estimate the unknown parameter under an unreliable network which allows new sensors to join and old sensors to leave, and can tolerate link failures. Each sensor has access to partially informative measurements when it is awakened. In addition, the proposed algorithm can avoid the interference among messages and effectively reduce the accumulated measurement and quantization errors. Based on the theory of stochastic approximation, we prove that our proposed algorithm almost surely converges to the unknown parameter. Finally, we present a numerical example to assess the performance and the communication cost of the algorithm. Copyright © 2015 Elsevier Ltd. All rights reserved.
A software architecture for multidisciplinary applications: Integrating task and data parallelism

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Mehrotra, Piyush; Vanrosendale, John; Zima, Hans

1994-01-01

Data parallel languages such as Vienna Fortran and HPF can be successfully applied to a wide range of numerical applications. However, many advanced scientific and engineering applications are of a multidisciplinary and heterogeneous nature and thus do not fit well into the data parallel paradigm. In this paper we present new Fortran 90 language extensions to fill this gap. Tasks can be spawned as asynchronous activities in a homogeneous or heterogeneous computing environment; they interact by sharing access to Shared Data Abstractions (SDA's). SDA's are an extension of Fortran 90 modules, representing a pool of common data, together with a set of Methods for controlled access to these data and a mechanism for providing persistent storage. Our language supports the integration of data and task parallelism as well as nested task parallelism and thus can be used to express multidisciplinary applications in a natural and efficient way.
Multitasking TORT under UNICOS: Parallel performance models and measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnett, A.; Azmy, Y.Y.

1999-09-27

The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azmy, Y.Y.; Barnett, D.A.

1999-09-27

The existing parallel algorithms in the TORT discrete ordinates were updated to function in a UNI-COS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation. Appendix G. On the Design and Modeling of Special Purpose Parallel Processing Systems.

DTIC Science & Technology

1985-05-01

unit in the data base, with knowing one generic assembly language. °-’--a 139 The 5-tuple describing single operation execution time of the operations...TSi-- generate , random eventi ( ,.0-15 tieit tmls - ((floa egus ()16 274 r Ispt imet imel I at :EVE’JS- II ktime=0.0; /0 present time 0/ rrs ptime=0.0...computing machinery capable of performing these tasks within a given time constraint. Because the majority of the available computing machinery is general
Overcoming Challenges in Kinetic Modeling of Magnetized Plasmas and Vacuum Electronic Devices

NASA Astrophysics Data System (ADS)

Omelchenko, Yuri; Na, Dong-Yeop; Teixeira, Fernando

2017-10-01

We transform the state-of-the art of plasma modeling by taking advantage of novel computational techniques for fast and robust integration of multiscale hybrid (full particle ions, fluid electrons, no displacement current) and full-PIC models. These models are implemented in 3D HYPERS and axisymmetric full-PIC CONPIC codes. HYPERS is a massively parallel, asynchronous code. The HYPERS solver does not step fields and particles synchronously in time but instead executes local variable updates (events) at their self-adaptive rates while preserving fundamental conservation laws. The charge-conserving CONPIC code has a matrix-free explicit finite-element (FE) solver based on a sparse-approximate inverse (SPAI) algorithm. This explicit solver approximates the inverse FE system matrix (``mass'' matrix) using successive sparsity pattern orders of the original matrix. It does not reduce the set of Maxwell's equations to a vector-wave (curl-curl) equation of second order but instead utilizes the standard coupled first-order Maxwell's system. We discuss the ability of our codes to accurately and efficiently account for multiscale physical phenomena in 3D magnetized space and laboratory plasmas and axisymmetric vacuum electronic devices.
Scalable Parallel Density-based Clustering and Applications

NASA Astrophysics Data System (ADS)

Patwary, Mostofa Ali

2014-04-01

Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
Ropes: Support for collective opertions among distributed threads

NASA Technical Reports Server (NTRS)

Haines, Matthew; Mehrotra, Piyush; Cronk, David

1995-01-01

Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. Recently, systems have been designed and implemented to support interprocessor communication between lightweight threads so that threads can be exploited in a distributed memory system. Their use, in this setting, has been largely restricted to supporting latency hiding techniques and functional parallelism within a single application. However, to execute data parallel codes independent of other threads in the system, collective operations and relative indexing among threads are required. This paper describes the design of ropes: a scoping mechanism for collective operations and relative indexing among threads. We present the design of ropes in the context of the Chant system, and provide performance results evaluating our initial design decisions.
Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda A [Rochester, MN; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

2012-01-10

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda E [Cambridge, MA; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

2012-04-17

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Parallel consistent labeling algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Samal, A.; Henderson, T.

Mackworth and Freuder have analyzed the time complexity of several constraint satisfaction algorithms. Mohr and Henderson have given new algorithms, AC-4 and PC-3, for arc and path consistency, respectively, and have shown that the arc consistency algorithm is optimal in time complexity and of the same order space complexity as the earlier algorithms. In this paper, they give parallel algorithms for solving node and arc consistency. They show that any parallel algorithm for enforcing arc consistency in the worst case must have O(na) sequential steps, where n is number of nodes, and a is the number of labels per node.more » They give several parallel algorithms to do arc consistency. It is also shown that they all have optimal time complexity. The results of running the parallel algorithms on a BBN Butterfly multiprocessor are also presented.« less
Opus: A Coordination Language for Multidisciplinary Applications

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Haines, Matthew; Mehrotra, Piyush; Zima, Hans; vanRosendale, John

1997-01-01

Data parallel languages, such as High Performance fortran, can be successfully applied to a wide range of numerical applications. However, many advanced scientific and engineering applications are multidisciplinary and heterogeneous in nature, and thus do not fit well into the data parallel paradigm. In this paper we present Opus, a language designed to fill this gap. The central concept of Opus is a mechanism called ShareD Abstractions (SDA). An SDA can be used as a computation server, i.e., a locus of computational activity, or as a data repository for sharing data between asynchronous tasks. SDAs can be internally data parallel, providing support for the integration of data and task parallelism as well as nested task parallelism. They can thus be used to express multidisciplinary applications in a natural and efficient way. In this paper we describe the features of the language through a series of examples and give an overview of the runtime support required to implement these concepts in parallel and distributed environments.
Analysis of the principal component algorithm in phase-shifting interferometry.

PubMed

Vargas, J; Quiroga, J Antonio; Belenguer, T

2011-06-15

We recently presented a new asynchronous demodulation method for phase-sampling interferometry. The method is based in the principal component analysis (PCA) technique. In the former work, the PCA method was derived heuristically. In this work, we present an in-depth analysis of the PCA demodulation method.
Parallel CE/SE Computations via Domain Decomposition

NASA Technical Reports Server (NTRS)

Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung

2000-01-01

This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.
Comparison of multihardware parallel implementations for a phase unwrapping algorithm

NASA Astrophysics Data System (ADS)

Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

2018-04-01

Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.

PubMed

Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie

2014-01-01

It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
A simple highly efficient non invasive EMG-based HMI.

PubMed

Vitiello, N; Olcese, U; Oddo, C M; Carpaneto, J; Micera, S; Carrozza, M C; Dario, P

2006-01-01

Muscle activity recorded non-invasively is sufficient to control a mobile robot if it is used in combination with an algorithm for its asynchronous analysis. In this paper, we show that several subjects successfully can control the movements of a robot in a structured environment made up of six rooms by contracting two different muscles using a simple algorithm. After a small training period, subjects were able to control the robot with performances comparable to those achieved manually controlling the robot.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Choudhary, Alok Nidhi

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.

On a model of three-dimensional bursting and its parallel implementation

NASA Astrophysics Data System (ADS)

Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.

2008-04-01

A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a second-order accurate in both space and time, linearly-implicit finite difference method in equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared space address paradigm. These three implementations are evaluated on two current high performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors that affect the performance of the paralled implementations, is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice more efficient than the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

NASA Technical Reports Server (NTRS)

Sun, Xian-He

1997-01-01

Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

2015-07-01

This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for current class of multicore architecture. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism however, when it comes to 3D data migration, as the data size increases the resource requirement of the algorithm also increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of Kirchhoff depth migration algorithm is the calculation of traveltime tables due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
Parallel Algorithms for Least Squares and Related Computations.

DTIC Science & Technology

1991-03-22

for dense computations in linear algebra . The work has recently been published in a general reference book on parallel algorithms by SIAM. AFO SR...written his Ph.D. dissertation with the principal investigator. (See publication 6.) • Parallel Algorithms for Dense Linear Algebra Computations. Our...and describe and to put into perspective a selection of the more important parallel algorithms for numerical linear algebra . We give a major new
Genetic algorithms using SISAL parallel programming language

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tejada, S.

1994-05-06

Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will l discuss the implementation and performance of parallel genetic algorithms in SISAL.
On the suitability of the connection machine for direct particle simulation

NASA Technical Reports Server (NTRS)

Dagum, Leonard

1990-01-01

The algorithmic structure was examined of the vectorizable Stanford particle simulation (SPS) method and the structure is reformulated in data parallel form. Some of the SPS algorithms can be directly translated to data parallel, but several of the vectorizable algorithms have no direct data parallel equivalent. This requires the development of new, strictly data parallel algorithms. In particular, a new sorting algorithm is developed to identify collision candidates in the simulation and a master/slave algorithm is developed to minimize communication cost in large table look up. Validation of the method is undertaken through test calculations for thermal relaxation of a gas, shock wave profiles, and shock reflection from a stationary wall. A qualitative measure is provided of the performance of the Connection Machine for direct particle simulation. The massively parallel architecture of the Connection Machine is found quite suitable for this type of calculation. However, there are difficulties in taking full advantage of this architecture because of lack of a broad based tradition of data parallel programming. An important outcome of this work has been new data parallel algorithms specifically of use for direct particle simulation but which also expand the data parallel diction.
Increasing processor utilization during parallel computation rundown

NASA Technical Reports Server (NTRS)

Jones, W. H.

1986-01-01

Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.
Parallel adaptive wavelet collocation method for PDEs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nejadmalayeri, Alireza, E-mail: Alireza.Nejadmalayeri@gmail.com; Vezolainen, Alexei, E-mail: Alexei.Vezolainen@Colorado.edu; Brown-Dymkoski, Eric, E-mail: Eric.Browndymkoski@Colorado.edu

2015-10-01

A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allowsmore » fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048{sup 3} using as many as 2048 CPU cores.« less
Decentralized automatic generation control of interconnected power systems incorporating asynchronous tie-lines.

PubMed

Ibraheem; Hasan, Naimul; Hussein, Arkan Ahmed

2014-01-01

This Paper presents the design of decentralized automatic generation controller for an interconnected power system using PID, Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). The designed controllers are tested on identical two-area interconnected power systems consisting of thermal power plants. The area interconnections between two areas are considered as (i) AC tie-line only (ii) Asynchronous tie-line. The dynamic response analysis is carried out for 1% load perturbation. The performance of the intelligent controllers based on GA and PSO has been compared with the conventional PID controller. The investigations of the system dynamic responses reveal that PSO has the better dynamic response result as compared with PID and GA controller for both type of area interconnection.
Runtime support for parallelizing data mining algorithms

NASA Astrophysics Data System (ADS)

Jin, Ruoming; Agrawal, Gagan

2002-03-01

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the technique we have developed starting from a common specification of the algorithm.
Parallel transformation of K-SVD solar image denoising algorithm

NASA Astrophysics Data System (ADS)

Liang, Youwen; Tian, Yu; Li, Mei

2017-02-01

The images obtained by observing the sun through a large telescope always suffered with noise due to the low SNR. K-SVD denoising algorithm can effectively remove Gauss white noise. Training dictionaries for sparse representations is a time consuming task, due to the large size of the data involved and to the complexity of the training algorithms. In this paper, an OpenMP parallel programming language is proposed to transform the serial algorithm to the parallel version. Data parallelism model is used to transform the algorithm. Not one atom but multiple atoms updated simultaneously is the biggest change. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. Speedup of the program is 13.563 in condition of using 16 cores. This parallel version can fully utilize the multi-core CPU hardware resources, greatly reduce running time and easily to transplant in multi-core platform.
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer

NASA Technical Reports Server (NTRS)

Jones, Mark Howard

1987-01-01

A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Exploiting parallel computing with limited program changes using a network of microcomputers

NASA Technical Reports Server (NTRS)

Rogers, J. L., Jr.; Sobieszczanski-Sobieski, J.

1985-01-01

Network computing and multiprocessor computers are two discernible trends in parallel processing. The computational behavior of an iterative distributed process in which some subtasks are completed later than others because of an imbalance in computational requirements is of significant interest. The effects of asynchronus processing was studied. A small existing program was converted to perform finite element analysis by distributing substructure analysis over a network of four Apple IIe microcomputers connected to a shared disk, simulating a parallel computer. The substructure analysis uses an iterative, fully stressed, structural resizing procedure. A framework of beams divided into three substructures is used as the finite element model. The effects of asynchronous processing on the convergence of the design variables are determined by not resizing particular substructures on various iterations.
Massively parallel algorithms for real-time wavefront control of a dense adaptive optics system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fijany, A.; Milman, M.; Redding, D.

1994-12-31

In this paper massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. The authors have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although, this represents an optimal computation, due the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. They develop a novel algorithm,more » designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. Due to these features, this algorithm is significantly more efficient than other Fast Poisson Solvers for implementation on massively parallel architectures. The authors also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.« less
Communication library for run-time visualization of distributed, asynchronous data

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rowlan, J.; Wightman, B.T.

1994-04-01

In this paper we present a method for collecting and visualizing data generated by a parallel computational simulation during run time. Data distributed across multiple processes is sent across parallel communication lines to a remote workstation, which sorts and queues the data for visualization. We have implemented our method in a set of tools called PORTAL (for Parallel aRchitecture data-TrAnsfer Library). The tools comprise generic routines for sending data from a parallel program (callable from either C or FORTRAN), a semi-parallel communication scheme currently built upon Unix Sockets, and a real-time connection to the scientific visualization program AVS. Our methodmore » is most valuable when used to examine large datasets that can be efficiently generated and do not need to be stored on disk. The PORTAL source libraries, detailed documentation, and a working example can be obtained by anonymous ftp from info.mcs.anl.gov from the file portal.tar.Z from the directory pub/portal.« less
Parallel Algorithms for Groebner-Basis Reduction

DTIC Science & Technology

1987-09-25

22209 ELEMENT NO. NO. NO. ACCESSION NO. 11. TITLE (Include Security Classification) * PARALLEL ALGORITHMS FOR GROEBNER -BASIS REDUCTION 12. PERSONAL...All other editions are obsolete. Productivity Engineering in the UNIXt Environment p Parallel Algorithms for Groebner -Basis Reduction Technical Report
Parallel and fault-tolerant algorithms for hypercube multiprocessors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aykanat, C.

1988-01-01

Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vectormore » hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.« less
Hybrid massively parallel fast sweeping method for static Hamilton-Jacobi equations

NASA Astrophysics Data System (ADS)

Detrixhe, Miles; Gibou, Frédéric

2016-10-01

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton-Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling, and show state-of-the-art speedup values for the fast sweeping method.
A novel asynchronous access method with binary interfaces

PubMed Central

2008-01-01

Background Traditionally synchronous access strategies require users to comply with one or more time constraints in order to communicate intent with a binary human-machine interface (e.g., mechanical, gestural or neural switches). Asynchronous access methods are preferable, but have not been used with binary interfaces in the control of devices that require more than two commands to be successfully operated. Methods We present the mathematical development and evaluation of a novel asynchronous access method that may be used to translate sporadic activations of binary interfaces into distinct outcomes for the control of devices requiring an arbitrary number of commands to be controlled. With this method, users are required to activate their interfaces only when the device under control behaves erroneously. Then, a recursive algorithm, incorporating contextual assumptions relevant to all possible outcomes, is used to obtain an informed estimate of user intention. We evaluate this method by simulating a control task requiring a series of target commands to be tracked by a model user. Results When compared to a random selection, the proposed asynchronous access method offers a significant reduction in the number of interface activations required from the user. Conclusion This novel access method offers a variety of advantages over traditionally synchronous access strategies and may be adapted to a wide variety of contexts, with primary relevance to applications involving direct object manipulation. PMID:18959797
Asynchronous signal-dependent non-uniform sampler

NASA Astrophysics Data System (ADS)

Can-Cimino, Azime; Chaparro, Luis F.; Sejdić, Ervin

2014-05-01

Analog sparse signals resulting from biomedical and sensing network applications are typically non-stationary with frequency-varying spectra. By ignoring that the maximum frequency of their spectra is changing, uniform sampling of sparse signals collects unnecessary samples in quiescent segments of the signal. A more appropriate sampling approach would be signal-dependent. Moreover, in many of these applications power consumption and analog processing are issues of great importance that need to be considered. In this paper we present a signal dependent non-uniform sampler that uses a Modified Asynchronous Sigma Delta Modulator which consumes low-power and can be processed using analog procedures. Using Prolate Spheroidal Wave Functions (PSWF) interpolation of the original signal is performed, thus giving an asynchronous analog to digital and digital to analog conversion. Stable solutions are obtained by using modulated PSWFs functions. The advantage of the adapted asynchronous sampler is that range of frequencies of the sparse signal is taken into account avoiding aliasing. Moreover, it requires saving only the zero-crossing times of the non-uniform samples, or their differences, and the reconstruction can be done using their quantized values and a PSWF-based interpolation. The range of frequencies analyzed can be changed and the sampler can be implemented as a bank of filters for unknown range of frequencies. The performance of the proposed algorithm is illustrated with an electroencephalogram (EEG) signal.

Parallel Algorithms for Switching Edges in Heterogeneous Graphs.

PubMed

Bhuiyan, Hasanuzzaman; Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

2017-06-01

An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.
Parallel Algorithms for Switching Edges in Heterogeneous Graphs☆

PubMed Central

Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

2017-01-01

An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors. PMID:28757680
Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Scherrer, Chad; Halappanavar, Mahantesh; Tewari, Ambuj

2012-07-03

We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases---Thread-Greedy CD and Coloring-Based CD---and give performance measurements for an OpenMP implementation of these.
Parallel language constructs for tensor product computations on loosely coupled architectures

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Van Rosendale, John

1989-01-01

A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The authors focus on tensor product array computations, a simple but important class of numerical algorithms. They consider first the problem of programming one-dimensional kernel routines, such as parallel tridiagonal solvers, and then look at how such parallel kernels can be combined to form parallel tensor product algorithms.
An interactive control algorithm used for equilateral triangle formation with robotic sensors.

PubMed

Li, Xiang; Chen, Hongcai

2014-04-22

This paper describes an interactive control algorithm, called Triangle Formation Algorithm (TFA), used for three neighboring robotic sensors which are distributed randomly to self-organize into and equilateral triangle (E) formation. The algorithm is proposed based on the triangular geometry and considering the actual sensors used in robotics. In particular, the stability of the TFA, which can be executed by robotic sensors independently and asynchronously for E formation, is analyzed in details based on Lyapunov stability theory. Computer simulations are carried out for verifying the effectiveness of the TFA. The analytical results and simulation studies indicate that three neighboring robots employing conventional sensors can self-organize into E formations successfully regardless of their initial distribution using the same TFAs.
An Interactive Control Algorithm Used for Equilateral Triangle Formation with Robotic Sensors

PubMed Central

Li, Xiang; Chen, Hongcai

2014-01-01

This paper describes an interactive control algorithm, called Triangle Formation Algorithm (TFA), used for three neighboring robotic sensors which are distributed randomly to self-organize into and equilateral triangle (E) formation. The algorithm is proposed based on the triangular geometry and considering the actual sensors used in robotics. In particular, the stability of the TFA, which can be executed by robotic sensors independently and asynchronously for E formation, is analyzed in details based on Lyapunov stability theory. Computer simulations are carried out for verifying the effectiveness of the TFA. The analytical results and simulation studies indicate that three neighboring robots employing conventional sensors can self-organize into E formations successfully regardless of their initial distribution using the same TFAs. PMID:24759118
Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce.

PubMed

Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan

2016-01-01

A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network's initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data.
Fiber optic cable-based high-resolution, long-distance VGA extenders

NASA Astrophysics Data System (ADS)

Rhee, Jin-Geun; Lee, Iksoo; Kim, Heejoon; Kim, Sungjoon; Koh, Yeon-Wan; Kim, Hoik; Lim, Jiseok; Kim, Chur; Kim, Jungwon

2013-02-01

Remote transfer of high-resolution video information finds more applications in detached display applications for large facilities such as theaters, sports complex, airports, and security facilities. Active optical cables (AOCs) provide a promising approach for enhancing both the transmittable resolution and distance that standard copper-based cables cannot reach. In addition to the standard digital formats such as HDMI, the high-resolution, long-distance transfer of VGA format signals is important for applications where high-resolution analog video ports should be also supported, such as military/defense applications and high-resolution video camera links. In this presentation we present the development of a compressionless, high-resolution (up to WUXGA, 1920x1200), long-distance (up to 2 km) VGA extenders based on serialized technique. We employed asynchronous serial transmission and clock regeneration techniques, which enables lower cost implementation of VGA extenders by removing the necessity for clock transmission and large memory at the receiver. Two 3.125-Gbps transceivers are used in parallel to meet the required maximum video data rate of 6.25 Gbps. As the data are transmitted asynchronously, 24-bit pixel clock time stamp is employed to regenerate video pixel clock accurately at the receiver side. In parallel to the video information, stereo audio and RS-232 control signals are transmitted as well.
Multirate-based fast parallel algorithms for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2012-07-01

Novel algorithms for the multirate and fast parallel implementation of the 2-D discrete Hartley transform (DHT)-based real-valued discrete Gabor transform (RDGT) and its inverse transform are presented in this paper. A 2-D multirate-based analysis convolver bank is designed for the 2-D RDGT, and a 2-D multirate-based synthesis convolver bank is designed for the 2-D inverse RDGT. The parallel channels in each of the two convolver banks have a unified structure and can apply the 2-D fast DHT algorithm to speed up their computations. The computational complexity of each parallel channel is low and is independent of the Gabor oversampling rate. All the 2-D RDGT coefficients of an image are computed in parallel during the analysis process and can be reconstructed in parallel during the synthesis process. The computational complexity and time of the proposed parallel algorithms are analyzed and compared with those of the existing fastest algorithms for 2-D discrete Gabor transforms. The results indicate that the proposed algorithms are the fastest, which make them attractive for real-time image processing.
An efficient parallel algorithm for matrix-vector multiplication

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hendrickson, B.; Leland, R.; Plimpton, S.

The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/[radical]p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in themore » well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.« less
Hybrid massively parallel fast sweeping method for static Hamilton–Jacobi equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Detrixhe, Miles, E-mail: mdetrixhe@engineering.ucsb.edu; University of California Santa Barbara, Santa Barbara, CA, 93106; Gibou, Frédéric, E-mail: fgibou@engineering.ucsb.edu

The fast sweeping method is a popular algorithm for solving a variety of static Hamilton–Jacobi equations. Fast sweeping algorithms for parallel computing have been developed, but are severely limited. In this work, we present a multilevel, hybrid parallel algorithm that combines the desirable traits of two distinct parallel methods. The fine and coarse grained components of the algorithm take advantage of heterogeneous computer architecture common in high performance computing facilities. We present the algorithm and demonstrate its effectiveness on a set of example problems including optimal control, dynamic games, and seismic wave propagation. We give results for convergence, parallel scaling,more » and show state-of-the-art speedup values for the fast sweeping method.« less
Optimal Design of Passive Power Filters Based on Pseudo-parallel Genetic Algorithm

NASA Astrophysics Data System (ADS)

Li, Pei; Li, Hongbo; Gao, Nannan; Niu, Lin; Guo, Liangfeng; Pei, Ying; Zhang, Yanyan; Xu, Minmin; Chen, Kerui

2017-05-01

The economic costs together with filter efficiency are taken as targets to optimize the parameter of passive filter. Furthermore, the method of combining pseudo-parallel genetic algorithm with adaptive genetic algorithm is adopted in this paper. In the early stages pseudo-parallel genetic algorithm is introduced to increase the population diversity, and adaptive genetic algorithm is used in the late stages to reduce the workload. At the same time, the migration rate of pseudo-parallel genetic algorithm is improved to change with population diversity adaptively. Simulation results show that the filter designed by the proposed method has better filtering effect with lower economic cost, and can be used in engineering.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

NASA Technical Reports Server (NTRS)

Povitsky, A.

1998-01-01

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty about two times over the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
A Primer for Telemetry Interfacing in Accordance with NASA Standards Using Low Cost FPGAs

NASA Astrophysics Data System (ADS)

McCoy, Jake; Schultz, Ted; Tutt, James; Rogers, Thomas; Miles, Drew; McEntaffer, Randall

2016-03-01

Photon counting detector systems on sounding rocket payloads often require interfacing asynchronous outputs with a synchronously clocked telemetry (TM) stream. Though this can be handled with an on-board computer, there are several low cost alternatives including custom hardware, microcontrollers and field-programmable gate arrays (FPGAs). This paper outlines how a TM interface (TMIF) for detectors on a sounding rocket with asynchronous parallel digital output can be implemented using low cost FPGAs and minimal custom hardware. Low power consumption and high speed FPGAs are available as commercial off-the-shelf (COTS) products and can be used to develop the main component of the TMIF. Then, only a small amount of additional hardware is required for signal buffering and level translating. This paper also discusses how this system can be tested with a simulated TM chain in the small laboratory setting using FPGAs and COTS specialized data acquisition products.
Architecture and method for a burst buffer using flash technology

DOEpatents

Tzelnic, Percy; Faibish, Sorin; Gupta, Uday K.; Bent, John; Grider, Gary Alan; Chen, Hsing-bung

2016-03-15

A parallel supercomputing cluster includes compute nodes interconnected in a mesh of data links for executing an MPI job, and solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage. Each solid-state storage node presents a file system interface to the MPI job, and multiple MPI processes of the MPI job write the checkpoint data to a shared file in the solid-state storage in a strided fashion, and the solid-state storage node asynchronously migrates the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writes the checkpoint data to the magnetic disk storage in a sequential fashion.
WATERLOPP V2/64: A highly parallel machine for numerical computation

NASA Astrophysics Data System (ADS)

Ostlund, Neil S.

1985-07-01

Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor comunication, global synchronization, and contention for shared resources if the performance of a large number ( n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attemps at a solution to the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.
Parallel optimization algorithms and their implementation in VLSI design

NASA Technical Reports Server (NTRS)

Lee, G.; Feeley, J. J.

1991-01-01

Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.
A parallel time integrator for noisy nonlinear oscillatory systems

NASA Astrophysics Data System (ADS)

Subber, Waad; Sarkar, Abhijit

2018-06-01

In this paper, we adapt a parallel time integration scheme to track the trajectories of noisy non-linear dynamical systems. Specifically, we formulate a parallel algorithm to generate the sample path of nonlinear oscillator defined by stochastic differential equations (SDEs) using the so-called parareal method for ordinary differential equations (ODEs). The presence of Wiener process in SDEs causes difficulties in the direct application of any numerical integration techniques of ODEs including the parareal algorithm. The parallel implementation of the algorithm involves two SDEs solvers, namely a fine-level scheme to integrate the system in parallel and a coarse-level scheme to generate and correct the required initial conditions to start the fine-level integrators. For the numerical illustration, a randomly excited Duffing oscillator is investigated in order to study the performance of the stochastic parallel algorithm with respect to a range of system parameters. The distributed implementation of the algorithm exploits Massage Passing Interface (MPI).
A parallel variable metric optimization algorithm

NASA Technical Reports Server (NTRS)

Straeter, T. A.

1973-01-01

An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment.

PubMed

Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che

2014-01-16

To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks.

Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment

PubMed Central

2014-01-01

Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high quality solutions can be obtained within relatively short time. This integrated approach is a promising way for inferring large networks. PMID:24428926
Symbolic Computation of Strongly Connected Components Using Saturation

NASA Technical Reports Server (NTRS)

Zhao, Yang; Ciardo, Gianfranco

2010-01-01

Finding strongly connected components (SCCs) in the state-space of discrete-state models is a critical task in formal verification of LTL and fair CTL properties, but the potentially huge number of reachable states and SCCs constitutes a formidable challenge. This paper is concerned with computing the sets of states in SCCs or terminal SCCs of asynchronous systems. Because of its advantages in many applications, we employ saturation on two previously proposed approaches: the Xie-Beerel algorithm and transitive closure. First, saturation speeds up state-space exploration when computing each SCC in the Xie-Beerel algorithm. Then, our main contribution is a novel algorithm to compute the transitive closure using saturation. Experimental results indicate that our improved algorithms achieve a clear speedup over previous algorithms in some cases. With the help of the new transitive closure computation algorithm, up to 10(exp 150) SCCs can be explored within a few seconds.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

NASA Technical Reports Server (NTRS)

Straeter, T. A.; Markos, A. T.

1975-01-01

A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Rapid code acquisition algorithms employing PN matched filters

NASA Technical Reports Server (NTRS)

Su, Yu T.

1988-01-01

The performance of four algorithms using pseudonoise matched filters (PNMFs), for direct-sequence spread-spectrum systems, is analyzed. They are: parallel search with fix dwell detector (PL-FDD), parallel search with sequential detector (PL-SD), parallel-serial search with fix dwell detector (PS-FDD), and parallel-serial search with sequential detector (PS-SD). The operation characteristic for each detector and the mean acquisition time for each algorithm are derived. All the algorithms are studied in conjunction with the noncoherent integration technique, which enables the system to operate in the presence of data modulation. Several previous proposals using PNMF are seen as special cases of the present algorithms.
Algorithms and programming tools for image processing on the MPP

NASA Technical Reports Server (NTRS)

Reeves, A. P.

1985-01-01

Topics addressed include: data mapping and rotational algorithms for the Massively Parallel Processor (MPP); Parallel Pascal language; documentation for the Parallel Pascal Development system; and a description of the Parallel Pascal language used on the MPP.
Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce

PubMed Central

Cao, Jianfang; Cui, Hongyan; Shi, Hao; Jiao, Lijuan

2016-01-01

A back-propagation (BP) neural network can solve complicated random nonlinear mapping problems; therefore, it can be applied to a wide range of problems. However, as the sample size increases, the time required to train BP neural networks becomes lengthy. Moreover, the classification accuracy decreases as well. To improve the classification accuracy and runtime efficiency of the BP neural network algorithm, we proposed a parallel design and realization method for a particle swarm optimization (PSO)-optimized BP neural network based on MapReduce on the Hadoop platform using both the PSO algorithm and a parallel design. The PSO algorithm was used to optimize the BP neural network’s initial weights and thresholds and improve the accuracy of the classification algorithm. The MapReduce parallel programming model was utilized to achieve parallel processing of the BP algorithm, thereby solving the problems of hardware and communication overhead when the BP neural network addresses big data. Datasets on 5 different scales were constructed using the scene image library from the SUN Database. The classification accuracy of the parallel PSO-BP neural network algorithm is approximately 92%, and the system efficiency is approximately 0.85, which presents obvious advantages when processing big data. The algorithm proposed in this study demonstrated both higher classification accuracy and improved time efficiency, which represents a significant improvement obtained from applying parallel processing to an intelligent algorithm on big data. PMID:27304987
Applications and accuracy of the parallel diagonal dominant algorithm

NASA Technical Reports Server (NTRS)

Sun, Xian-He

1993-01-01

The Parallel Diagonal Dominant (PDD) algorithm is a highly efficient, ideally scalable tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric, and anti-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the algorithm is a good candidate for the emerging massively parallel machines.
Parallel Computing Strategies for Irregular Algorithms

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

2002-01-01

Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
COMP Superscalar, an interoperable programming framework

NASA Astrophysics Data System (ADS)

Badia, Rosa M.; Conejero, Javier; Diaz, Carlos; Ejarque, Jorge; Lezzi, Daniele; Lordan, Francesc; Ramon-Cortes, Cristian; Sirvent, Raul

2015-12-01

COMPSs is a programming framework that aims to facilitate the parallelization of existing applications written in Java, C/C++ and Python scripts. For that purpose, it offers a simple programming model based on sequential development in which the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks and (ii) annotating them with annotations or standard Python decorators. A runtime system is in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources, which can be nodes in a cluster, clouds or grids. In cloud environments, COMPSs provides scalability and elasticity features allowing the dynamic provision of resources.
Modeling UV Radiation Feedback from Massive Stars. I. Implementation of Adaptive Ray-tracing Method and Tests

NASA Astrophysics Data System (ADS)

Kim, Jeong-Gyu; Kim, Woong-Tae; Ostriker, Eve C.; Skinner, M. Aaron

2017-12-01

We present an implementation of an adaptive ray-tracing (ART) module in the Athena hydrodynamics code that accurately and efficiently handles the radiative transfer involving multiple point sources on a three-dimensional Cartesian grid. We adopt a recently proposed parallel algorithm that uses nonblocking, asynchronous MPI communications to accelerate transport of rays across the computational domain. We validate our implementation through several standard test problems, including the propagation of radiation in vacuum and the expansions of various types of H II regions. Additionally, scaling tests show that the cost of a full ray trace per source remains comparable to that of the hydrodynamics update on up to ∼ {10}3 processors. To demonstrate application of our ART implementation, we perform a simulation of star cluster formation in a marginally bound, turbulent cloud, finding that its star formation efficiency is 12% when both radiation pressure forces and photoionization by UV radiation are treated. We directly compare the radiation forces computed from the ART scheme with those from the M1 closure relation. Although the ART and M1 schemes yield similar results on large scales, the latter is unable to resolve the radiation field accurately near individual point sources.
Throughput analysis of the IEEE 802.4 token bus standard under heavy load

NASA Technical Reports Server (NTRS)

Pang, Joseph; Tobagi, Fouad

1987-01-01

It has become clear in the last few years that there is a trend towards integrated digital services. Parallel to the development of public Integrated Services Digital Network (ISDN) is service integration in the local area (e.g., a campus, a building, an aircraft). The types of services to be integrated depend very much on the specific local environment. However, applications tend to generate data traffic belonging to one of two classes. According to IEEE 802.4 terminology, the first major class of traffic is termed synchronous, such as packetized voice and data generated from other applications with real-time constraints, and the second class is called asynchronous which includes most computer data traffic such as file transfer or facsimile. The IEEE 802.4 token bus protocol which was designed to support both synchronous and asynchronous traffic is examined. The protocol is basically a timer-controlled token bus access scheme. By a suitable choice of the design parameters, it can be shown that access delay is bounded for synchronous traffic. As well, the bandwidth allocated to asynchronous traffic can be controlled. A throughput analysis of the protocol under heavy load with constant channel occupation of synchronous traffic and constant token-passing times is presented.
Fusion of Asynchronous, Parallel, Unreliable Data Streams

DTIC Science & Technology

2010-09-01

channels that might be used. The two channels chosen for this study, galvanic skin response (GSR) and pulse rate, are convenient and reasonably well...vector as NA. The MDS software tool, PERMAP, uses this same abbreviation. The impact of the lack of information may vary depending on the situation...of how PERMAP (and MDS in general) functions when the input parameters are varied. That is outlined in this section; the impact of those choices is
Sublattice parallel replica dynamics.

PubMed

Martínez, Enrique; Uberuaga, Blas P; Voter, Arthur F

2014-06-01

Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998)] by combining it with the synchronous sublattice approach of Shim and Amar [ and , Phys. Rev. B 71, 125432 (2005)], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.
A Hybrid Algorithm for Period Analysis from Multiband Data with Sparse and Irregular Sampling for Arbitrary Light-curve Shapes

NASA Astrophysics Data System (ADS)

Saha, Abhijit; Vivas, A. Katherina

2017-12-01

Ongoing and future surveys with repeat imaging in multiple bands are producing (or will produce) time-spaced measurements of brightness, resulting in the identification of large numbers of variable sources in the sky. A large fraction of these are periodic variables: compilations of these are of scientific interest for a variety of purposes. Unavoidably, the data sets from many such surveys not only have sparse sampling, but also have embedded frequencies in the observing cadence that beat against the natural periodicities of any object under investigation. Such limitations can make period determination ambiguous and uncertain. For multiband data sets with asynchronous measurements in multiple passbands, we wish to maximally use the information on periodicity in a manner that is agnostic of differences in the light-curve shapes across the different channels. Given large volumes of data, computational efficiency is also at a premium. This paper develops and presents a computationally economic method for determining periodicity that combines the results from two different classes of period-determination algorithms. The underlying principles are illustrated through examples. The effectiveness of this approach for combining asynchronously sampled measurements in multiple observables that share an underlying fundamental frequency is also demonstrated.
Parallel Algorithms and Patterns

DOE Office of Scientific and Technical Information (OSTI.GOV)

Robey, Robert W.

2016-06-16

This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
Performance Evaluation in Network-Based Parallel Computing

NASA Technical Reports Server (NTRS)

Dezhgosha, Kamyar

1996-01-01

Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.
Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

PubMed

Bhandarkar, S M; Chirravuri, S; Arnold, J

1996-01-01

Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.
Exact parallel algorithms for some members of the traveling salesman problem family

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pekny, J.F.

1989-01-01

The traveling salesman problem and its many generalizations comprise one of the best known combinatorial optimization problem families. Most members of the family are NP-complete problems so that exact algorithms require an unpredictable and sometimes large computational effort. Parallel computers offer hope for providing the power required to meet these demands. A major barrier to applying parallel computers is the lack of parallel algorithms. The contributions presented in this thesis center around new exact parallel algorithms for the asymmetric traveling salesman problem (ATSP), prize collecting traveling salesman problem (PCTSP), and resource constrained traveling salesman problem (RCTSP). The RCTSP is amore » particularly difficult member of the family since finding a feasible solution is an NP-complete problem. An exact sequential algorithm is also presented for the directed hamiltonian cycle problem (DHCP). The DHCP algorithm is superior to current heuristic approaches and represents the first exact method applicable to large graphs. Computational results presented for each of the algorithms demonstrates the effectiveness of combining efficient algorithms with parallel computing methods. Performance statistics are reported for randomly generated ATSPs with 7,500 cities, PCTSPs with 200 cities, RCTSPs with 200 cities, DHCPs with 3,500 vertices, and assignment problems of size 10,000. Sequential results were collected on a Sun 4/260 engineering workstation, while parallel results were collected using a 14 and 100 processor BBN Butterfly Plus computer. The computational results represent the largest instances ever solved to optimality on any type of computer.« less
Efficient implementation of parallel three-dimensional FFT on clusters of PCs

NASA Astrophysics Data System (ADS)

Takahashi, Daisuke

2003-05-01

In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
Solving Partial Differential Equations in a data-driven multiprocessor environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gaudiot, J.L.; Lin, C.M.; Hosseiniyar, M.

1988-12-31

Partial differential equations can be found in a host of engineering and scientific problems. The emergence of new parallel architectures has spurred research in the definition of parallel PDE solvers. Concurrently, highly programmable systems such as data-how architectures have been proposed for the exploitation of large scale parallelism. The implementation of some Partial Differential Equation solvers (such as the Jacobi method) on a tagged token data-flow graph is demonstrated here. Asynchronous methods (chaotic relaxation) are studied and new scheduling approaches (the Token No-Labeling scheme) are introduced in order to support the implementation of the asychronous methods in a data-driven environment.more » New high-level data-flow language program constructs are introduced in order to handle chaotic operations. Finally, the performance of the program graphs is demonstrated by a deterministic simulation of a message passing data-flow multiprocessor. An analysis of the overhead in the data-flow graphs is undertaken to demonstrate the limits of parallel operations in dataflow PDE program graphs.« less

Computational mechanics analysis tools for parallel-vector supercomputers

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.; Qin, J.

1993-01-01

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization algorithm and domain decomposition. The source code for many of these algorithms is available from NASA Langley.
Concurrent computation of attribute filters on shared memory parallel machines.

PubMed

Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold

2008-10-01

Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.
Regional-scale calculation of the LS factor using parallel processing

NASA Astrophysics Data System (ADS)

Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong

2015-05-01

With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategy are designed according to the algorithm characters including the decomposition method for maintaining the integrity of the results, optimized workflow for reducing the time taken for exporting the unnecessary intermediate data and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
A new scheduling algorithm for parallel sparse LU factorization with static pivoting

DOE Office of Scientific and Technical Information (OSTI.GOV)

Grigori, Laura; Li, Xiaoye S.

2002-08-20

In this paper we present a static scheduling algorithm for parallel sparse LU factorization with static pivoting. The algorithm is divided into mapping and scheduling phases, using the symmetric pruned graphs of L' and U to represent dependencies. The scheduling algorithm is designed for driving the parallel execution of the factorization on a distributed-memory architecture. Experimental results and comparisons with SuperLU{_}DIST are reported after applying this algorithm on real world application matrices on an IBM SP RS/6000 distributed memory machine.
Parallel conjugate gradient algorithms for manipulator dynamic simulation

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheld, Robert E.

1989-01-01

Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n sq) on a serial processor. A conjugate gradient algorithms is presented that provide greater efficiency using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log sub 2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log sub 2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
Asynchronous machine rotor speed estimation using a tabulated numerical approach

NASA Astrophysics Data System (ADS)

Nguyen, Huu Phuc; De Miras, Jérôme; Charara, Ali; Eltabach, Mario; Bonnet, Stéphane

2017-12-01

This paper proposes a new method to estimate the rotor speed of the asynchronous machine by looking at the estimation problem as a nonlinear optimal control problem. The behavior of the nonlinear plant model is approximated off-line as a prediction map using a numerical one-step time discretization obtained from simulations. At each time-step, the speed of the induction machine is selected satisfying the dynamic fitting problem between the plant output and the predicted output, leading the system to adopt its dynamical behavior. Thanks to the limitation of the prediction horizon to a single time-step, the execution time of the algorithm can be completely bounded. It can thus easily be implemented and embedded into a real-time system to observe the speed of the real induction motor. Simulation results show the performance and robustness of the proposed estimator.
GPU-completeness: theory and implications

NASA Astrophysics Data System (ADS)

Lin, I.-Jong

2011-01-01

This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate his algorithmic space and imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HPdelivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation inform the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrate the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a welldefined subset of algorithms, GPU-Completeness is intended to connect the parallelism, algorithms and efficient architectures into a unified framework to show that multiple layers of parallel implementation are guided by the same underlying trade-off.
Crashworthiness simulations with DYNA3D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schauer, D.A.; Hoover, C.G.; Kay, G.J.

1996-04-01

Current progress in parallel algorithm research and applications in vehicle crash simulation is described for the explicit, finite element algorithms in DYNA3D. Problem partitioning methods and parallel algorithms for contact at material interfaces are the two challenging algorithm research problems that are addressed. Two prototype parallel contact algorithms have been developed for treating the cases of local and arbitrary contact. Demonstration problems for local contact are crashworthiness simulations with 222 locally defined contact surfaces and a vehicle/barrier collision modeled with arbitrary contact. A simulation of crash tests conducted for a vehicle impacting a U-channel small sign post embedded in soilmore » has been run on both the serial and parallel versions of DYNA3D. A significant reduction in computational time has been observed when running these problems on the parallel version. However, to achieve maximum efficiency, complex problems must be appropriately partitioned, especially when contact dominates the computation.« less
Line-drawing algorithms for parallel machines

NASA Technical Reports Server (NTRS)

Pang, Alex T.

1990-01-01

The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives; distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point on the line if an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware

PubMed Central

Zheng, Da; Burns, Randal; Szalay, Alexander S.

2013-01-01

We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads. PMID:24402052
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware.

PubMed

Zheng, Da; Burns, Randal; Szalay, Alexander S

2013-01-01

We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads.
Multiprocessing the Sieve of Eratosthenes

NASA Technical Reports Server (NTRS)

Bokhari, S.

1986-01-01

The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
A Parallel Rendering Algorithm for MIMD Architectures

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.; Orloff, Tobias

1991-01-01

Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
A highly efficient multi-core algorithm for clustering extremely large datasets

PubMed Central

2010-01-01

Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
Data General Corporation Advanced Operating System/Virtual Storage (AOS/ VS). Revision 7.60

DTIC Science & Technology

1989-02-22

control list for each directory and data file. An access control list includes the users who can and cannot access files as well as the access...and any required data, it can -5- February 22, 1989 Final Evaluation Report Data General AOS/VS SYSTEM OVERVIEW operate asynchronously and in parallel...memory. The IOC can perform the data transfer without further interventiin from the CPU. The I/O channels interface with the processor or system
A sweep algorithm for massively parallel simulation of circuit-switched networks

NASA Technical Reports Server (NTRS)

Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.

1992-01-01

A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
Constraint treatment techniques and parallel algorithms for multibody dynamic analysis. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Chiou, Jin-Chern

1990-01-01

Computational procedures for kinematic and dynamic analysis of three-dimensional multibody dynamic (MBD) systems are developed from the differential-algebraic equations (DAE's) viewpoint. Constraint violations during the time integration process are minimized and penalty constraint stabilization techniques and partitioning schemes are developed. The governing equations of motion, a two-stage staggered explicit-implicit numerical algorithm, are treated which takes advantage of a partitioned solution procedure. A robust and parallelizable integration algorithm is developed. This algorithm uses a two-stage staggered central difference algorithm to integrate the translational coordinates and the angular velocities. The angular orientations of bodies in MBD systems are then obtained by using an implicit algorithm via the kinematic relationship between Euler parameters and angular velocities. It is shown that the combination of the present solution procedures yields a computationally more accurate solution. To speed up the computational procedures, parallel implementation of the present constraint treatment techniques, the two-stage staggered explicit-implicit numerical algorithm was efficiently carried out. The DAE's and the constraint treatment techniques were transformed into arrowhead matrices to which Schur complement form was derived. By fully exploiting the sparse matrix structural analysis techniques, a parallel preconditioned conjugate gradient numerical algorithm is used to solve the systems equations written in Schur complement form. A software testbed was designed and implemented in both sequential and parallel computers. This testbed was used to demonstrate the robustness and efficiency of the constraint treatment techniques, the accuracy of the two-stage staggered explicit-implicit numerical algorithm, and the speed up of the Schur-complement-based parallel preconditioned conjugate gradient algorithm on a parallel computer.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azad, Ariful; Buluc, Aydn; Pothen, Alex

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

DOE PAGES

Azad, Ariful; Buluc, Aydn; Pothen, Alex

2016-03-24

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
Fast parallel approach for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2009-12-01

Two-dimensional fast Gabor transform algorithms are useful for real-time applications due to the high computational complexity of the traditional 2-D complex-valued discrete Gabor transform (CDGT). This paper presents two block time-recursive algorithms for 2-D DHT-based real-valued discrete Gabor transform (RDGT) and its inverse transform and develops a fast parallel approach for the implementation of the two algorithms. The computational complexity of the proposed parallel approach is analyzed and compared with that of the existing 2-D CDGT algorithms. The results indicate that the proposed parallel approach is attractive for real time image processing.

Communications oriented programming of parallel iterative solutions of sparse linear systems

NASA Technical Reports Server (NTRS)

Patrick, M. L.; Pratt, T. W.

1986-01-01

Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

PubMed

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

PubMed Central

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
Research on parallel algorithm for sequential pattern mining

NASA Astrophysics Data System (ADS)

Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao

2008-03-01

Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the business database and has extended to new data sources such as Web and advanced science fields such as DNA analysis. The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms haven't considered the above-mentioned characteristics synthetically. According to the traits mentioned above and combining the parallel theory, this paper puts forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn't generate the candidated sequences, which abates the access time and improves the mining efficiency. Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.
Parallel/distributed direct method for solving linear systems

NASA Technical Reports Server (NTRS)

Lin, Avi

1990-01-01

A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate paralleled algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmical changes.
Modelling and control algorithms of the cross conveyors line with multiengine variable speed drives

NASA Astrophysics Data System (ADS)

Cheremushkina, M. S.; Baburin, S. V.

2017-02-01

The paper deals with the actual problem of developing the control algorithm that meets the technical requirements of the mine belt conveyors, and enables energy and resource savings taking into account a random sort of traffic. The most effective method of solution of these tasks is the construction of control systems with the use of variable speed drives for asynchronous motors. The authors designed the mathematical model of the system ‘variable speed multiengine drive - conveyor - control system of conveyors’ that takes into account the dynamic processes occurring in the elements of the transport system, provides an assessment of the energy efficiency of application the developed algorithms, which allows one to reduce the dynamic overload in the belt to 15-20%.
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.

PubMed

Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei

2017-01-01

Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions.
cOSPREY: A Cloud-Based Distributed Algorithm for Large-Scale Computational Protein Design

PubMed Central

Pan, Yuchao; Dong, Yuxi; Zhou, Jingtian; Hallen, Mark; Donald, Bruce R.; Xu, Wei

2016-01-01

Abstract Finding the global minimum energy conformation (GMEC) of a huge combinatorial search space is the key challenge in computational protein design (CPD) problems. Traditional algorithms lack a scalable and efficient distributed design scheme, preventing researchers from taking full advantage of current cloud infrastructures. We design cloud OSPREY (cOSPREY), an extension to a widely used protein design software OSPREY, to allow the original design framework to scale to the commercial cloud infrastructures. We propose several novel designs to integrate both algorithm and system optimizations, such as GMEC-specific pruning, state search partitioning, asynchronous algorithm state sharing, and fault tolerance. We evaluate cOSPREY on three different cloud platforms using different technologies and show that it can solve a number of large-scale protein design problems that have not been possible with previous approaches. PMID:27154509
Application of integration algorithms in a parallel processing environment for the simulation of jet engines

NASA Technical Reports Server (NTRS)

Krosel, S. M.; Milner, E. J.

1982-01-01

The application of Predictor corrector integration algorithms developed for the digital parallel processing environment are investigated. The algorithms are implemented and evaluated through the use of a software simulator which provides an approximate representation of the parallel processing hardware. Test cases which focus on the use of the algorithms are presented and a specific application using a linear model of a turbofan engine is considered. Results are presented showing the effects of integration step size and the number of processors on simulation accuracy. Real time performance, interprocessor communication, and algorithm startup are also discussed.
Efficient Parallel Algorithm For Direct Numerical Simulation of Turbulent Flows

NASA Technical Reports Server (NTRS)

Moitra, Stuti; Gatski, Thomas B.

1997-01-01

A distributed algorithm for a high-order-accurate finite-difference approach to the direct numerical simulation (DNS) of transition and turbulence in compressible flows is described. This work has two major objectives. The first objective is to demonstrate that parallel and distributed-memory machines can be successfully and efficiently used to solve computationally intensive and input/output intensive algorithms of the DNS class. The second objective is to show that the computational complexity involved in solving the tridiagonal systems inherent in the DNS algorithm can be reduced by algorithm innovations that obviate the need to use a parallelized tridiagonal solver.
Efficiency Analysis of the Parallel Implementation of the SIMPLE Algorithm on Multiprocessor Computers

NASA Astrophysics Data System (ADS)

Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.

2017-12-01

This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
MULTIOBJECTIVE PARALLEL GENETIC ALGORITHM FOR WASTE MINIMIZATION

EPA Science Inventory

In this research we have developed an efficient multiobjective parallel genetic algorithm (MOPGA) for waste minimization problems. This MOPGA integrates PGAPack (Levine, 1996) and NSGA-II (Deb, 2000) with novel modifications. PGAPack is a master-slave parallel implementation of a...
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.

PubMed

Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang

2018-01-01

The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers to develop their own efficient GPU implementations with fewer efforts.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Davis, Kristan D.; Faraj, Daniel A.

2014-07-22

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Davis, Kristan D; Faraj, Daniel A

2013-07-09

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and ranges of message sizes so that each algorithm is associated with a separate range of message sizes; receiving in an origin endpoint of the PAMI a data communications instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint, the data communications message characterized by a message size; selecting, from among the associated algorithms and ranges, a data communications algorithm in dependence upon the message size; and transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Graphical Representation of Parallel Algorithmic Processes

DTIC Science & Technology

1990-12-01

interface with the AAARF main process . The source code for the AAARF class-common library is in the common subdi- rectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the...goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor
The openGL visualization of the 2D parallel FDTD algorithm

NASA Astrophysics Data System (ADS)

Walendziuk, Wojciech

2005-02-01

This paper presents a way of visualization of a two-dimensional version of a parallel algorithm of the FDTD method. The visualization module was created on the basis of the OpenGL graphic standard with the use of the GLUT interface. In addition, the work includes the results of the efficiency of the parallel algorithm in the form of speedup charts.
Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform.

PubMed

Cao, Jianfang; Chen, Lichao; Wang, Min; Tian, Yun

2018-01-01

The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance.
A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Rao, Hariprasad Nannapaneni

1989-01-01

The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
Commodity cluster and hardware-based massively parallel implementations of hyperspectral imaging algorithms

NASA Astrophysics Data System (ADS)

Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David

2006-05-01

The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has soon introduced new processing challenges. The price paid for the wealth spatial and spectral information available from hyperspectral sensors is the enormous amounts of data that they generate. Several applications exist, however, where having the desired information calculated quickly enough for practical use is highly desirable. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed-up computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques. Techniques include four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, and (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in selection of parallel hyperspectral algorithms for specific applications.

Simple and Flexible Self-Reproducing Structures in Asynchronous Cellular Automata and Their Dynamics

NASA Astrophysics Data System (ADS)

Huang, Xin; Lee, Jia; Yang, Rui-Long; Zhu, Qing-Sheng

2013-03-01

Self-reproduction on asynchronous cellular automata (ACAs) has attracted wide attention due to the evident artifacts induced by synchronous updating. Asynchronous updating, which allows cells to undergo transitions independently at random times, might be more compatible with the natural processes occurring at micro-scale, but the dark side of the coin is the increment in the complexity of an ACA in order to accomplish stable self-reproduction. This paper proposes a novel model of self-timed cellular automata (STCAs), a special type of ACAs, where unsheathed loops are able to duplicate themselves reliably in parallel. The removal of sheath cannot only allow various loops with more flexible and compact structures to replicate themselves, but also reduce the number of cell states of the STCA as compared to the previous model adopting sheathed loops [Y. Takada, T. Isokawa, F. Peper and N. Matsui, Physica D227, 26 (2007)]. The lack of sheath, on the other hand, often tends to cause much more complicated interactions among loops, when all of them struggle independently to stretch out their constructing arms at the same time. In particular, such intense collisions may even cause the emergence of a mess of twisted constructing arms in the cellular space. By using a simple and natural method, our self-reproducing loops (SRLs) are able to retract their arms successively, thereby disentangling from the mess successfully.
High performance interconnection between high data rate networks

NASA Technical Reports Server (NTRS)

Foudriat, E. C.; Maly, K.; Overstreet, C. M.; Zhang, L.; Sun, W.

1992-01-01

The bridge/gateway system needed to interconnect a wide range of computer networks to support a wide range of user quality-of-service requirements is discussed. The bridge/gateway must handle a wide range of message types including synchronous and asynchronous traffic, large, bursty messages, short, self-contained messages, time critical messages, etc. It is shown that messages can be classified into three basic classes, synchronous and large and small asynchronous messages. The first two require call setup so that packet identification, buffer handling, etc. can be supported in the bridge/gateway. Identification enables resequences in packet size. The third class is for messages which do not require call setup. Resequencing hardware based to handle two types of resequencing problems is presented. The first is for a virtual parallel circuit which can scramble channel bytes. The second system is effective in handling both synchronous and asynchronous traffic between networks with highly differing packet sizes and data rates. The two other major needs for the bridge/gateway are congestion and error control. A dynamic, lossless congestion control scheme which can easily support effective error correction is presented. Results indicate that the congestion control scheme provides close to optimal capacity under congested conditions. Under conditions where error may develop due to intervening networks which are not lossless, intermediate error recovery and correction takes 1/3 less time than equivalent end-to-end error correction under similar conditions.
Asynchronous parallel status comparator

DOEpatents

Arnold, Jeffrey W.; Hart, Mark M.

1992-01-01

Apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition.
Asynchronous parallel status comparator

DOEpatents

Arnold, J.W.; Hart, M.M.

1992-12-15

Disclosed is an apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition. 4 figs.
Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

NASA Astrophysics Data System (ADS)

Work, Paul R.

1991-12-01

This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
Microwave-based reaction screening: tandem retro-Diels-Alder/Diels-Alder cycloadditions of o-quinol dimers.

PubMed

Dong, Suwei; Cahill, Katharine J; Kang, Moon-Il; Colburn, Nancy H; Henrich, Curtis J; Wilson, Jennifer A; Beutler, John A; Johnson, Richard P; Porco, John A

2011-11-04

We have accomplished a parallel screen of cycloaddition partners for o-quinols utilizing a plate-based microwave system. Microwave irradiation improves the efficiency of retro-Diels-Alder/Diels-Alder cascades of o-quinol dimers which generally proceed in a diastereoselective fashion. Computational studies indicate that asynchronous transition states are favored in Diels-Alder cycloadditions of o-quinols. Subsequent biological evaluation of a collection of cycloadducts has identified an inhibitor of activator protein-1 (AP-1), an oncogenic transcription factor.
Workshop on Solid State Switches for Pulsed Power, held January 12-14, 1983 at Tamarron, Colorado

DTIC Science & Technology

1983-05-31

of its anticipated scalabil- ity. However, the projected performance of other types of dis- crete switches made their continued exploration and...linking of "asynchronous AC power grids. Some present installations arid projected increases are showr. in Table 2. A new commercial power application...Average Power 62.5 KW 160 KW Device RBDT (RSR) T60R SCR 2N3873 Arra , 6 Series 10 Parallel-20 Series Table 18. Applications of solid state pulse
VLSI neuroprocessors

NASA Technical Reports Server (NTRS)

Kemeny, Sabrina E.

1994-01-01

Electronic and optoelectronic hardware implementations of highly parallel computing architectures address several ill-defined and/or computation-intensive problems not easily solved by conventional computing techniques. The concurrent processing architectures developed are derived from a variety of advanced computing paradigms including neural network models, fuzzy logic, and cellular automata. Hardware implementation technologies range from state-of-the-art digital/analog custom-VLSI to advanced optoelectronic devices such as computer-generated holograms and e-beam fabricated Dammann gratings. JPL's concurrent processing devices group has developed a broad technology base in hardware implementable parallel algorithms, low-power and high-speed VLSI designs and building block VLSI chips, leading to application-specific high-performance embeddable processors. Application areas include high throughput map-data classification using feedforward neural networks, terrain based tactical movement planner using cellular automata, resource optimization (weapon-target assignment) using a multidimensional feedback network with lateral inhibition, and classification of rocks using an inner-product scheme on thematic mapper data. In addition to addressing specific functional needs of DOD and NASA, the JPL-developed concurrent processing device technology is also being customized for a variety of commercial applications (in collaboration with industrial partners), and is being transferred to U.S. industries. This viewgraph p resentation focuses on two application-specific processors which solve the computation intensive tasks of resource allocation (weapon-target assignment) and terrain based tactical movement planning using two extremely different topologies. Resource allocation is implemented as an asynchronous analog competitive assignment architecture inspired by the Hopfield network. Hardware realization leads to a two to four order of magnitude speed-up over conventional techniques and enables multiple assignments, (many to many), not achievable with standard statistical approaches. Tactical movement planning (finding the best path from A to B) is accomplished with a digital two-dimensional concurrent processor array. By exploiting the natural parallel decomposition of the problem in silicon, a four order of magnitude speed-up over optimized software approaches has been demonstrated.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chrisochoides, N.; Sukup, F.

In this paper we present a parallel implementation of the Bowyer-Watson (BW) algorithm using the task-parallel programming model. The BW algorithm constitutes an ideal mesh refinement strategy for implementing a large class of unstructured mesh generation techniques on both sequential and parallel computers, by preventing the need for global mesh refinement. Its implementation on distributed memory multicomputes using the traditional data-parallel model has been proven very inefficient due to excessive synchronization needed among processors. In this paper we demonstrate that with the task-parallel model we can tolerate synchronization costs inherent to data-parallel methods by exploring concurrency in the processor level.more » Our preliminary performance data indicate that the task- parallel approach: (i) is almost four times faster than the existing data-parallel methods, (ii) scales linearly, and (iii) introduces minimum overheads compared to the {open_quotes}best{close_quotes} sequential implementation of the BW algorithm.« less
Computational mechanics analysis tools for parallel-vector supercomputers

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O.; Nguyen, Duc T.; Baddourah, Majdi; Qin, Jiangning

1993-01-01

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigensolution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition. The source code for many of these algorithms is available.
A unifying framework for rigid multibody dynamics and serial and parallel computational issues

NASA Technical Reports Server (NTRS)

Fijany, Amir; Jain, Abhinandan

1989-01-01

A unifying framework for various formulations of the dynamics of open-chain rigid multibody systems is discussed. Their suitability for serial and parallel processing is assessed. The framework is based on the derivation of intrinsic, i.e., coordinate-free, equations of the algorithms which provides a suitable abstraction and permits a distinction to be made between the computational redundancy in the intrinsic and extrinsic equations. A set of spatial notation is used which allows the derivation of the various algorithms in a common setting and thus clarifies the relationships among them. The three classes of algorithms viz., O(n), O(n exp 2) and O(n exp 3) or the solution of the dynamics problem are investigated. Researchers begin with the derivation of O(n exp 3) algorithms based on the explicit computation of the mass matrix and it provides insight into the underlying basis of the O(n) algorithms. From a computational perspective, the optimal choice of a coordinate frame for the projection of the intrinsic equations is discussed and the serial computational complexity of the different algorithms is evaluated. The three classes of algorithms are also analyzed for suitability for parallel processing. It is shown that the problem belongs to the class of N C and the time and processor bounds are of O(log2/2(n)) and O(n exp 4), respectively. However, the algorithm that achieves the above bounds is not stable. Researchers show that the fastest stable parallel algorithm achieves a computational complexity of O(n) with O(n exp 4), respectively. However, the algorithm that achieves the above bounds is not stable. Researchers show that the fastest stable parallel algorithm achieves a computational complexity of O(n) with O(n exp 2) processors, and results from the parallelization of the O(n exp 3) serial algorithm.
Fast forward kinematics algorithm for real-time and high-precision control of the 3-RPS parallel mechanism

NASA Astrophysics Data System (ADS)

Wang, Yue; Yu, Jingjun; Pei, Xu

2018-06-01

A new forward kinematics algorithm for the mechanism of 3-RPS (R: Revolute; P: Prismatic; S: Spherical) parallel manipulators is proposed in this study. This algorithm is primarily based on the special geometric conditions of the 3-RPS parallel mechanism, and it eliminates the errors produced by parasitic motions to improve and ensure accuracy. Specifically, the errors can be less than 10-6. In this method, only the group of solutions that is consistent with the actual situation of the platform is obtained rapidly. This algorithm substantially improves calculation efficiency because the selected initial values are reasonable, and all the formulas in the calculation are analytical. This novel forward kinematics algorithm is well suited for real-time and high-precision control of the 3-RPS parallel mechanism.
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.

PubMed

Kundeti, Vamsi K; Rajasekaran, Sanguthevar; Dinh, Hieu; Vaughn, Matthew; Thapar, Vishal

2010-11-15

Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/B)Blog(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster--both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

The real-time processing is getting more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool. The watershed transform is a very data intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on the survey of the approaches for parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processor with shared memory. To achieve an efficient parallel implementation, it's necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelization of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare the OpenMP (an application programming interface for multi-Processing) with Ptheads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Parallel solution of sparse one-dimensional dynamic programming problems

NASA Technical Reports Server (NTRS)

Nicol, David M.

1989-01-01

Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.
Soft-output decoding algorithms in iterative decoding of turbo codes

NASA Technical Reports Server (NTRS)

Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.

1996-01-01

In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.
A high-performance spatial database based approach for pathology imaging algorithm evaluation

PubMed Central

Wang, Fusheng; Kong, Jun; Gao, Jingjing; Cooper, Lee A.D.; Kurc, Tahsin; Zhou, Zhengwen; Adler, David; Vergara-Niedermayr, Cristobal; Katigbak, Bryan; Brat, Daniel J.; Saltz, Joel H.

2013-01-01

Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) Create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) Develop a set of queries to support data sampling and result comparisons; (4) Achieve high performance computation capacity via a parallel data management infrastructure, parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. The validated data were formatted based on the PAIS data model and loaded into a spatial database. To support efficient data loading, we have implemented a parallel data loading tool that takes advantage of multi-core CPUs to accelerate data injection. The spatial database manages both geometric shapes and image features or classifications, and enables spatial sampling, result comparison, and result aggregation through expressive structured query language (SQL) queries with spatial extensions. To provide scalable and efficient query support, we have employed a shared nothing parallel database architecture, which distributes data homogenously across multiple database partitions to take advantage of parallel computation power and implements spatial indexing to achieve high I/O throughput. Results: Our work proposes a high performance, parallel spatial database platform for algorithm validation and comparison. This platform was evaluated by storing, managing, and comparing analysis results from a set of brain tumor whole slide images. The tools we develop are open source and available to download. Conclusions: Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. One critical component is the support for queries involving spatial predicates and comparisons. In our work, we develop an efficient data model and parallel database approach to model, normalize, manage and query large volumes of analytical image result data. Our experiments demonstrate that the data partitioning strategy and the grid-based indexing result in good data distribution across database nodes and reduce I/O overhead in spatial join queries through parallel retrieval of relevant data and quick subsetting of datasets. The set of tools in the framework provide a full pipeline to normalize, load, manage and query analytical results for algorithm evaluation. PMID:23599905
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms

PubMed Central

He, Li; Zheng, Hao; Wang, Lei

2017-01-01

Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions. PMID:29123546
Parallelization strategies for continuum-generalized method of moments on the multi-thread systems

NASA Astrophysics Data System (ADS)

Bustamam, A.; Handhika, T.; Ernastuti, Kerami, D.

2017-07-01

Continuum-Generalized Method of Moments (C-GMM) covers the Generalized Method of Moments (GMM) shortfall which is not as efficient as Maximum Likelihood estimator by using the continuum set of moment conditions in a GMM framework. However, this computation would take a very long time since optimizing regularization parameter. Unfortunately, these calculations are processed sequentially whereas in fact all modern computers are now supported by hierarchical memory systems and hyperthreading technology, which allowing for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on the multi-thread systems. First, parallel regions are detected for the original C-GMM algorithm. There are two parallel regions in the original C-GMM algorithm, that are contributed significantly to the reduction of computational time: the outer-loop and the inner-loop. Furthermore, this parallel algorithm will be implemented with standard shared-memory application programming interface, i.e. Open Multi-Processing (OpenMP). The experiment shows that the outer-loop parallelization is the best strategy for any number of observations.
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains.

PubMed

Jha, Ashwani; Flurchick, K M; Bikdash, Marwan; Kc, Dukka B

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10-15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.

Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains

PubMed Central

Jha, Ashwani; Flurchick, K. M.; Bikdash, Marwan

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors. PMID:27747230
Implementation of a vector-based tracking loop receiver in a pseudolite navigation system.

PubMed

So, Hyoungmin; Lee, Taikjin; Jeon, Sanghoon; Kim, Chongwon; Kee, Changdon; Kim, Taehee; Lee, Sanguk

2010-01-01

We propose a vector tracking loop (VTL) algorithm for an asynchronous pseudolite navigation system. It was implemented in a software receiver and experiments in an indoor navigation system were conducted. Test results show that the VTL successfully tracks signals against the near-far problem, one of the major limitations in pseudolite navigation systems, and could improve positioning availability by extending pseudolite navigation coverage.
Comparaison de méthodes d'identification des paramètres d'une machine asynchrone

NASA Astrophysics Data System (ADS)

Bellaaj-Mrabet, N.; Jelassi, K.

1998-07-01

Interests, in Genetic Algorithms (G.A.) expands rapidly. This paper consists initially to apply G.A. for identifying induction motor parameters. Next, we compare the performances with classical methods like Maximum Likelihood and classical electrotechnical methods. These methods are applied on three induction motors of different powers to compare results following a set of criteria. Les algorithmes génétiques sont des méthodes adaptatives de plus en plus utilisée pour la résolution de certains problèmes d'optimisation. Le présent travail consiste d'une part, à mettre en œuvre un A.G sur des problèmes d'identification des machines électriques, et d'autre part à comparer ses performances avec les méthodes classiques tels que la méthode du maximum de vraisemblance et la méthode électrotechnique basée sur des essais à vides et en court-circuit. Ces méthodes sont appliquées sur des machines asynchrones de différentes puissances. Les résultats obtenus sont comparés selon certains critères, permettant de conclure sur la validité et la performance de chaque méthode.
Interference tables: a useful model for interference analysis in asynchronous multicarrier transmission

NASA Astrophysics Data System (ADS)

Medjahdi, Yahia; Terré, Michel; Ruyet, Didier Le; Roviras, Daniel

2014-12-01

In this paper, we investigate the impact of timing asynchronism on the performance of multicarrier techniques in a spectrum coexistence context. Two multicarrier schemes are considered: cyclic prefix-based orthogonal frequency division multiplexing (CP-OFDM) with a rectangular pulse shape and filter bank-based multicarrier (FBMC) with physical layer for dynamic spectrum access and cognitive radio (PHYDYAS) and isotropic orthogonal transform algorithm (IOTA) waveforms. First, we present the general concept of the so-called power spectral density (PSD)-based interference tables which are commonly used for multicarrier interference characterization in spectrum sharing context. After highlighting the limits of this approach, we propose a new family of interference tables called `instantaneous interference tables'. The proposed tables give the interference power caused by a given interfering subcarrier on a victim one, not only as a function of the spectral distance separating both subcarriers but also with respect to the timing misalignment between the subcarrier holders. In contrast to the PSD-based interference tables, the accuracy of the proposed tables has been validated through different simulation results. Furthermore, due to the better frequency localization of both PHYDYAS and IOTA waveforms, FBMC technique is demonstrated to be more robust to timing asynchronism compared to OFDM one. Such a result makes FBMC a potential candidate for the physical layer of future cognitive radio systems.
A Fast parallel tridiagonal algorithm for a class of CFD applications

NASA Technical Reports Server (NTRS)

Moitra, Stuti; Sun, Xian-He

1996-01-01

The parallel diagonal dominant (PDD) algorithm is an efficient tridiagonal solver. This paper presents for study a variation of the PDD algorithm, the reduced PDD algorithm. The new algorithm maintains the minimum communication provided by the PDD algorithm, but has a reduced operation count. The PDD algorithm also has a smaller operation count than the conventional sequential algorithm for many applications. Accuracy analysis is provided for the reduced PDD algorithm for symmetric Toeplitz tridiagonal (STT) systems. Implementation results on Langley's Intel Paragon and IBM SP2 show that both the PDD and reduced PDD algorithms are efficient and scalable.
Parallel language constructs for tensor product computations on loosely coupled architectures

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Vanrosendale, John

1989-01-01

Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
Parallel processing in finite element structural analysis

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1987-01-01

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).
Parallel Algorithms for the Exascale Era

DOE Office of Scientific and Technical Information (OSTI.GOV)

Robey, Robert W.

New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this workmore » has been done by undergraduates and published in leading scientific journals.« less
Bioinformatics algorithm based on a parallel implementation of a machine learning approach using transducers

NASA Astrophysics Data System (ADS)

Roche-Lima, Abiel; Thulasiram, Ruppa K.

2012-02-01

Finite automata, in which each transition is augmented with an output label in addition to the familiar input label, are considered finite-state transducers. Transducers have been used to analyze some fundamental issues in bioinformatics. Weighted finite-state transducers have been proposed to pairwise alignments of DNA and protein sequences; as well as to develop kernels for computational biology. Machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on conditional probability computation. It is calculated by using techniques, such as pair-database creation, normalization (with Maximum-Likelihood normalization) and parameters optimization (with Expectation-Maximization - EM). These techniques are intrinsically costly for computation, even worse when are applied to bioinformatics, because the databases sizes are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Indeed, several experiences were developed using the parallel and sequential algorithm on Westgrid (specifically, on the Breeze cluster). As results, we obtain that our parallel algorithm is scalable, because execution times are reduced considerably when the data size parameter is increased. Another experience is developed by changing precision parameter. In this case, we obtain smaller execution times using the parallel algorithm. Finally, number of threads used to execute the parallel algorithm on the Breezy cluster is changed. In this last experience, we obtain as result that speedup is considerably increased when more threads are used; however there is a convergence for number of threads equal to or greater than 16.
Parallel image compression

NASA Technical Reports Server (NTRS)

Reif, John H.

1987-01-01

A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless test compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.
The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce

NASA Astrophysics Data System (ADS)

Chen, Xi; Zhou, Liqing

2015-12-01

With the development of satellite remote sensing technology and the remote sensing image data, traditional remote sensing image segmentation technology cannot meet the massive remote sensing image processing and storage requirements. This article put cloud computing and parallel computing technology in remote sensing image segmentation process, and build a cheap and efficient computer cluster system that uses parallel processing to achieve MeanShift algorithm of remote sensing image segmentation based on the MapReduce model, not only to ensure the quality of remote sensing image segmentation, improved split speed, and better meet the real-time requirements. The remote sensing image segmentation MeanShift algorithm parallel processing algorithm based on MapReduce shows certain significance and a realization of value.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by everymore » processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.« less
Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers

NASA Technical Reports Server (NTRS)

Sun, Xian-He; Moitra, Stuti

1996-01-01

Various tridiagonal solvers have been proposed in recent years for different parallel platforms. In this paper, the performance of three tridiagonal solvers, namely, the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm, is studied. These algorithms are designed for distributed-memory machines and are tested on an Intel Paragon and an IBM SP2 machines. Measured results are reported in terms of execution time and speedup. Analytical study are conducted for different communication topologies and for different tridiagonal systems. The measured results match the analytical results closely. In addition to address implementation issues, performance considerations such as problem sizes and models of speedup are also discussed.
Parallelization and implementation of approximate root isolation for nonlinear system by Monte Carlo

NASA Astrophysics Data System (ADS)

Khosravi, Ebrahim

1998-12-01

This dissertation solves a fundamental problem of isolating the real roots of nonlinear systems of equations by Monte-Carlo that were published by Bush Jones. This algorithm requires only function values and can be applied readily to complicated systems of transcendental functions. The implementation of this sequential algorithm provides scientists with the means to utilize function analysis in mathematics or other fields of science. The algorithm, however, is so computationally intensive that the system is limited to a very small set of variables, and this will make it unfeasible for large systems of equations. Also a computational technique was needed for investigating a metrology of preventing the algorithm structure from converging to the same root along different paths of computation. The research provides techniques for improving the efficiency and correctness of the algorithm. The sequential algorithm for this technique was corrected and a parallel algorithm is presented. This parallel method has been formally analyzed and is compared with other known methods of root isolation. The effectiveness, efficiency, enhanced overall performance of the parallel processing of the program in comparison to sequential processing is discussed. The message passing model was used for this parallel processing, and it is presented and implemented on Intel/860 MIMD architecture. The parallel processing proposed in this research has been implemented in an ongoing high energy physics experiment: this algorithm has been used to track neutrinoes in a super K detector. This experiment is located in Japan, and data can be processed on-line or off-line locally or remotely.
Implementing a Parallel Image Edge Detection Algorithm Based on the Otsu-Canny Operator on the Hadoop Platform

PubMed Central

Wang, Min; Tian, Yun

2018-01-01

The Canny operator is widely used to detect edges in images. However, as the size of the image dataset increases, the edge detection performance of the Canny operator decreases and its runtime becomes excessive. To improve the runtime and edge detection performance of the Canny operator, in this paper, we propose a parallel design and implementation for an Otsu-optimized Canny operator using a MapReduce parallel programming model that runs on the Hadoop platform. The Otsu algorithm is used to optimize the Canny operator's dual threshold and improve the edge detection performance, while the MapReduce parallel programming model facilitates parallel processing for the Canny operator to solve the processing speed and communication cost problems that occur when the Canny edge detection algorithm is applied to big data. For the experiments, we constructed datasets of different scales from the Pascal VOC2012 image database. The proposed parallel Otsu-Canny edge detection algorithm performs better than other traditional edge detection algorithms. The parallel approach reduced the running time by approximately 67.2% on a Hadoop cluster architecture consisting of 5 nodes with a dataset of 60,000 images. Overall, our approach system speeds up the system by approximately 3.4 times when processing large-scale datasets, which demonstrates the obvious superiority of our method. The proposed algorithm in this study demonstrates both better edge detection performance and improved time performance. PMID:29861711
A parallel adaptive mesh refinement algorithm

NASA Technical Reports Server (NTRS)

Quirk, James J.; Hanebutte, Ulf R.

1993-01-01

Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
A New Augmentation Based Algorithm for Extracting Maximal Chordal Subgraphs.

PubMed

Bhowmick, Sanjukta; Chen, Tzu-Yi; Halappanavar, Mahantesh

2015-02-01

A graph is chordal if every cycle of length greater than three contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms' parallelizability. In this paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. We experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

NASA Astrophysics Data System (ADS)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Synchronization Of Parallel Discrete Event Simulations

NASA Technical Reports Server (NTRS)

Steinman, Jeffrey S.

1992-01-01

Adaptive, parallel, discrete-event-simulation-synchronization algorithm, Breathing Time Buckets, developed in Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) operating system. Algorithm allows parallel simulations to process events optimistically in fluctuating time cycles that naturally adapt while simulation in progress. Combines best of optimistic and conservative synchronization strategies while avoiding major disadvantages. Algorithm processes events optimistically in time cycles adapting while simulation in progress. Well suited for modeling communication networks, for large-scale war games, for simulated flights of aircraft, for simulations of computer equipment, for mathematical modeling, for interactive engineering simulations, and for depictions of flows of information.
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine

NASA Technical Reports Server (NTRS)

Lee, C. S. G.; Lin, C. T.

1989-01-01

The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.

Efficient sequential and parallel algorithms for record linkage.

PubMed

Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

2014-01-01

Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Our sequential and parallel algorithms have been tested on a real dataset of 1,083,878 records and synthetic datasets ranging in size from 50,000 to 9,000,000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm.
New Parallel Algorithms for Landscape Evolution Model

NASA Astrophysics Data System (ADS)

Jin, Y.; Zhang, H.; Shi, Y.

2017-12-01

Most landscape evolution models (LEM) developed in the last two decades solve the diffusion equation to simulate the transportation of surface sediments. This numerical approach is difficult to parallelize due to the computation of drainage area for each node, which needs huge amount of communication if run in parallel. In order to overcome this difficulty, we developed two parallel algorithms for LEM with a stream net. One algorithm handles the partition of grid with traditional methods and applies an efficient global reduction algorithm to do the computation of drainage areas and transport rates for the stream net; the other algorithm is based on a new partition algorithm, which partitions the nodes in catchments between processes first, and then partitions the cells according to the partition of nodes. Both methods focus on decreasing communication between processes and take the advantage of massive computing techniques, and numerical experiments show that they are both adequate to handle large scale problems with millions of cells. We implemented the two algorithms in our program based on the widely used finite element library deal.II, so that it can be easily coupled with ASPECT.
Parallel digital forensics infrastructure.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liebrock, Lorie M.; Duggan, David Patrick

2009-10-01

This report documents the architecture and implementation of a Parallel Digital Forensics infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital Forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexicomore » Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics. This report documents the implementation of the parallel digital forensics (PDF) infrastructure architecture and implementation.« less
Enhancing PC Cluster-Based Parallel Branch-and-Bound Algorithms for the Graph Coloring Problem

NASA Astrophysics Data System (ADS)

Taoka, Satoshi; Takafuji, Daisuke; Watanabe, Toshimasa

A branch-and-bound algorithm (BB for short) is the most general technique to deal with various combinatorial optimization problems. Even if it is used, computation time is likely to increase exponentially. So we consider its parallelization to reduce it. It has been reported that the computation time of a parallel BB heavily depends upon node-variable selection strategies. And, in case of a parallel BB, it is also necessary to prevent increase in communication time. So, it is important to pay attention to how many and what kind of nodes are to be transferred (called sending-node selection strategy). In this paper, for the graph coloring problem, we propose some sending-node selection strategies for a parallel BB algorithm by adopting MPI for parallelization and experimentally evaluate how these strategies affect computation time of a parallel BB on a PC cluster network.
Data communications for a collective operation in a parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Faraj, Daniel A.

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks,more » a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.« less
Data communications for a collective operation in a parallel active messaging interface of a parallel computer

DOEpatents

Faraj, Daniel A

2013-07-16

Algorithm selection for data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI, including associating in the PAMI data communications algorithms and bit masks; receiving in an origin endpoint of the PAMI a collective instruction, the instruction specifying transmission of a data communications message from the origin endpoint to a target endpoint; constructing a bit mask for the received collective instruction; selecting, from among the associated algorithms and bit masks, a data communications algorithm in dependence upon the constructed bit mask; and executing the collective instruction, transmitting, according to the selected data communications algorithm from the origin endpoint to the target endpoint, the data communications message.
Using Hadoop MapReduce for Parallel Genetic Algorithms: A Comparison of the Global, Grid and Island Models.

PubMed

Ferrucci, Filomena; Salza, Pasquale; Sarro, Federica

2017-06-29

The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store operations. For this reason, the island model is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
Update on Development of Mesh Generation Algorithms in MeshKit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jain, Rajeev; Vanderzee, Evan; Mahadevan, Vijay

2015-09-30

MeshKit uses a graph-based design for coding all its meshing algorithms, which includes the Reactor Geometry (and mesh) Generation (RGG) algorithms. This report highlights the developmental updates of all the algorithms, results and future work. Parallel versions of algorithms, documentation and performance results are reported. RGG GUI design was updated to incorporate new features requested by the users; boundary layer generation and parallel RGG support were added to the GUI. Key contributions to the release, upgrade and maintenance of other SIGMA1 libraries (CGM and MOAB) were made. Several fundamental meshing algorithms for creating a robust parallel meshing pipeline in MeshKitmore » are under development. Results and current status of automated, open-source and high quality nuclear reactor assembly mesh generation algorithms such as trimesher, quadmesher, interval matching and multi-sweeper are reported.« less
Distributed shared memory for roaming large volumes.

PubMed

Castanié, Laurent; Mion, Christophe; Cavin, Xavier; Lévy, Bruno

2006-01-01

We present a cluster-based volume rendering system for roaming very large volumes. This system allows to move a gigabyte-sized probe inside a total volume of several tens or hundreds of gigabytes in real-time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that both aggregates graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming.
Parallel algorithms for computation of the manipulator inertia matrix

NASA Technical Reports Server (NTRS)

Amin-Javaheri, Masoud; Orin, David E.

1989-01-01

The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
On the impact of communication complexity in the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
An efficient parallel algorithm for the solution of a tridiagonal linear system of equations

NASA Technical Reports Server (NTRS)

Stone, H. S.

1971-01-01

Tridiagonal linear systems of equations are solved on conventional serial machines in a time proportional to N, where N is the number of equations. The conventional algorithms do not lend themselves directly to parallel computations on computers of the ILLIAC IV class, in the sense that they appear to be inherently serial. An efficient parallel algorithm is presented in which computation time grows as log sub 2 N. The algorithm is based on recursive doubling solutions of linear recurrence relations, and can be used to solve recurrence relations of all orders.
Massively Parallel Solution of Poisson Equation on Coarse Grain MIMD Architectures

NASA Technical Reports Server (NTRS)

Fijany, A.; Weinberger, D.; Roosta, R.; Gulati, S.

1998-01-01

In this paper a new algorithm, designated as Fast Invariant Imbedding algorithm, for solution of Poisson equation on vector and massively parallel MIMD architectures is presented. This algorithm achieves the same optimal computational efficiency as other Fast Poisson solvers while offering a much better structure for vector and parallel implementation. Our implementation on the Intel Delta and Paragon shows that a speedup of over two orders of magnitude can be achieved even for moderate size problems.
NDL-v2.0: A new version of the numerical differentiation library for parallel architectures

NASA Astrophysics Data System (ADS)

Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.

2014-07-01

We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm. 180(2009)1404 Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10 ms. References: [1] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

PubMed

Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

2018-03-01

Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.
Empirical study of parallel LRU simulation algorithms

NASA Technical Reports Server (NTRS)

Carr, Eric; Nicol, David M.

1994-01-01

This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithm are more complex, but have costs that are independent on the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithm implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
Robot navigation research at CESAR (Center for Engineering Systems Advanced Research)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnett, D.L.; de Saussure, G.; Pin, F.G.

1989-01-01

A considerable amount of work has been reported on the problem of robot navigation in known static terrains. Algorithms have been proposed and implemented to search for an optimum path to the goal, taking into account the finite size and shape of the robot. Not as much work has been reported on robot navigation in unknown, unstructured, or dynamic environments. A robot navigating in an unknown environment must explore with its sensors, construct an abstract representation of its global environment to plan a path to the goal, and update or revise its plan based on accumulated data obtained and processedmore » in real-time. The core of the navigation program for the CESAR robots is a production system developed on the expert-system-shell CLIPS which runs on an NCUBE hypercube on board the robot. The production system can call on C-compiled navigation procedures. The production rules can read the sensor data and address the robot's effectors. This architecture was found efficient and flexible for the development and testing of the navigation algorithms; however, in order to process intelligently unexpected emergencies, it was found necessary to be able to control the production system through externally generated asynchronous data. This led to the design of a new asynchronous production system, APS, which is now being developed on the robot. This paper will review some of the navigation algorithms developed and tested at CESAR and will discuss the need for the new APS and how it is being integrated into the robot architecture. 18 refs., 3 figs., 1 tab.« less
A Ternary Brain-Computer Interface Based on Single-Trial Readiness Potentials of Self-initiated Fine Movements: A Diversified Classification Scheme

PubMed Central

Abou Zeid, Elias; Rezazadeh Sereshkeh, Alborz; Schultz, Benjamin; Chau, Tom

2017-01-01

In recent years, the readiness potential (RP), a type of pre-movement neural activity, has been investigated for asynchronous electroencephalogram (EEG)-based brain-computer interfaces (BCIs). Since the RP is attenuated for involuntary movements, a BCI driven by RP alone could facilitate intentional control amid a plethora of unintentional movements. Previous studies have mainly attempted binary single-trial classification of RP. An RP-based BCI with three or more states would expand the options for functional control. Here, we propose a ternary BCI based on single-trial RPs. This BCI classifies amongst an idle state, a left hand and a right hand self-initiated fine movement. A pipeline of spatio-temporal filtering with per participant parameter optimization was used for feature extraction. The ternary classification was decomposed into binary classifications using a decision-directed acyclic graph (DDAG). For each class pair in the DDAG structure, an ordered diversified classifier system (ODCS-DDAG) was used to select the best among various classification algorithms or to combine the results of different classification algorithms. Using EEG data from 14 participants performing self-initiated left or right key presses, punctuated with rest periods, we compared the performance of ODCS-DDAG to a ternary classifier and four popular multiclass decomposition methods using only a single classification algorithm. ODCS-DDAG had the highest performance (0.769 Cohen's Kappa score) and was significantly better than the ternary classifier and two of the four multiclass decomposition methods. Our work supports further study of RP-based BCI for intuitive asynchronous environmental control or augmentative communication. PMID:28596725
A new augmentation based algorithm for extracting maximal chordal subgraphs

DOE PAGES

Bhowmick, Sanjukta; Chen, Tzu-Yi; Halappanavar, Mahantesh

2014-10-18

If every cycle of a graph is chordal length greater than three then it contains an edge between non-adjacent vertices. Chordal graphs are of interest both theoretically, since they admit polynomial time solutions to a range of NP-hard graph problems, and practically, since they arise in many applications including sparse linear algebra, computer vision, and computational biology. A maximal chordal subgraph is a chordal subgraph that is not a proper subgraph of any other chordal subgraph. Existing algorithms for computing maximal chordal subgraphs depend on dynamically ordering the vertices, which is an inherently sequential process and therefore limits the algorithms’more » parallelizability. In our paper we explore techniques to develop a scalable parallel algorithm for extracting a maximal chordal subgraph. We demonstrate that an earlier attempt at developing a parallel algorithm may induce a non-optimal vertex ordering and is therefore not guaranteed to terminate with a maximal chordal subgraph. We then give a new algorithm that first computes and then repeatedly augments a spanning chordal subgraph. After proving that the algorithm terminates with a maximal chordal subgraph, we then demonstrate that this algorithm is more amenable to parallelization and that the parallel version also terminates with a maximal chordal subgraph. That said, the complexity of the new algorithm is higher than that of the previous parallel algorithm, although the earlier algorithm computes a chordal subgraph which is not guaranteed to be maximal. Finally, we experimented with our augmentation-based algorithm on both synthetic and real-world graphs. We provide scalability results and also explore the effect of different choices for the initial spanning chordal subgraph on both the running time and on the number of edges in the maximal chordal subgraph.« less
A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations

PubMed Central

Ho, ThienLuan; Oh, Seung-Rohk

2017-01-01

Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPUs warp share data using warp-shuffle operation instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experiment results for real DNA packages revealed that the performance of the proposed algorithm and its implementation archived up to 122.64 and 1.53 times compared to that of sequential algorithm on CPU and previous parallel approximate string matching algorithm on GPUs, respectively. PMID:29016700

A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs

DOE PAGES

Azad, Ariful; Buluç, Aydın

2016-05-16

We describe parallel algorithms for computing maximal cardinality matching in a bipartite graph on distributed-memory systems. Unlike traditional algorithms that match one vertex at a time, our algorithms process many unmatched vertices simultaneously using a matrix-algebraic formulation of maximal matching. This generic matrix-algebraic framework is used to develop three efficient maximal matching algorithms with minimal changes. The newly developed algorithms have two benefits over existing graph-based algorithms. First, unlike existing parallel algorithms, cardinality of matching obtained by the new algorithms stays constant with increasing processor counts, which is important for predictable and reproducible performance. Second, relying on bulk-synchronous matrix operations,more » these algorithms expose a higher degree of parallelism on distributed-memory platforms than existing graph-based algorithms. We report high-performance implementations of three maximal matching algorithms using hybrid OpenMP-MPI and evaluate the performance of these algorithm using more than 35 real and randomly generated graphs. On real instances, our algorithms achieve up to 200 × speedup on 2048 cores of a Cray XC30 supercomputer. Even higher speedups are obtained on larger synthetically generated graphs where our algorithms show good scaling on up to 16,384 cores.« less
Parallel grid generation algorithm for distributed memory computers

NASA Technical Reports Server (NTRS)

Moitra, Stuti; Moitra, Anutosh

1994-01-01

A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
Best-First Heuristic Search for Multicore Machines

DTIC Science & Technology

2010-01-01

Otto, 1998) to implement an asynchronous version of PRA* that they call Hash Distributed A* ( HDA *). HDA * distributes nodes using a hash function in...nodes which are being communicated between peers are in transit. In contact with the authors of HDA *, we have created an implementation of HDA * for...Also, our implementation of HDA * allows us to make a fair comparison between algorithms by sharing common data structures such as priority queues and
Distributed Time Synchronization Algorithms and Opinion Dynamics

NASA Astrophysics Data System (ADS)

Manita, Anatoly; Manita, Larisa

2018-01-01

We propose new deterministic and stochastic models for synchronization of clocks in nodes of distributed networks. An external accurate time server is used to ensure convergence of the node clocks to the exact time. These systems have much in common with mathematical models of opinion formation in multiagent systems. There is a direct analogy between the time server/node clocks pair in asynchronous networks and the leader/follower pair in the context of social network models.
Parallel processing considerations for image recognition tasks

NASA Astrophysics Data System (ADS)

Simske, Steven J.

2011-01-01

Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows-as diverse as optical character recognition [OCR], document classification and barcode reading-to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
A privacy-preserving parallel and homomorphic encryption scheme

NASA Astrophysics Data System (ADS)

Min, Zhaoe; Yang, Geng; Shi, Jingqi

2017-04-01

In order to protect data privacy whilst allowing efficient access to data in multi-nodes cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm, in which plaintext is divided into several blocks and blocks are encrypted with a parallel mode. Experiment results demonstrate that the encryption algorithm can reach a speed-up ratio at about 7.1 in the MapReduce environment with 16 cores and 4 nodes.
A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics

NASA Astrophysics Data System (ADS)

Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

2017-10-01

Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
A parallel row-based algorithm with error control for standard-cell replacement on a hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Sargent, Jeff Scott

1988-01-01

A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel new approaches to controlling error in parallel cell-placement algorithms; Heuristic Cell-Coloring and Adaptive (Parallel Move) Sequence Control. Heuristic Cell-Coloring identifies sets of noninteracting cells that can be moved repeatedly, and in parallel, with no buildup of error in the placement cost. Adaptive Sequence Control allows multiple parallel cell moves to take place between global cell-position updates. This feedback mechanism is based on an error bound derived analytically from the traditional annealing move-acceptance profile. Placement results are presented for real industry circuits and the performance is summarized of an implementation on the Intel iPSC/2 Hypercube. The runtime of this algorithm is 5 to 16 times faster than a previous program developed for the Hypercube, while producing equivalent quality placement. An integrated place and route program for the Intel iPSC/2 Hypercube is currently being developed.
Parallel algorithms for mapping pipelined and parallel computations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1988-01-01

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
Efficient sequential and parallel algorithms for record linkage

PubMed Central

Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert; Rajasekaran, Sanguthevar

2014-01-01

Background and objective Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any number of datasets and outperform previous algorithms. Methods Our algorithms employ hierarchical clustering algorithms as the basis. A key idea that we use is radix sorting on certain attributes to eliminate identical records before any further processing. Another novel idea is to form a graph that links similar records and find the connected components. Results Our sequential and parallel algorithms have been tested on a real dataset of 1 083 878 records and synthetic datasets ranging in size from 50 000 to 9 000 000 records. Our sequential algorithm runs at least two times faster, for any dataset, than the previous best-known algorithm, the two-phase algorithm using faster computation of the edit distance (TPA (FCED)). The speedups obtained by our parallel algorithm are almost linear. For example, we get a speedup of 7.5 with 8 cores (residing in a single node), 14.1 with 16 cores (residing in two nodes), and 26.4 with 32 cores (residing in four nodes). Conclusions We have compared the performance of our sequential algorithm with TPA (FCED) and found that our algorithm outperforms the previous one. The accuracy is the same as that of this previous best-known algorithm. PMID:24154837
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database

DOE PAGES

Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...

2013-01-01

Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm of creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in amore » 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n number of ghost layers up to a point where the whole partitioned mesh can be ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.« less
Parallelization of a blind deconvolution algorithm

NASA Astrophysics Data System (ADS)

Matson, Charles L.; Borelli, Kathy J.

2006-09-01

Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.
Algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations with the use of parallel computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moryakov, A. V., E-mail: sailor@orc.ru

2016-12-15

An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
Parallel, stochastic measurement of molecular surface area.

PubMed

Juba, Derek; Varshney, Amitabh

2008-08-01

Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

NASA Technical Reports Server (NTRS)

Weeks, Cindy Lou

1986-01-01

Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
Parallel algorithm for determining motion vectors in ice floe images by matching edge features

NASA Technical Reports Server (NTRS)

Manohar, M.; Ramapriyan, H. K.; Strong, J. P.

1988-01-01

A parallel algorithm is described to determine motion vectors of ice floes using time sequences of images of the Arctic ocean obtained from the Synthetic Aperture Radar (SAR) instrument flown on-board the SEASAT spacecraft. Researchers describe a parallel algorithm which is implemented on the MPP for locating corresponding objects based on their translationally and rotationally invariant features. The algorithm first approximates the edges in the images by polygons or sets of connected straight-line segments. Each such edge structure is then reduced to a seed point. Associated with each seed point are the descriptions (lengths, orientations and sequence numbers) of the lines constituting the corresponding edge structure. A parallel matching algorithm is used to match packed arrays of such descriptions to identify corresponding seed points in the two images. The matching algorithm is designed such that fragmentation and merging of ice floes are taken into account by accepting partial matches. The technique has been demonstrated to work on synthetic test patterns and real image pairs from SEASAT in times ranging from .5 to 0.7 seconds for 128 x 128 images.
A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform

PubMed Central

Yang, Lin; Gong, Leiguang; Zhang, Hong; Nosher, John L.; Foran, David J.

2013-01-01

Point matching is crucial for many computer vision applications. Establishing the correspondence between a large number of data points is a computationally intensive process. Some point matching related applications, such as medical image registration, require real time or near real time performance if applied to critical clinical applications like image assisted surgery. In this paper, we report a new multicore platform based parallel algorithm for fast point matching in the context of landmark based medical image registration. We introduced a non-regular data partition algorithm which utilizes the K-means clustering algorithm to group the landmarks based on the number of available processing cores, which optimize the memory usage and data transfer. We have tested our method using the IBM Cell Broadband Engine (Cell/B.E.) platform. The results demonstrated a significant speed up over its sequential implementation. The proposed data partition and parallelization algorithm, though tested only on one multicore platform, is generic by its design. Therefore the parallel algorithm can be extended to other computing platforms, as well as other point matching related applications. PMID:24308014
Implementation and analysis of a Navier-Stokes algorithm on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1988-01-01

The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Solving very large, sparse linear systems on mesh-connected parallel computers

NASA Technical Reports Server (NTRS)

Opsahl, Torstein; Reif, John

1987-01-01

The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation

PubMed Central

Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho

2014-01-01

The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting. PMID:27081299

Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation.

PubMed

Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho

2014-11-01

The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting.
An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vydyanathan, Naga; Krishnamoorthy, Sriram; Sabin, Gerald M.

2009-08-01

Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application-tasks with dependences. These applications exhibit both task- and data-parallelism, and combining these two (also called mixedparallelism), has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task- and data-parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisionsmore » are made in an integrated manner and are based on several factors such as the structure of the taskgraph, the runtime estimates and scalability characteristics of the tasks and the inter-task data communication volumes. A locality conscious scheduling strategy is used to improve inter-task data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently generates schedules with lower makespan as compared to CPR and CPA, two previously proposed scheduling algorithms. Our algorithm also produces schedules that have lower makespan than pure taskand data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.« less
Saturation: An efficient iteration strategy for symbolic state-space generation

NASA Technical Reports Server (NTRS)

Ciardo, Gianfranco; Luettgen, Gerald; Siminiceanu, Radu; Bushnell, Dennis M. (Technical Monitor)

2001-01-01

This paper presents a novel algorithm for generating state spaces of asynchronous systems using Multi-valued Decision Diagrams. In contrast to related work, the next-state function of a system is not encoded as a single Boolean function, but as cross-products of integer functions. This permits the application of various iteration strategies to build a system's state space. In particular, this paper introduces a new elegant strategy, called saturation, and implements it in the tool SMART. On top of usually performing several orders of magnitude faster than existing BDD-based state-space generators, the algorithm's required peak memory is often close to the nal memory needed for storing the overall state spaces.
Microwave-Based Reaction Screening: Tandem Retro-Diels-Alder/Diels-Alder Cycloadditions of ortho-Quinol Dimers

PubMed Central

Dong, Suwei; Cahill, Kath arine J.; Kang, Moon -Il; Colburn, Nancy H.; Henrich, Curtis J.; Wilson, Jennifer A.; Beutler, John A.; Johnson, Richard P.; Porco, John A.

2011-01-01

We have accomplished a parallel screen of cycloaddition partners for ortho-quinols utilizing a plate-based microwave system. Microwave irradiation improves the efficiency of retro-Diels-Alder/Diels-Alder cascades of ortho-quinol dimers which generally proceed in a diastereoselective fashion. Computational studies indicate that asynchronous transition states are favored in Diels-Alder cycloadditions of ortho-quinols. Subsequent biological evaluation of a collection of cycloadducts has identified an inhibitor of activator protein-1 (AP-1), an oncogenic transcription factor. PMID:21942286
A Parallel Compact Multi-Dimensional Numerical Algorithm with Aeroacoustics Applications

NASA Technical Reports Server (NTRS)

Povitsky, Alex; Morris, Philip J.

1999-01-01

In this study we propose a novel method to parallelize high-order compact numerical algorithms for the solution of three-dimensional PDEs (Partial Differential Equations) in a space-time domain. For this numerical integration most of the computer time is spent in computation of spatial derivatives at each stage of the Runge-Kutta temporal update. The most efficient direct method to compute spatial derivatives on a serial computer is a version of Gaussian elimination for narrow linear banded systems known as the Thomas algorithm. In a straightforward pipelined implementation of the Thomas algorithm processors are idle due to the forward and backward recurrences of the Thomas algorithm. To utilize processors during this time, we propose to use them for either non-local data independent computations, solving lines in the next spatial direction, or local data-dependent computations by the Runge-Kutta method. To achieve this goal, control of processor communication and computations by a static schedule is adopted. Thus, our parallel code is driven by a communication and computation schedule instead of the usual "creative, programming" approach. The obtained parallelization speed-up of the novel algorithm is about twice as much as that for the standard pipelined algorithm and close to that for the explicit DRP algorithm.
Relation of Parallel Discrete Event Simulation algorithms with physical models

NASA Astrophysics Data System (ADS)

Shchur, L. N.; Shchur, L. V.

2015-09-01

We extend concept of local simulation times in parallel discrete event simulation (PDES) in order to take into account architecture of the current hardware and software in high-performance computing. We shortly review previous research on the mapping of PDES on physical problems, and emphasise how physical results may help to predict parallel algorithms behaviour.
Highly parallel sparse Cholesky factorization

NASA Technical Reports Server (NTRS)

Gilbert, John R.; Schreiber, Robert

1990-01-01

Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.
Mining algorithm for association rules in big data based on Hadoop

NASA Astrophysics Data System (ADS)

Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying

2018-04-01

In order to solve the problem that the traditional association rules mining algorithm has been unable to meet the mining needs of large amount of data in the aspect of efficiency and scalability, take FP-Growth as an example, the algorithm is realized in the parallelization based on Hadoop framework and Map Reduce model. On the basis, it is improved using the transaction reduce method for further enhancement of the algorithm's mining efficiency. The experiment, which consists of verification of parallel mining results, comparison on efficiency between serials and parallel, variable relationship between mining time and node number and between mining time and data amount, is carried out in the mining results and efficiency by Hadoop clustering. Experiments show that the paralleled FP-Growth algorithm implemented is able to accurately mine frequent item sets, with a better performance and scalability. It can be better to meet the requirements of big data mining and efficiently mine frequent item sets and association rules from large dataset.
A heuristic for suffix solutions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bilgory, A.; Gajski, D.D.

1986-01-01

The suffix problem has appeared in solutions of recurrence systems for parallel and pipelined machines and more recently in the design of gate and silicon compilers. In this paper the authors present two algorithms. The first algorithm generates parallel suffix solutions with minimum cost for a given length, time delay, availability of initial values, and fanout. This algorithm generates a minimal solution for any length n and depth range log/sub 2/ N to N. The second algorithm reduces the size of the solutions generated by the first algorithm.
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.

PubMed

Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim

2016-08-01

In this paper, we propose an GPU based Cloud system for high-performance arrhythmia detection. Pan-Tompkins algorithm is used for QRS detection and we optimized beat classification algorithm with K-Nearest Neighbor (K-NN). To support high performance beat classification on the system, we parallelized beat classification algorithm with CUDA to execute the algorithm on virtualized GPU devices on the Cloud system. MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved about 93.5% of detection rate which is comparable to previous researches while our algorithm shows 2.5 times faster execution time compared to CPU only detection algorithm.
Parallel Clustering Algorithm for Large-Scale Biological Data Sets

PubMed Central

Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang

2014-01-01

Backgrounds Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Result A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
Parallel-Processing Test Bed For Simulation Software

NASA Technical Reports Server (NTRS)

Blech, Richard; Cole, Gary; Townsend, Scott

1996-01-01

Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
Poster — Thur Eve — 74: Distributed, asynchronous, reactive dosimetric and outcomes analysis using DICOMautomaton

DOE Office of Scientific and Technical Information (OSTI.GOV)

Clark, Haley; BC Cancer Agency, Surrey, B.C.; BC Cancer Agency, Vancouver, B.C.

2014-08-15

Many have speculated about the future of computational technology in clinical radiation oncology. It has been advocated that the next generation of computational infrastructure will improve on the current generation by incorporating richer aspects of automation, more heavily and seamlessly featuring distributed and parallel computation, and providing more flexibility toward aggregate data analysis. In this report we describe how a recently created — but currently existing — analysis framework (DICOMautomaton) incorporates these aspects. DICOMautomaton supports a variety of use cases but is especially suited for dosimetric outcomes correlation analysis, investigation and comparison of radiotherapy treatment efficacy, and dose-volume computation. Wemore » describe: how it overcomes computational bottlenecks by distributing workload across a network of machines; how modern, asynchronous computational techniques are used to reduce blocking and avoid unnecessary computation; and how issues of out-of-date data are addressed using reactive programming techniques and data dependency chains. We describe internal architecture of the software and give a detailed demonstration of how DICOMautomaton could be used to search for correlations between dosimetric and outcomes data.« less
3D Hybrid Simulations of Interactions of High-Velocity Plasmoids with Obstacles

NASA Astrophysics Data System (ADS)

Omelchenko, Y. A.; Weber, T. E.; Smith, R. J.

2015-11-01

Interactions of fast plasma streams and objects with magnetic obstacles (dipoles, mirrors, etc) lie at the core of many space and laboratory plasma phenomena ranging from magnetoshells and solar wind interactions with planetary magnetospheres to compact fusion plasmas (spheromaks and FRCs) to astrophysics-in-lab experiments. Properly modeling ion kinetic, finite-Larmor radius and Hall effects is essential for describing large-scale plasma dynamics, turbulence and heating in complex magnetic field geometries. Using an asynchronous parallel hybrid code, HYPERS, we conduct 3D hybrid (particle-in-cell ion, fluid electron) simulations of such interactions under realistic conditions that include magnetic flux coils, ion-ion collisions and the Chodura resistivity. HYPERS does not step simulation variables synchronously in time but instead performs time integration by executing asynchronous discrete events: updates of particles and fields carried out as frequently as dictated by local physical time scales. Simulations are compared with data from the MSX experiment which studies the physics of magnetized collisionless shocks through the acceleration and subsequent stagnation of FRC plasmoids against a strong magnetic mirror and flux-conserving boundary.
Customizing FP-growth algorithm to parallel mining with Charm++ library

NASA Astrophysics Data System (ADS)

Puścian, Marek

2017-08-01

This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies Master Slave scheme to frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers basing on their workload. Proposed enhancements have been successfully implemented using Charm++ library. This paper discusses results of the performance of parallelized FP-growth algorithm against different datasets. The approach has been illustrated with many experiments and measurements performed using multiprocessor and multithreaded computer.
Simulated parallel annealing within a neighborhood for optimization of biomechanical systems.

PubMed

Higginson, J S; Neptune, R R; Anderson, F C

2005-09-01

Optimization problems for biomechanical systems have become extremely complex. Simulated annealing (SA) algorithms have performed well in a variety of test problems and biomechanical applications; however, despite advances in computer speed, convergence to optimal solutions for systems of even moderate complexity has remained prohibitive. The objective of this study was to develop a portable parallel version of a SA algorithm for solving optimization problems in biomechanics. The algorithm for simulated parallel annealing within a neighborhood (SPAN) was designed to minimize interprocessor communication time and closely retain the heuristics of the serial SA algorithm. The computational speed of the SPAN algorithm scaled linearly with the number of processors on different computer platforms for a simple quadratic test problem and for a more complex forward dynamic simulation of human pedaling.
A GPU-paralleled implementation of an enhanced face recognition algorithm

NASA Astrophysics Data System (ADS)

Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

2013-03-01

Face recognition algorithm based on compressed sensing and sparse representation is hotly argued in these years. The scheme of this algorithm increases recognition rate as well as anti-noise capability. However, the computational cost is expensive and has become a main restricting factor for real world applications. In this paper, we introduce a GPU-accelerated hybrid variant of face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out parallel optimization design to take full advantage of many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample size. Finally, Our pFRA, implemented with NVIDIA GPU and Computer Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
Conjugate-Gradient Algorithms For Dynamics Of Manipulators

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheid, Robert E.

1993-01-01

Algorithms for serial and parallel computation of forward dynamics of multiple-link robotic manipulators by conjugate-gradient method developed. Parallel algorithms have potential for speedup of computations on multiple linked, specialized processors implemented in very-large-scale integrated circuits. Such processors used to stimulate dynamics, possibly faster than in real time, for purposes of planning and control.
Evolution of a minimal parallel programming model

DOE PAGES

Lusk, Ewing; Butler, Ralph; Pieper, Steven C.

2017-04-30

Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generalitymore » and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.« less
An Analysis of Performance Enhancement Techniques for Overset Grid Applications

NASA Technical Reports Server (NTRS)

Djomehri, J. J.; Biswas, R.; Potsdam, M.; Strawn, R. C.; Biegel, Bryan (Technical Monitor)

2002-01-01

The overset grid methodology has significantly reduced time-to-solution of high-fidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process resolves the geometrical complexity of the problem domain by using separately generated but overlapping structured discretization grids that periodically exchange information through interpolation. However, high performance computations of such large-scale realistic applications must be handled efficiently on state-of-the-art parallel supercomputers. This paper analyzes the effects of various performance enhancement techniques on the parallel efficiency of an overset grid Navier-Stokes CFD application running on an SGI Origin2000 machine. Specifically, the role of asynchronous communication, grid splitting, and grid grouping strategies are presented and discussed. Results indicate that performance depends critically on the level of latency hiding and the quality of load balancing across the processors.

Advanced Technology and Mitigation (ATDM) SPARC Re-Entry Code Fiscal Year 2017 Progress and Accomplishments for ECP.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Crozier, Paul; Howard, Micah; Rider, William J.

The SPARC (Sandia Parallel Aerodynamics and Reentry Code) will provide nuclear weapon qualification evidence for the random vibration and thermal environments created by re-entry of a warhead into the earth’s atmosphere. SPARC incorporates the innovative approaches of ATDM projects on several fronts including: effective harnessing of heterogeneous compute nodes using Kokkos, exascale-ready parallel scalability through asynchronous multi-tasking, uncertainty quantification through Sacado integration, implementation of state-of-the-art reentry physics and multiscale models, use of advanced verification and validation methods, and enabling of improved workflows for users. SPARC is being developed primarily for the Department of Energy nuclear weapon program, with additional developmentmore » and use of the code is being supported by the Department of Defense for conventional weapons programs.« less
Human Exposure to Electromagnetic Fields from Parallel Wireless Power Transfer Systems.

PubMed

Wen, Feng; Huang, Xueliang

2017-02-08

The scenario of multiple wireless power transfer (WPT) systems working closely, synchronously or asynchronously with phase difference often occurs in power supply for household appliances and electric vehicles in parking lots. Magnetic field leakage from the WPT systems is also varied due to unpredictable asynchronous working conditions. In this study, the magnetic field leakage from parallel WPT systems working with phase difference is predicted, and the induced electric field and specific absorption rate (SAR) in a human body standing in the vicinity are also evaluated. Computational results are compared with the restrictions prescribed in the regulations established to limit human exposure to time-varying electromagnetic fields (EMFs). The results show that the middle region between the two WPT coils is safer for the two WPT systems working in-phase, and the peripheral regions are safer around the WPT systems working anti-phase. Thin metallic plates larger than the WPT coils can shield the magnetic field leakage well, while smaller ones may worsen the situation. The orientation of the human body will influence the maximum magnitude of induced electric field and its distribution within the human body. The induced electric field centralizes in the trunk, groin, and genitals with only one exception: when the human body is standing right at the middle of the two WPT coils working in-phase, the induced electric field focuses on lower limbs. The SAR value in the lungs always seems to be greater than in other organs, while the value in the liver is minimal. Human exposure to EMFs meets the guidelines of the International Committee on Non-Ionizing Radiation Protection (ICNIRP), specifically reference levels with respect to magnetic field and basic restrictions on induced electric fields and SAR, as the charging power is lower than 3.1 kW and 55.5 kW, respectively. These results are positive with respect to the safe applications of parallel WPT systems working simultaneously.
Human Exposure to Electromagnetic Fields from Parallel Wireless Power Transfer Systems

PubMed Central

Wen, Feng; Huang, Xueliang

2017-01-01

The scenario of multiple wireless power transfer (WPT) systems working closely, synchronously or asynchronously with phase difference often occurs in power supply for household appliances and electric vehicles in parking lots. Magnetic field leakage from the WPT systems is also varied due to unpredictable asynchronous working conditions. In this study, the magnetic field leakage from parallel WPT systems working with phase difference is predicted, and the induced electric field and specific absorption rate (SAR) in a human body standing in the vicinity are also evaluated. Computational results are compared with the restrictions prescribed in the regulations established to limit human exposure to time-varying electromagnetic fields (EMFs). The results show that the middle region between the two WPT coils is safer for the two WPT systems working in-phase, and the peripheral regions are safer around the WPT systems working anti-phase. Thin metallic plates larger than the WPT coils can shield the magnetic field leakage well, while smaller ones may worsen the situation. The orientation of the human body will influence the maximum magnitude of induced electric field and its distribution within the human body. The induced electric field centralizes in the trunk, groin, and genitals with only one exception: when the human body is standing right at the middle of the two WPT coils working in-phase, the induced electric field focuses on lower limbs. The SAR value in the lungs always seems to be greater than in other organs, while the value in the liver is minimal. Human exposure to EMFs meets the guidelines of the International Committee on Non-Ionizing Radiation Protection (ICNIRP), specifically reference levels with respect to magnetic field and basic restrictions on induced electric fields and SAR, as the charging power is lower than 3.1 kW and 55.5 kW, respectively. These results are positive with respect to the safe applications of parallel WPT systems working simultaneously. PMID:28208709
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

NASA Astrophysics Data System (ADS)

Bylaska, Eric J.; Weare, Jonathan Q.; Weare, John H.

2013-08-01

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., Verlet algorithm), is available to propagate the system from time ti (trajectory positions and velocities xi = (ri, vi)) to time ti + 1 (xi + 1) by xi + 1 = fi(xi), the dynamics problem spanning an interval from t0…tM can be transformed into a root finding problem, F(X) = [xi - f(x(i - 1)]i = 1, M = 0, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution time/parallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H2O at the MP2/6-31G* level. Implemented in this way these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of a MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.
A parallel algorithm for generation and assembly of finite element stiffness and mass matrices

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.

1991-01-01

A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
Parallel Implementation of the Wideband DOA Algorithm on the IBM Cell BE Processor

DTIC Science & Technology

2010-05-01

Abstract—The Multiple Signal Classification ( MUSIC ) algorithm is a powerful technique for determining the Direction of Arrival (DOA) of signals...Broadband Engine Processor (Cell BE). The process of adapting the serial based MUSIC algorithm to the Cell BE will be analyzed in terms of parallelism and...using Multiple Signal Classification MUSIC algorithm [4] • Computation of Focus matrix • Computation of number of sources • Separation of Signal
On Parallel Push-Relabel based Algorithms for Bipartite Maximum Matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Langguth, Johannes; Azad, Md Ariful; Halappanavar, Mahantesh

2014-07-01

We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial (graph) problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing maximum transversal of a matrix. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a testset comprised of a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for themore » parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.« less
Implementation of a parallel protein structure alignment service on cloud.

PubMed

Hung, Che-Lun; Lin, Yaw-Ling

2013-01-01

Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform.
Implementation of a Parallel Protein Structure Alignment Service on Cloud

PubMed Central

Hung, Che-Lun; Lin, Yaw-Ling

2013-01-01

Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on state of the art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand of compute time, memory, storage and I/O along with the need of their effective management. The most resource intensive modules of the algorithm are traveltime calculations and migration summation which exhibit an inherent trade off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel system had been developed. Recently, we have worked on improving parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable to efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization

NASA Technical Reports Server (NTRS)

Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
Applications of Parallel Computation in Micro-Mechanics and Finite Element Method

NASA Technical Reports Server (NTRS)

Tan, Hui-Qian

1996-01-01

This project discusses the application of parallel computations related with respect to material analyses. Briefly speaking, we analyze some kind of material by elements computations. We call an element a cell here. A cell is divided into a number of subelements called subcells and all subcells in a cell have the identical structure. The detailed structure will be given later in this paper. It is obvious that the problem is "well-structured". SIMD machine would be a better choice. In this paper we try to look into the potentials of SIMD machine in dealing with finite element computation by developing appropriate algorithms on MasPar, a SIMD parallel machine. In section 2, the architecture of MasPar will be discussed. A brief review of the parallel programming language MPL also is given in that section. In section 3, some general parallel algorithms which might be useful to the project will be proposed. And, combining with the algorithms, some features of MPL will be discussed in more detail. In section 4, the computational structure of cell/subcell model will be given. The idea of designing the parallel algorithm for the model will be demonstrated. Finally in section 5, a summary will be given.
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, F.; Morel, M.

1989-01-01

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm is successfully implemented on a tightly coupled MIMD parallel processor. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm is investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18, and 3.61 are achieved on two, four, six, and eight processors, respectively.
Data decomposition method for parallel polygon rasterization considering load balancing

NASA Astrophysics Data System (ADS)

Zhou, Chen; Chen, Zhenjie; Liu, Yongxue; Li, Feixue; Cheng, Liang; Zhu, A.-xing; Li, Manchun

2015-12-01

It is essential to adopt parallel computing technology to rapidly rasterize massive polygon data. In parallel rasterization, it is difficult to design an effective data decomposition method. Conventional methods ignore load balancing of polygon complexity in parallel rasterization and thus fail to achieve high parallel efficiency. In this paper, a novel data decomposition method based on polygon complexity (DMPC) is proposed. First, four factors that possibly affect the rasterization efficiency were investigated. Then, a metric represented by the boundary number and raster pixel number in the minimum bounding rectangle was developed to calculate the complexity of each polygon. Using this metric, polygons were rationally allocated according to the polygon complexity, and each process could achieve balanced loads of polygon complexity. To validate the efficiency of DMPC, it was used to parallelize different polygon rasterization algorithms and tested on different datasets. Experimental results showed that DMPC could effectively parallelize polygon rasterization algorithms. Furthermore, the implemented parallel algorithms with DMPC could achieve good speedup ratios of at least 15.69 and generally outperformed conventional decomposition methods in terms of parallel efficiency and load balancing. In addition, the results showed that DMPC exhibited consistently better performance for different spatial distributions of polygons.
Adaptive receiver structures for asynchronous CDMA systems

NASA Astrophysics Data System (ADS)

Rapajic, Predrag B.; Vucetic, Branka S.

1994-05-01

Adaptive linear and decision feedback receiver structures for coherent demodulation in asynchronous code division multiple access (CDMA) systems are considered. It is assumed that the adaptive receiver has no knowledge of the signature waveforms and timing of other users. The receiver is trained by a known training sequence prior to data transmission and continuously adjusted by an adaptive algorithm during data transmission. The proposed linear receiver is as simple as a standard single-user detector receiver consisting of a matched filter with constant coefficients, but achieves essential advantages with respect to timing recovery, multiple access interference elimination, near/far effect, narrowband and frequency-selective fading interference suppression, and user privacy. An adaptive centralized decision feedback receiver has the same advantages of the linear receiver but, in addition, achieves a further improvement in multiple access interference cancellation at the expense of higher complexity. The proposed receiver structures are tested by simulation over a channel with multipath propagation, multiple access interference, narrowband interference, and additive white Gaussian noise.
A Massively Parallel Computational Method of Reading Index Files for SOAPsnv.

PubMed

Zhu, Xiaoqian; Peng, Shaoliang; Liu, Shaojie; Cui, Yingbo; Gu, Xiang; Gao, Ming; Fang, Lin; Fang, Xiaodong

2015-12-01

SOAPsnv is the software used for identifying the single nucleotide variation in cancer genes. However, its performance is yet to match the massive amount of data to be processed. Experiments reveal that the main performance bottleneck of SOAPsnv software is the pileup algorithm. The original pileup algorithm's I/O process is time-consuming and inefficient to read input files. Moreover, the scalability of the pileup algorithm is also poor. Therefore, we designed a new algorithm, named BamPileup, aiming to improve the performance of sequential read, and the new pileup algorithm implemented a parallel read mode based on index. Using this method, each thread can directly read the data start from a specific position. The results of experiments on the Tianhe-2 supercomputer show that, when reading data in a multi-threaded parallel I/O way, the processing time of algorithm is reduced to 3.9 s and the application program can achieve a speedup up to 100×. Moreover, the scalability of the new algorithm is also satisfying.
A Local Scalable Distributed Expectation Maximization Algorithm for Large Peer-to-Peer Networks

NASA Technical Reports Server (NTRS)

Bhaduri, Kanishka; Srivastava, Ashok N.

2009-01-01

This paper offers a local distributed algorithm for expectation maximization in large peer-to-peer environments. The algorithm can be used for a variety of well-known data mining tasks in a distributed environment such as clustering, anomaly detection, target tracking to name a few. This technology is crucial for many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Centralizing all or some of the data for building global models is impractical in such peer-to-peer environments because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. The distributed algorithm we have developed in this paper is provably-correct i.e. it converges to the same result compared to a similar centralized algorithm and can automatically adapt to changes to the data and the network. We show that the communication overhead of the algorithm is very low due to its local nature. This monitoring algorithm is then used as a feedback loop to sample data from the network and rebuild the model when it is outdated. We present thorough experimental results to verify our theoretical claims.
Parallel design of JPEG-LS encoder on graphics processing units

NASA Astrophysics Data System (ADS)

Duan, Hao; Fang, Yong; Huang, Bormin

2012-01-01

With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images.

PubMed

Du, Xiaogang; Dang, Jianwu; Wang, Yangping; Wang, Song; Lei, Tao

2016-01-01

The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU).
DOE Office of Scientific and Technical Information (OSTI.GOV)

Sadowski, Greg

In one form, a logic circuit includes an asynchronous logic circuit, a synchronous logic circuit, and an interface circuit coupled between the asynchronous logic circuit and the synchronous logic circuit. The asynchronous logic circuit has a plurality of asynchronous outputs for providing a corresponding plurality of asynchronous signals. The synchronous logic circuit has a plurality of synchronous inputs corresponding to the plurality of asynchronous outputs, a stretch input for receiving a stretch signal, and a clock output for providing a clock signal. The synchronous logic circuit provides the clock signal as a periodic signal but prolongs a predetermined state ofmore » the clock signal while the stretch signal is active. The asynchronous interface detects whether metastability could occur when latching any of the plurality of the asynchronous outputs of the asynchronous logic circuit using said clock signal, and activates the stretch signal while the metastability could occur.« less

A scalable parallel algorithm for multiple objective linear programs

NASA Technical Reports Server (NTRS)

Wiecek, Malgorzata M.; Zhang, Hong

1994-01-01

This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
Integrated Network Decompositions and Dynamic Programming for Graph Optimization (INDDGO)

DOE Office of Scientific and Technical Information (OSTI.GOV)

The INDDGO software package offers a set of tools for finding exact solutions to graph optimization problems via tree decompositions and dynamic programming algorithms. Currently the framework offers serial and parallel (distributed memory) algorithms for finding tree decompositions and solving the maximum weighted independent set problem. The parallel dynamic programming algorithm is implemented on top of the MADNESS task-based runtime.
Automated Handling of Garments for Pressing

DTIC Science & Technology

1991-09-30

Parallel Algorithms for 2D Kalman Filtering ................................. 47 DJ. Potter and M.P. Cline Hash Table and Sorted Array: A Case Study of... Kalman Filtering on the Connection Machine ............................ 55 MA. Palis and D.K. Krecker Parallel Sorting of Large Arrays on the MasPar...ALGORITHM’VS FOR SEAM SENSING. .. .. .. ... ... .... ..... 24 6.1 KarelTW Algorithms .. .. ... ... ... ... .... ... ...... 24 6.1.1 Image Filtering
Efficient Scalable Median Filtering Using Histogram-Based Operations.

PubMed

Green, Oded

2018-05-01

Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted value. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling with a speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, and , comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
A fast sorting algorithm for a hypersonic rarefied flow particle simulation on the connection machine

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1989-01-01

The data parallel implementation of a particle simulation for hypersonic rarefied flow described by Dagum associates a single parallel data element with each particle in the simulation. The simulated space is divided into discrete regions called cells containing a variable and constantly changing number of particles. The implementation requires a global sort of the parallel data elements so as to arrange them in an order that allows immediate access to the information associated with cells in the simulation. Described here is a very fast algorithm for performing the necessary ranking of the parallel data elements. The performance of the new algorithm is compared with that of the microcoded instruction for ranking on the Connection Machine.
Parallelization of sequential Gaussian, indicator and direct simulation algorithms

NASA Astrophysics Data System (ADS)

Nunes, Ruben; Almeida, José A.

2010-08-01

Improving the performance and robustness of algorithms on new high-performance parallel computing architectures is a key issue in efficiently performing 2D and 3D studies with large amount of data. In geostatistics, sequential simulation algorithms are good candidates for parallelization. When compared with other computational applications in geosciences (such as fluid flow simulators), sequential simulation software is not extremely computationally intensive, but parallelization can make it more efficient and creates alternatives for its integration in inverse modelling approaches. This paper describes the implementation and benchmarking of a parallel version of the three classic sequential simulation algorithms: direct sequential simulation (DSS), sequential indicator simulation (SIS) and sequential Gaussian simulation (SGS). For this purpose, the source used was GSLIB, but the entire code was extensively modified to take into account the parallelization approach and was also rewritten in the C programming language. The paper also explains in detail the parallelization strategy and the main modifications. Regarding the integration of secondary information, the DSS algorithm is able to perform simple kriging with local means, kriging with an external drift and collocated cokriging with both local and global correlations. SIS includes a local correction of probabilities. Finally, a brief comparison is presented of simulation results using one, two and four processors. All performance tests were carried out on 2D soil data samples. The source code is completely open source and easy to read. It should be noted that the code is only fully compatible with Microsoft Visual C and should be adapted for other systems/compilers.
Three-dimensional photoacoustic tomography based on graphics-processing-unit-accelerated finite element method.

PubMed

Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying

2013-12-01

Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphic-processing-unit parallel frame named the "compute unified device architecture." A series of simulation experiments is carried out to test the accuracy and accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced by a factor of 38.9 with a GTX 580 graphics card using the improved method.
Parallel Algorithms for Image Analysis.

DTIC Science & Technology

1982-06-01

8217 _ _ _ _ _ _ _ 4. TITLE (aid Subtitle) S. TYPE OF REPORT & PERIOD COVERED PARALLEL ALGORITHMS FOR IMAGE ANALYSIS TECHNICAL 6. PERFORMING O4G. REPORT NUMBER TR-1180...Continue on reverse side it neceesary aid Identlfy by block number) Image processing; image analysis ; parallel processing; cellular computers. 20... IMAGE ANALYSIS TECHNICAL 6. PERFORMING ONG. REPORT NUMBER TR-1180 - 7. AUTHOR(&) S. CONTRACT OR GRANT NUMBER(s) Azriel Rosenfeld AFOSR-77-3271 9
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation

NASA Astrophysics Data System (ADS)

Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.

2011-03-01

A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine modelizations are difficult to treat for our sequential solver, based on the simplified transport equations, in terms of memory consumption and computational time. A first implementation of a Lagrangian based domain decomposition method brings to a poor parallel efficiency because of an increase in the power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
Scalable Domain Decomposed Monte Carlo Particle Transport

NASA Astrophysics Data System (ADS)

O'Brien, Matthew Joseph

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are: • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node. • Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently. • Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain. • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition. These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems

PubMed Central

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-01-01

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Different from existing algorithms, the updating rates of the coning and sculling compensations are unrelated with the number of the gyro incremental angle samples and the number of the accelerometer incremental velocity samples. When the output sampling rate of inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, yet with more numbers of gyro incremental angle and accelerometer incremental velocity in order to improve the accuracy of system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of algorithm to meet the real-time and high precision requirements of system on the high dynamic environment, relative to the existing implemented on the DSP platform. PMID:22164058
Special purpose parallel computer architecture for real-time control and simulation in robotic applications

NASA Technical Reports Server (NTRS)

Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

1993-01-01

This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Scalable Domain Decomposed Monte Carlo Particle Transport

DOE Office of Scientific and Technical Information (OSTI.GOV)

O'Brien, Matthew Joseph

2013-12-05

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation.
Multi-GPU parallel algorithm design and analysis for improved inversion of probability tomography with gravity gradiometry data

NASA Astrophysics Data System (ADS)

Hou, Zhenlong; Huang, Danian

2017-09-01

In this paper, we make a study on the inversion of probability tomography (IPT) with gravity gradiometry data at first. The space resolution of the results is improved by multi-tensor joint inversion, depth weighting matrix and the other methods. Aiming at solving the problems brought by the big data in the exploration, we present the parallel algorithm and the performance analysis combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on Graphics Processing Unit (GPU) accelerating. In the test of the synthetic model and real data from Vinton Dome, we get the improved results. It is also proved that the improved inversion algorithm is effective and feasible. The performance of parallel algorithm we designed is better than the other ones with CUDA. The maximum speedup could be more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger scale of data and the new analysis method is practical.
Algorithms and programming tools for image processing on the MPP, part 2

NASA Technical Reports Server (NTRS)

Reeves, Anthony P.

1986-01-01

A number of algorithms were developed for image warping and pyramid image filtering. Techniques were investigated for the parallel processing of a large number of independent irregular shaped regions on the MPP. In addition some utilities for dealing with very long vectors and for sorting were developed. Documentation pages for the algorithms which are available for distribution are given. The performance of the MPP for a number of basic data manipulations was determined. From these results it is possible to predict the efficiency of the MPP for a number of algorithms and applications. The Parallel Pascal development system, which is a portable programming environment for the MPP, was improved and better documentation including a tutorial was written. This environment allows programs for the MPP to be developed on any conventional computer system; it consists of a set of system programs and a library of general purpose Parallel Pascal functions. The algorithms were tested on the MPP and a presentation on the development system was made to the MPP users group. The UNIX version of the Parallel Pascal System was distributed to a number of new sites.
Research of the effectiveness of parallel multithreaded realizations of interpolation methods for scaling raster images

NASA Astrophysics Data System (ADS)

Vnukov, A. A.; Shershnev, M. B.

2018-01-01

The aim of this work is the software implementation of three image scaling algorithms using parallel computations, as well as the development of an application with a graphical user interface for the Windows operating system to demonstrate the operation of algorithms and to study the relationship between system performance, algorithm execution time and the degree of parallelization of computations. Three methods of interpolation were studied, formalized and adapted to scale images. The result of the work is a program for scaling images by different methods. Comparison of the quality of scaling by different methods is given.
On the impact of communication complexity on the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D. B.; Van Rosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical alorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, Fred A.; Morel, Michael R.

1989-01-01

A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.
A review on quantum search algorithms

NASA Astrophysics Data System (ADS)

Giri, Pulak Ranjan; Korepin, Vladimir E.

2017-12-01

The use of superposition of states in quantum computation, known as quantum parallelism, has significant advantage in terms of speed over the classical computation. It is evident from the early invented quantum algorithms such as Deutsch's algorithm, Deutsch-Jozsa algorithm and its variation as Bernstein-Vazirani algorithm, Simon algorithm, Shor's algorithms, etc. Quantum parallelism also significantly speeds up the database search algorithm, which is important in computer science because it comes as a subroutine in many important algorithms. Quantum database search of Grover achieves the task of finding the target element in an unsorted database in a time quadratically faster than the classical computer. We review Grover's quantum search algorithms for a singe and multiple target elements in a database. The partial search algorithm of Grover and Radhakrishnan and its optimization by Korepin called GRK algorithm are also discussed.
Data parallel sorting for particle simulation

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1992-01-01

Sorting on a parallel architecture is a communications intensive event which can incur a high penalty in applications where it is required. In the case of particle simulation, only integer sorting is necessary, and sequential implementations easily attain the minimum performance bound of O (N) for N particles. Parallel implementations, however, have to cope with the parallel sorting problem which, in addition to incurring a heavy communications cost, can make the minimun performance bound difficult to attain. This paper demonstrates how the sorting problem in a particle simulation can be reduced to a merging problem, and describes an efficient data parallel algorithm to solve this merging problem in a particle simulation. The new algorithm is shown to be optimal under conditions usual for particle simulation, and its fieldwise implementation on the Connection Machine is analyzed in detail. The new algorithm is about four times faster than a fieldwise implementation of radix sort on the Connection Machine.

Message-passing-interface-based parallel FDTD investigation on the EM scattering from a 1-D rough sea surface using uniaxial perfectly matched layer absorbing boundary.

PubMed

Li, J; Guo, L-X; Zeng, H; Han, X-B

2009-06-01

A message-passing-interface (MPI)-based parallel finite-difference time-domain (FDTD) algorithm for the electromagnetic scattering from a 1-D randomly rough sea surface is presented. The uniaxial perfectly matched layer (UPML) medium is adopted for truncation of FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different processors is illustrated for one sea surface realization, and the computation time of the parallel FDTD algorithm is dramatically reduced compared to a single-process implementation. Finally, some numerical results are shown, including the backscattering characteristics of sea surface for different polarization and the bistatic scattering from a sea surface with large incident angle and large wind speed.
A Parallel Numerical Algorithm To Solve Linear Systems Of Equations Emerging From 3D Radiative Transfer

NASA Astrophysics Data System (ADS)

Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.

2016-10-01

Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.
Partitioning and packing mathematical simulation models for calculation on parallel computers

NASA Technical Reports Server (NTRS)

Arpasi, D. J.; Milner, E. J.

1986-01-01

The development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system is described. Degrees of parallelism (i.e., coupling between the equations) and their impact on parallel processing are discussed. The problem of identifying computational parallelism within sets of closely coupled equations that require the exchange of current values of variables is described. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. An algorithm which packs the equations into a minimum number of processors is also described. The results of the packing algorithm when applied to a turbojet engine model are presented in terms of processor utilization.
Parallel integer sorting with medium and fine-scale parallelism

NASA Technical Reports Server (NTRS)

Dagum, Leonardo

1993-01-01

Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
Asynchronous Distributed Flow Control Algorithms.

DTIC Science & Technology

1984-10-01

all yeX, y <x. We may think of X as a "feasible" set. The fair allocation vector solves the following set of nested problems. The first problem is to...interval parameters that can be adjusted: the time between a succesful update and the next update attempt, and the time between an unsuccessful attempt...For years she took great delight in being able to tell people , quite truthfully, that she was from Mars. In 1970, she graduated from the University of
A study of interactive control scheduling and economic assessment for robotic systems

NASA Technical Reports Server (NTRS)

1982-01-01

A class of interactive control systems is derived by generalizing interactive manipulator control systems. Tasks of interactive control systems can be represented as a network of a finite set of actions which have specific operational characteristics and specific resource requirements, and which are of limited duration. This has enabled the decomposition of the overall control algorithm simultaneously and asynchronously. The performance benefits of sensor referenced and computer-aided control of manipulators in a complex environment is evaluated. The first phase of the CURV arm control system software development and the basic features of the control algorithms and their software implementation are presented. An optimal solution for a production scheduling problem that will be easy to implement in practical situations is investigated.
A novel power harmonic analysis method based on Nuttall-Kaiser combination window double spectrum interpolated FFT algorithm

NASA Astrophysics Data System (ADS)

Jin, Tao; Chen, Yiyang; Flesch, Rodolfo C. C.

2017-11-01

Harmonics pose a great threat to safe and economical operation of power grids. Therefore, it is critical to detect harmonic parameters accurately to design harmonic compensation equipment. The fast Fourier transform (FFT) is widely used for electrical popular power harmonics analysis. However, the barrier effect produced by the algorithm itself and spectrum leakage caused by asynchronous sampling often affects the harmonic analysis accuracy. This paper examines a new approach for harmonic analysis based on deducing the modifier formulas of frequency, phase angle, and amplitude, utilizing the Nuttall-Kaiser window double spectrum line interpolation method, which overcomes the shortcomings in traditional FFT harmonic calculations. The proposed approach is verified numerically and experimentally to be accurate and reliable.
The design of mobile robot control system for the aged and the disabled

NASA Astrophysics Data System (ADS)

Qiang, Wang; Lei, Shi; Xiang, Gao; Jin, Zhang

2017-01-01

This paper designs a control system of mobile robot for the aged and the disabled, which consists of two main parts: human-computer interaction and drive control module. The data of the two parts is transferred via universal asynchronous receiver/transmitter. In the former part, the speed and direction information of the mobile robot is obtained by hall joystick. In the latter part, the electronic differential algorithm is developed to implement the robot mobile function by driving two-wheel motors. In order to improve the comfort of the robot when speed or direction is changed, the least squares algorithm is used to optimize the speed characteristic curves of the two motors. Experimental results have verified the effectiveness of the designed system.
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Weare, Jonathan Q.; Weare, John H.

2013-08-21

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f , (e.g. Verlet algorithm) is available to propagate the system from time ti (trajectory positions and velocities xi = (ri; vi)) to time ti+1 (xi+1) by xi+1 = fi(xi), the dynamics problem spanning an interval from t0 : : : tM can be transformed into a root finding problem, F(X) = [xi - f (x(i-1)]i=1;M = 0, for the trajectory variables. The root finding problem is solved using amore » variety of optimization techniques, including quasi-Newton and preconditioned quasi-Newton optimization schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed and the effectiveness of various approaches to solving the root finding problem are tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl+4H2O AIMD simulation at the MP2 level. The maximum speedup obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow TCP/IP networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl+4H2O at the MP2/6-31G* level. Implemented in this way these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. By using these algorithms we are able to reduce the cost of a MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 seconds per time step to 6.9 seconds per time step.« less
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations.

PubMed

Bylaska, Eric J; Weare, Jonathan Q; Weare, John H

2013-08-21

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., Verlet algorithm), is available to propagate the system from time ti (trajectory positions and velocities xi = (ri, vi)) to time ti + 1 (xi + 1) by xi + 1 = fi(xi), the dynamics problem spanning an interval from t0[ellipsis (horizontal)]tM can be transformed into a root finding problem, F(X) = [xi - f(x(i - 1)]i = 1, M = 0, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution/timeparallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H2O at the MP2/6-31G* level. Implemented in this way these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of a MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.
Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J., E-mail: Eric.Bylaska@pnnl.gov; Weare, Jonathan Q., E-mail: weare@uchicago.edu; Weare, John H., E-mail: jweare@ucsd.edu

2013-08-21

Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., Verlet algorithm), is available to propagate the system from time t{sub i} (trajectory positions and velocities x{sub i} = (r{sub i}, v{sub i})) to time t{sub i+1} (x{sub i+1}) by x{sub i+1} = f{sub i}(x{sub i}), the dynamics problem spanning an interval from t{sub 0}…t{sub M} can be transformed into a root finding problem, F(X) = [x{sub i} − f(x{sub (i−1})]{sub i} {sub =1,M} = 0, for themore » trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H{sub 2}O AIMD simulation at the MP2 level. The maximum speedup ((serial execution time)/(parallel execution time) ) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow transmission control protocol/Internet protocol networks. Scripts written in Python that make calls to a precompiled quantum chemistry package (NWChem) are demonstrated to provide an actual speedup of 8.2 for a 2.5 ps AIMD simulation of HCl + 4H{sub 2}O at the MP2/6-31G* level. Implemented in this way these algorithms can be used for long time high-level AIMD simulations at a modest cost using machines connected by very slow networks such as WiFi, or in different time zones connected by the Internet. The algorithms can also be used with programs that are already parallel. Using these algorithms, we are able to reduce the cost of a MP2/6-311++G(2d,2p) simulation that had reached its maximum possible speedup in the parallelization of the electronic structure calculation from 32 s/time step to 6.9 s/time step.« less
Parallel Processing of Broad-Band PPM Signals

NASA Technical Reports Server (NTRS)

Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

2010-01-01

A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for timeslot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple parallel narrower-band signals by means of sub-sampling and filtering. The number of parallel streams is chosen so that the frequency content of the narrower-band signals is low enough to enable processing by relatively-low speed complementary metal oxide semiconductor (CMOS) electronic circuitry. The algorithm and architecture are intended to satisfy requirements for time-varying time-slot synchronization and post-detection filtering, with correction of timing errors independent of estimation of timing errors. They are also intended to afford flexibility for dynamic reconfiguration and upgrading. The architecture is implemented in a reconfigurable CMOS processor in the form of a field-programmable gate array. The algorithm and its hardware implementation incorporate three separate time-varying filter banks for three distinct functions: correction of sub-sample timing errors, post-detection filtering, and post-detection estimation of timing errors. The design of the filter bank for correction of timing errors, the method of estimating timing errors, and the design of a feedback-loop filter are governed by a host of parameters, the most critical one, with regard to processing very broadband signals with CMOS hardware, being the number of parallel streams (equivalently, the rate-reduction parameter).
The monophasic action potential upstroke: a means of characterizing local conduction.

PubMed

Levine, J H; Moore, E N; Kadish, A H; Guarnieri, T; Spear, J F

1986-11-01

The upstrokes of monophasic action potentials (MAPs) recorded with an extracellular pressure electrode were characterized in isolated canine tissue preparations in vitro. The characteristics of the MAP upstroke were compared with those of the local action potential foot as well as with the characteristics of approaching electrical activation during uniform and asynchronous conduction. The upstroke of the MAP was exponential during uniform conduction. The time constant of rise of the MAP upstroke (TMAP) correlated with that of the action potential foot (Tfoot): TMAP + 1.01 Tfoot + 0.50; r2 = .80. Furthermore, changes in Tfoot with alterations in cycle length were associated with similar changes in TMAP: Tfoot = 1.06 TMAP - 0.11; r2 = .78. In addition, TMAP and Tfoot both deviated from exponential during asynchronous activation; the inflections that developed in the MAP upstroke correlated in time with intracellular action potential upstrokes that were asynchronous in onset in these tissues. Finally, the field of view of the MAP was determined and was found to be dependent in part on tissue architecture and the space constant. Specifically, the field of view of the MAP was found to be greater parallel compared with transverse to fiber orientation (6.02 +/- 1.74 vs 3.03 +/- 1.10 mm; p less than .01). These data suggest that the MAP upstroke may be used to define and characterize local electrical activation. The relatively large field of view of the MAP suggests that this technique may be a sensitive means to record focal membrane phenomena in vivo.
Acoustic simulation in architecture with parallel algorithm

NASA Astrophysics Data System (ADS)

Li, Xiaohong; Zhang, Xinrong; Li, Dan

2004-03-01

In allusion to complexity of architecture environment and Real-time simulation of architecture acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in scene is solved with this method. And then the impulse response between sources and receivers at frequency segment, which are calculated with multi-process, are combined into whole frequency response. The numerical experiment shows that parallel arithmetic can improve the acoustic simulating efficiency of complex scene.
Parallel eigenanalysis of finite element models in a completely connected architecture

NASA Technical Reports Server (NTRS)

Akl, F. A.; Morel, M. R.

1989-01-01

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi) = (M)(phi)(omega), where (K) and (M) are of order N, and (omega) is order of q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented and the performance of its subroutines in parallel environment is analyzed.
Highly Parallel Alternating Directions Algorithm for Time Dependent Problems

NASA Astrophysics Data System (ADS)

Ganzha, M.; Georgiev, K.; Lirkov, I.; Margenov, S.; Paprzycki, M.

2011-11-01

In our work, we consider the time dependent Stokes equation on a finite time interval and on a uniform rectangular mesh, written in terms of velocity and pressure. For this problem, a parallel algorithm based on a novel direction splitting approach is developed. Here, the pressure equation is derived from a perturbed form of the continuity equation, in which the incompressibility constraint is penalized in a negative norm induced by the direction splitting. The scheme used in the algorithm is composed of two parts: (i) velocity prediction, and (ii) pressure correction. This is a Crank-Nicolson-type two-stage time integration scheme for two and three dimensional parabolic problems in which the second-order derivative, with respect to each space variable, is treated implicitly while the other variable is made explicit at each time sub-step. In order to achieve a good parallel performance the solution of the Poison problem for the pressure correction is replaced by solving a sequence of one-dimensional second order elliptic boundary value problems in each spatial direction. The parallel code is implemented using the standard MPI functions and tested on two modern parallel computer systems. The performed numerical tests demonstrate good level of parallel efficiency and scalability of the studied direction-splitting-based algorithm.
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm

NASA Astrophysics Data System (ADS)

Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian

2017-03-01

DNA is one of the carrier of genetic information of living organisms. Encoding, sequencing, and clustering DNA sequences has become the key jobs and routine in the world of molecular biology, in particular on bioinformatics application. There are two type of clustering, hierarchical clustering and partitioning clustering. In this paper, we combined two type clustering i.e. K-Means (partitioning clustering) and DIANA (hierarchical clustering), therefore it called Hybrid clustering. Application of hybrid clustering using Parallel K-Means algorithm and DIANA algorithm used to clustering DNA sequences of Human Papillomavirus (HPV). The clustering process is started with Collecting DNA sequences of HPV are obtained from NCBI (National Centre for Biotechnology Information), then performing characteristics extraction of DNA sequences. The characteristics extraction result is store in a matrix form, then normalize this matrix using Min-Max normalization and calculate genetic distance using Euclidian Distance. Furthermore, the hybrid clustering is applied by using implementation of Parallel K-Means algorithm and DIANA algorithm. The aim of using Hybrid Clustering is to obtain better clusters result. For validating the resulted clusters, to get optimum number of clusters, we use Davies-Bouldin Index (DBI). In this study, the result of implementation of Parallel K-Means clustering is data clustered become 5 clusters with minimal IDB value is 0.8741, and Hybrid Clustering clustered data become 13 sub-clusters with minimal IDB values = 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The IDB value of hybrid clustering less than IBD value of Parallel K-Means clustering only that perform at 1ts stage. Its means clustering using Hybrid Clustering have the better result to clustered DNA sequence of HPV than perform parallel K-Means Clustering only.
FPGA implementation of sparse matrix algorithm for information retrieval

NASA Astrophysics Data System (ADS)

Bojanic, Slobodan; Jevtic, Ruzica; Nieto-Taladriz, Octavio

2005-06-01

Information text data retrieval requires a tremendous amount of processing time because of the size of the data and the complexity of information retrieval algorithms. In this paper the solution to this problem is proposed via hardware supported information retrieval algorithms. Reconfigurable computing may adopt frequent hardware modifications through its tailorable hardware and exploits parallelism for a given application through reconfigurable and flexible hardware units. The degree of the parallelism can be tuned for data. In this work we implemented standard BLAS (basic linear algebra subprogram) sparse matrix algorithm named Compressed Sparse Row (CSR) that is showed to be more efficient in terms of storage space requirement and query-processing timing over the other sparse matrix algorithms for information retrieval application. Although inverted index algorithm is treated as the de facto standard for information retrieval for years, an alternative approach to store the index of text collection in a sparse matrix structure gains more attention. This approach performs query processing using sparse matrix-vector multiplication and due to parallelization achieves a substantial efficiency over the sequential inverted index. The parallel implementations of information retrieval kernel are presented in this work targeting the Virtex II Field Programmable Gate Arrays (FPGAs) board from Xilinx. A recent development in scientific applications is the use of FPGA to achieve high performance results. Computational results are compared to implementations on other platforms. The design achieves a high level of parallelism for the overall function while retaining highly optimised hardware within processing unit.
Implementation of a fully-balanced periodic tridiagonal solver on a parallel distributed memory architecture

NASA Technical Reports Server (NTRS)

Eidson, T. M.; Erlebacher, G.

1994-01-01

While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solutions is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm allow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
Parallel fuzzy connected image segmentation on GPU

PubMed Central

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

2011-01-01

Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA’s compute unified device Architecture (cuda) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as cuda kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037

Parallel fuzzy connected image segmentation on GPU.

PubMed

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

2011-07-01

Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's compute unified device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
New algorithm for tensor contractions on multi-core CPUs, GPUs, and accelerators enables CCSD and EOM-CCSD calculations with over 1000 basis functions on a single compute node.

PubMed

Kaliman, Ilya A; Krylov, Anna I

2017-04-30

A new hardware-agnostic contraction algorithm for tensors of arbitrary symmetry and sparsity is presented. The algorithm is implemented as a stand-alone open-source code libxm. This code is also integrated with general tensor library libtensor and with the Q-Chem quantum-chemistry package. An overview of the algorithm, its implementation, and benchmarks are presented. Similarly to other tensor software, the algorithm exploits efficient matrix multiplication libraries and assumes that tensors are stored in a block-tensor form. The distinguishing features of the algorithm are: (i) efficient repackaging of the individual blocks into large matrices and back, which affords efficient graphics processing unit (GPU)-enabled calculations without modifications of higher-level codes; (ii) fully asynchronous data transfer between disk storage and fast memory. The algorithm enables canonical all-electron coupled-cluster and equation-of-motion coupled-cluster calculations with single and double substitutions (CCSD and EOM-CCSD) with over 1000 basis functions on a single quad-GPU machine. We show that the algorithm exhibits predicted theoretical scaling for canonical CCSD calculations, O(N 6 ), irrespective of the data size on disk. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Experiments with a Parallel Multi-Objective Evolutionary Algorithm for Scheduling

NASA Technical Reports Server (NTRS)

Brown, Matthew; Johnston, Mark D.

2013-01-01

Evolutionary multi-objective algorithms have great potential for scheduling in those situations where tradeoffs among competing objectives represent a key requirement. One challenge, however, is runtime performance, as a consequence of evolving not just a single schedule, but an entire population, while attempting to sample the Pareto frontier as accurately and uniformly as possible. The growing availability of multi-core processors in end user workstations, and even laptops, has raised the question of the extent to which such hardware can be used to speed up evolutionary algorithms. In this paper we report on early experiments in parallelizing a Generalized Differential Evolution (GDE) algorithm for scheduling long-range activities on NASA's Deep Space Network. Initial results show that significant speedups can be achieved, but that performance does not necessarily improve as more cores are utilized. We describe our preliminary results and some initial suggestions from parallelizing the GDE algorithm. Directions for future work are outlined.
A biconjugate gradient type algorithm on massively parallel architectures

NASA Technical Reports Server (NTRS)

Freund, Roland W.; Hochbruck, Marlis

1991-01-01

The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
Comparing an FPGA to a Cell for an Image Processing Application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ngo, Hau; Broussard, Randy P.; Ives, Robert W.

2010-12-01

Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to discover the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, Iris Recognition Systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.
Acoustooptic linear algebra processors - Architectures, algorithms, and applications

NASA Technical Reports Server (NTRS)

Casasent, D.

1984-01-01

Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
NASA Tech Briefs, October 2011

NASA Technical Reports Server (NTRS)

2011-01-01

Topics covered include: Laser Truss Sensor for Segmented Telescope Phasing; Qualifications of Bonding Process of Temperature Sensors to Deep-Space Missions; Optical Sensors for Monitoring Gamma and Neutron Radiation; Compliant Tactile Sensors; Cytometer on a Chip; Measuring Input Thresholds on an Existing Board; Scanning and Defocusing Properties of Microstrip Reflectarray Antennas; Cable Tester Box; Programmable Oscillator; Fault-Tolerant, Radiation-Hard DSP; Sub-Shot Noise Power Source for Microelectronics; Asynchronous Message Service Reference Implementation; Zero-Copy Objects System; Delay and Disruption Tolerant Networking MACHETE Model; Contact Graph Routing; Parallel Eclipse Project Checkout; Technique for Configuring an Actively Cooled Thermal Shield in a Flight System; Use of Additives to Improve Performance of Methyl Butyrate-Based Lithium-Ion Electrolytes; Li-Ion Cells Employing Electrolytes with Methyl Propionate and Ethyl Butyrate Co-Solvents; Improved Devices for Collecting Sweat for Chemical Analysis; Tissue Photolithography; Method for Impeding Degradation of Porous Silicon Structures; External Cooling Coupled to Reduced Extremity Pressure Device; A Zero-Gravity Cup for Drinking Beverages in Microgravity; Co-Flow Hollow Cathode Technology; Programmable Aperture with MEMS Microshutter Arrays; Polished Panel Optical Receiver for Simultaneous RF/Optical Telemetry with Large DSN Antennas; Adaptive System Modeling for Spacecraft Simulation; Lidar-Based Navigation Algorithm for Safe Lunar Landing; Tracking Object Existence From an Autonomous Patrol Vehicle; Rad-Hard, Miniaturized, Scalable, High-Voltage Switching Module for Power Applications; and Architecture for a 1-GHz Digital RADAR.
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation.

PubMed

Hess, Berk; Kutzner, Carsten; van der Spoel, David; Lindahl, Erik

2008-03-01

Molecular simulation is an extremely useful, but computationally very expensive tool for studies of chemical and biomolecular systems. Here, we present a new implementation of our molecular simulation toolkit GROMACS which now both achieves extremely high performance on single processors from algorithmic optimizations and hand-coded routines and simultaneously scales very well on parallel machines. The code encompasses a minimal-communication domain decomposition algorithm, full dynamic load balancing, a state-of-the-art parallel constraint solver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations also in parallel. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, we have in addition used a Multiple-Program, Multiple-Data approach, with separate node domains responsible for direct and reciprocal space interactions. Not only does this combination of algorithms enable extremely long simulations of large systems but also it provides that simulation performance on quite modest numbers of standard cluster nodes.
A fast parallel clustering algorithm for molecular simulation trajectories.

PubMed

Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui

2013-01-15

We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilize the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. © 2012 Wiley Periodicals, Inc. Copyright © 2012 Wiley Periodicals, Inc.
Parallel Monotonic Basin Hopping for Low Thrust Trajectory Optimization

NASA Technical Reports Server (NTRS)

McCarty, Steven L.; McGuire, Melissa L.

2018-01-01

Monotonic Basin Hopping has been shown to be an effective method of solving low thrust trajectory optimization problems. This paper outlines an extension to the common serial implementation by parallelizing it over any number of available compute cores. The Parallel Monotonic Basin Hopping algorithm described herein is shown to be an effective way to more quickly locate feasible solutions, and improve locally optimal solutions in an automated way without requiring a feasible initial guess. The increased speed achieved through parallelization enables the algorithm to be applied to more complex problems that would otherwise be impractical for a serial implementation. Low thrust cislunar transfers and a hybrid Mars example case demonstrate the effectiveness of the algorithm. Finally, a preliminary scaling study quantifies the expected decrease in solve time compared to a serial implementation.,
UPC++ Programmer’s Guide, v1.0-2018.3.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bachan, J.; Baden, S.; Bonachea, Dan

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operationsmore » that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.« less
Real time software tools and methodologies

NASA Technical Reports Server (NTRS)

Christofferson, M. J.

1981-01-01

Real time systems are characterized by high speed processing and throughput as well as asynchronous event processing requirements. These requirements give rise to particular implementations of parallel or pipeline multitasking structures, of intertask or interprocess communications mechanisms, and finally of message (buffer) routing or switching mechanisms. These mechanisms or structures, along with the data structue, describe the essential character of the system. These common structural elements and mechanisms are identified, their implementation in the form of routines, tasks or macros - in other words, tools are formalized. The tools developed support or make available the following: reentrant task creation, generalized message routing techniques, generalized task structures/task families, standardized intertask communications mechanisms, and pipeline and parallel processing architectures in a multitasking environment. Tools development raise some interesting prospects in the areas of software instrumentation and software portability. These issues are discussed following the description of the tools themselves.
Performance Enhancement Strategies for Multi-Block Overset Grid CFD Applications

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

The overset grid methodology has significantly reduced time-to-solution of highfidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process resolves the geometrical complexity of the problem domain by using separately generated but overlapping structured discretization grids that periodically exchange information through interpolation. However, high performance computations of such large-scale realistic applications must be handled efficiently on state-of-the-art parallel supercomputers. This paper analyzes the effects of various performance enhancement strategies on the parallel efficiency of an overset grid Navier-Stokes CFD application running on an SGI Origin2000 machinc. Specifically, the role of asynchronous communication, grid splitting, and grid grouping strategies are presented and discussed. Details of a sophisticated graph partitioning technique for grid grouping are also provided. Results indicate that performance depends critically on the level of latency hiding and the quality of load balancing across the processors.
High-speed parallel implementation of a modified PBR algorithm on DSP-based EH topology

NASA Astrophysics Data System (ADS)

Rajan, K.; Patnaik, L. M.; Ramakrishna, J.

1997-08-01

Algebraic Reconstruction Technique (ART) is an age-old method used for solving the problem of three-dimensional (3-D) reconstruction from projections in electron microscopy and radiology. In medical applications, direct 3-D reconstruction is at the forefront of investigation. The simultaneous iterative reconstruction technique (SIRT) is an ART-type algorithm with the potential of generating in a few iterations tomographic images of a quality comparable to that of convolution backprojection (CBP) methods. Pixel-based reconstruction (PBR) is similar to SIRT reconstruction, and it has been shown that PBR algorithms give better quality pictures compared to those produced by SIRT algorithms. In this work, we propose a few modifications to the PBR algorithms. The modified algorithms are shown to give better quality pictures compared to PBR algorithms. The PBR algorithm and the modified PBR algorithms are highly compute intensive, Not many attempts have been made to reconstruct objects in the true 3-D sense because of the high computational overhead. In this study, we have developed parallel two-dimensional (2-D) and 3-D reconstruction algorithms based on modified PBR. We attempt to solve the two problems encountered by the PBR and modified PBR algorithms, i.e., the long computational time and the large memory requirements, by parallelizing the algorithm on a multiprocessor system. We investigate the possible task and data partitioning schemes by exploiting the potential parallelism in the PBR algorithm subject to minimizing the memory requirement. We have implemented an extended hypercube (EH) architecture for the high-speed execution of the 3-D reconstruction algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) and dual-port random access memories (DPR) as channels between the PEs. We discuss and compare the performances of the PBR algorithm on an IBM 6000 RISC workstation, on a Silicon Graphics Indigo 2 workstation, and on an EH system. The results show that an EH(3,1) using DSP chips as PEs executes the modified PBR algorithm about 100 times faster than an LBM 6000 RISC workstation. We have executed the algorithms on a 4-node IBM SP2 parallel computer. The results show that execution time of the algorithm on an EH(3,1) is better than that of a 4-node IBM SP2 system. The speed-up of an EH(3,1) system with eight PEs and one network controller is approximately 7.85.
Design considerations for parallel graphics libraries

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.

1994-01-01

Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.
Target-type probability combining algorithms for multisensor tracking

NASA Astrophysics Data System (ADS)

Wigren, Torbjorn

2001-08-01

Algorithms for the handing of target type information in an operational multi-sensor tracking system are presented. The paper discusses recursive target type estimation, computation of crosses from passive data (strobe track triangulation), as well as the computation of the quality of the crosses for deghosting purposes. The focus is on Bayesian algorithms that operate in the discrete target type probability space, and on the approximations introduced for computational complexity reduction. The centralized algorithms are able to fuse discrete data from a variety of sensors and information sources, including IFF equipment, ESM's, IRST's as well as flight envelopes estimated from track data. All algorithms are asynchronous and can be tuned to handle clutter, erroneous associations as well as missed and erroneous detections. A key to obtain this ability is the inclusion of data forgetting by a procedure for propagation of target type probability states between measurement time instances. Other important properties of the algorithms are their abilities to handle ambiguous data and scenarios. The above aspects are illustrated in a simulations study. The simulation setup includes 46 air targets of 6 different types that are tracked by 5 airborne sensor platforms using ESM's and IRST's as data sources.
Solution of a tridiagonal system of equations on the finite element machine

NASA Technical Reports Server (NTRS)

Bostic, S. W.

1984-01-01

Two parallel algorithms for the solution of tridiagonal systems of equations were implemented on the Finite Element Machine. The Accelerated Parallel Gauss method, an iterative method, and the Buneman algorithm, a direct method, are discussed and execution statistics are presented.
New Factorization Techniques and Parallel (log N) Algorithms for Forward Dynamics Solution of Single Closed-Chain Robot Manipulators

NASA Technical Reports Server (NTRS)

Fijany, Amir

1993-01-01

In this paper parallel 0(log N) algorithms for dynamic simulation of single closed-chain rigid multibody system as specialized to the case of a robot manipulatoar in contact with the environment are developed.
Evaluation of a new parallel numerical parameter optimization algorithm for a dynamical system

NASA Astrophysics Data System (ADS)

Duran, Ahmet; Tuncel, Mehmet

2016-10-01

It is important to have a scalable parallel numerical parameter optimization algorithm for a dynamical system used in financial applications where time limitation is crucial. We use Message Passing Interface parallel programming and present such a new parallel algorithm for parameter estimation. For example, we apply the algorithm to the asset flow differential equations that have been developed and analyzed since 1989 (see [3-6] and references contained therein). We achieved speed-up for some time series to run up to 512 cores (see [10]). Unlike [10], we consider more extensive financial market situations, for example, in presence of low volatility, high volatility and stock market price at a discount/premium to its net asset value with varying magnitude, in this work. Moreover, we evaluated the convergence of the model parameter vector, the nonlinear least squares error and maximum improvement factor to quantify the success of the optimization process depending on the number of initial parameter vectors.
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images

PubMed Central

Wang, Yangping; Wang, Song

2016-01-01

The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU). PMID:28053653

Optical implementation of a parallel out-of-band controller for large broadband ATM switch applications

NASA Astrophysics Data System (ADS)

Cloonan, Thomas J.; Richards, Gaylord W.; Lentine, Anthony L.

1996-03-01

Asynchronous transfer mode (ATM) is rapidly becoming the transport mechanism of choice for the information superhighway, because it promises the bandwidth and flexibility needed for many voice, video and data service offerings. Some industry experts project that the required sizes for ATM switching equipment in the public-switched environment will reach the Tbps range by the beginning of the next decade. This paper analyzes the problems associated with controlling the flow of packets within a broadband ATM switch of this size. The analysis is based on the requirements of the growable packet switch architecture. The paper proposes a novel solution to the problem of hunting paths within an ATM packet switch network. The resulting control scheme is unconventional in two ways. First, it uses an out-of-band control algorithm instead of the more common self-routing approach. In particular, we explore the benefits of using a parallel processor as an out-of-band controller for a growable packet switch distribution network. The processor permits additional levels of parallelism to be added to the out-of-band control function so that path hunts can be performed for all N of the input ports within a single cell interval. The proposed approach is also unconventional because it uses free-space digital optics to guide signals between successive stages of the controller. The paper describes the underlying motivations for implementing an optical out-of-band controller for an ATM switch, and it also describes the logic within a controller node that has been fabricated using a hybrid Si CMOS/GaAs SEED technology. The node uses optical detectors (in GaAs), amplifiers and digital control logic (in Si), and optical modulators (in GaAs). Free-space optical connections between successive device arrays can be provided using either bulk optical elements or micro-optics, but the optical interconnects must provide massive fanout capability. An architectural analysis studying the feasibility of applying free-space optics in this proposed ATM switch controller also is presented.
Methodes iteratives paralleles: Applications en neutronique et en mecanique des fluides

NASA Astrophysics Data System (ADS)

Qaddouri, Abdessamad

Dans cette these, le calcul parallele est applique successivement a la neutronique et a la mecanique des fluides. Dans chacune de ces deux applications, des methodes iteratives sont utilisees pour resoudre le systeme d'equations algebriques resultant de la discretisation des equations du probleme physique. Dans le probleme de neutronique, le calcul des matrices des probabilites de collision (PC) ainsi qu'un schema iteratif multigroupe utilisant une methode inverse de puissance sont parallelises. Dans le probleme de mecanique des fluides, un code d'elements finis utilisant un algorithme iteratif du type GMRES preconditionne est parallelise. Cette these est presentee sous forme de six articles suivis d'une conclusion. Les cinq premiers articles traitent des applications en neutronique, articles qui representent l'evolution de notre travail dans ce domaine. Cette evolution passe par un calcul parallele des matrices des PC et un algorithme multigroupe parallele teste sur un probleme unidimensionnel (article 1), puis par deux algorithmes paralleles l'un mutiregion l'autre multigroupe, testes sur des problemes bidimensionnels (articles 2--3). Ces deux premieres etapes sont suivies par l'application de deux techniques d'acceleration, le rebalancement neutronique et la minimisation du residu aux deux algorithmes paralleles (article 4). Finalement, on a mis en oeuvre l'algorithme multigroupe et le calcul parallele des matrices des PC sur un code de production DRAGON ou les tests sont plus realistes et peuvent etre tridimensionnels (article 5). Le sixieme article (article 6), consacre a l'application a la mecanique des fluides, traite la parallelisation d'un code d'elements finis FES ou le partitionneur de graphe METIS et la librairie PSPARSLIB sont utilises.
Development of Fast Algorithms Using Recursion, Nesting and Iterations for Computational Electromagnetics

NASA Technical Reports Server (NTRS)

Chew, W. C.; Song, J. M.; Lu, C. C.; Weedon, W. H.

1995-01-01

In the first phase of our work, we have concentrated on laying the foundation to develop fast algorithms, including the use of recursive structure like the recursive aggregate interaction matrix algorithm (RAIMA), the nested equivalence principle algorithm (NEPAL), the ray-propagation fast multipole algorithm (RPFMA), and the multi-level fast multipole algorithm (MLFMA). We have also investigated the use of curvilinear patches to build a basic method of moments code where these acceleration techniques can be used later. In the second phase, which is mainly reported on here, we have concentrated on implementing three-dimensional NEPAL on a massively parallel machine, the Connection Machine CM-5, and have been able to obtain some 3D scattering results. In order to understand the parallelization of codes on the Connection Machine, we have also studied the parallelization of 3D finite-difference time-domain (FDTD) code with PML material absorbing boundary condition (ABC). We found that simple algorithms like the FDTD with material ABC can be parallelized very well allowing us to solve within a minute a problem of over a million nodes. In addition, we have studied the use of the fast multipole method and the ray-propagation fast multipole algorithm to expedite matrix-vector multiplication in a conjugate-gradient solution to integral equations of scattering. We find that these methods are faster than LU decomposition for one incident angle, but are slower than LU decomposition when many incident angles are needed as in the monostatic RCS calculations.
An annealed chaotic maximum neural network for bipartite subgraph problem.

PubMed

Wang, Jiahai; Tang, Zheng; Wang, Ronglong

2004-04-01

In this paper, based on maximum neural network, we propose a new parallel algorithm that can help the maximum neural network escape from local minima by including a transient chaotic neurodynamics for bipartite subgraph problem. The goal of the bipartite subgraph problem, which is an NP- complete problem, is to remove the minimum number of edges in a given graph such that the remaining graph is a bipartite graph. Lee et al. presented a parallel algorithm using the maximum neural model (winner-take-all neuron model) for this NP- complete problem. The maximum neural model always guarantees a valid solution and greatly reduces the search space without a burden on the parameter-tuning. However, the model has a tendency to converge to a local minimum easily because it is based on the steepest descent method. By adding a negative self-feedback to the maximum neural network, we proposed a new parallel algorithm that introduces richer and more flexible chaotic dynamics and can prevent the network from getting stuck at local minima. After the chaotic dynamics vanishes, the proposed algorithm is then fundamentally reined by the gradient descent dynamics and usually converges to a stable equilibrium point. The proposed algorithm has the advantages of both the maximum neural network and the chaotic neurodynamics. A large number of instances have been simulated to verify the proposed algorithm. The simulation results show that our algorithm finds the optimum or near-optimum solution for the bipartite subgraph problem superior to that of the best existing parallel algorithms.
Cloud computing-based TagSNP selection algorithm for human genome data.

PubMed

Hung, Che-Lun; Chen, Wen-Pei; Hua, Guan-Jie; Zheng, Huiru; Tsai, Suh-Jen Jane; Lin, Yaw-Ling

2015-01-05

Single nucleotide polymorphisms (SNPs) play a fundamental role in human genetic variation and are used in medical diagnostics, phylogeny construction, and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. Haplotypes are regions of linked genetic variants that are closely spaced on the genome and tend to be inherited together. Genetics research has revealed SNPs within certain haplotype blocks that introduce few distinct common haplotypes into most of the population. Haplotype block structures are used in association-based methods to map disease genes. In this paper, we propose an efficient algorithm for identifying haplotype blocks in the genome. In chromosomal haplotype data retrieved from the HapMap project website, the proposed algorithm identified longer haplotype blocks than an existing algorithm. To enhance its performance, we extended the proposed algorithm into a parallel algorithm that copies data in parallel via the Hadoop MapReduce framework. The proposed MapReduce-paralleled combinatorial algorithm performed well on real-world data obtained from the HapMap dataset; the improvement in computational efficiency was proportional to the number of processors used.
Cloud Computing-Based TagSNP Selection Algorithm for Human Genome Data

PubMed Central

Hung, Che-Lun; Chen, Wen-Pei; Hua, Guan-Jie; Zheng, Huiru; Tsai, Suh-Jen Jane; Lin, Yaw-Ling

2015-01-01

Single nucleotide polymorphisms (SNPs) play a fundamental role in human genetic variation and are used in medical diagnostics, phylogeny construction, and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. Haplotypes are regions of linked genetic variants that are closely spaced on the genome and tend to be inherited together. Genetics research has revealed SNPs within certain haplotype blocks that introduce few distinct common haplotypes into most of the population. Haplotype block structures are used in association-based methods to map disease genes. In this paper, we propose an efficient algorithm for identifying haplotype blocks in the genome. In chromosomal haplotype data retrieved from the HapMap project website, the proposed algorithm identified longer haplotype blocks than an existing algorithm. To enhance its performance, we extended the proposed algorithm into a parallel algorithm that copies data in parallel via the Hadoop MapReduce framework. The proposed MapReduce-paralleled combinatorial algorithm performed well on real-world data obtained from the HapMap dataset; the improvement in computational efficiency was proportional to the number of processors used. PMID:25569088
Families of Graph Algorithms: SSSP Case Study

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kanewala Appuhamilage, Thejaka Amila Jay; Zalewski, Marcin J.; Lumsdaine, Andrew

2017-08-28

Single-Source Shortest Paths (SSSP) is a well-studied graph problem. Examples of SSSP algorithms include the original Dijkstra’s algorithm and the parallel Δ-stepping and KLA-SSSP algorithms. In this paper, we use a novel Abstract Graph Machine (AGM) model to show that all these algorithms share a common logic and differ from one another by the order in which they perform work. We use the AGM model to thoroughly analyze the family of algorithms that arises from the common logic. We start with the basic algorithm without any ordering (Chaotic), and then we derive the existing and new algorithms by methodically exploringmore » semantic and spatial ordering of work. Our experimental results show that new derived algorithms show better performance than the existing distributed memory parallel algorithms, especially at higher scales.« less
Reverse time migration: A seismic processing application on the connection machine

NASA Technical Reports Server (NTRS)

Fiebrich, Rolf-Dieter

1987-01-01

The implementation of a reverse time migration algorithm on the Connection Machine, a massively parallel computer is described. Essential architectural features of this machine as well as programming concepts are presented. The data structures and parallel operations for the implementation of the reverse time migration algorithm are described. The algorithm matches the Connection Machine architecture closely and executes almost at the peak performance of this machine.
LASER APPLICATIONS AND OTHER TOPICS IN QUANTUM ELECTRONICS: Application of the stochastic parallel gradient descent algorithm for numerical simulation and analysis of the coherent summation of radiation from fibre amplifiers

NASA Astrophysics Data System (ADS)

Zhou, Pu; Wang, Xiaolin; Li, Xiao; Chen, Zilum; Xu, Xiaojun; Liu, Zejin

2009-10-01

Coherent summation of fibre laser beams, which can be scaled to a relatively large number of elements, is simulated by using the stochastic parallel gradient descent (SPGD) algorithm. The applicability of this algorithm for coherent summation is analysed and its optimisaton parameters and bandwidth limitations are studied.
GPU accelerated fuzzy connected image segmentation by using CUDA.

PubMed

Zhuge, Ying; Cao, Yong; Miller, Robert W

2009-01-01

Image segmentation techniques using fuzzy connectedness principles have shown their effectiveness in segmenting a variety of objects in several large applications in recent years. However, one problem of these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays commodity graphics hardware provides high parallel computing power. In this paper, we present a parallel fuzzy connected image segmentation algorithm on Nvidia's Compute Unified Device Architecture (CUDA) platform for segmenting large medical image data sets. Our experiments based on three data sets with small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 7.2x, 7.3x, and 14.4x, correspondingly, for the three data sets over the sequential implementation of fuzzy connected image segmentation algorithm on CPU.
Parameter Estimation of Fractional-Order Chaotic Systems by Using Quantum Parallel Particle Swarm Optimization Algorithm

PubMed Central

Huang, Yu; Guo, Feng; Li, Yongling; Liu, Yufeng

2015-01-01

Parameter estimation for fractional-order chaotic systems is an important issue in fractional-order chaotic control and synchronization and could be essentially formulated as a multidimensional optimization problem. A novel algorithm called quantum parallel particle swarm optimization (QPPSO) is proposed to solve the parameter estimation for fractional-order chaotic systems. The parallel characteristic of quantum computing is used in QPPSO. This characteristic increases the calculation of each generation exponentially. The behavior of particles in quantum space is restrained by the quantum evolution equation, which consists of the current rotation angle, individual optimal quantum rotation angle, and global optimal quantum rotation angle. Numerical simulation based on several typical fractional-order systems and comparisons with some typical existing algorithms show the effectiveness and efficiency of the proposed algorithm. PMID:25603158
Matching pursuit parallel decomposition of seismic data

NASA Astrophysics Data System (ADS)

Li, Chuanhui; Zhang, Fanchang

2017-07-01

In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.
Maintaining High Assurance in Asynchronous Messaging

DTIC Science & Technology

2015-10-24

Assurance in Asynchronous Messaging Kevin E. Foltz and William R. Simpson Abstract—Asynchronous messaging is the delivery of a message without... integrity , and confidentiality guarantees. End-to-end security for asynchronous messaging must be provided by the asynchronous messaging layer itself... continuing its processing. At the completion of message transmission, the sender does not know when or whether the receiver received it. The message
Accelerating the Gillespie Exact Stochastic Simulation Algorithm using hybrid parallel execution on graphics processing units.

PubMed

Komarov, Ivan; D'Souza, Roshan M

2012-01-01

The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
A parallel implementation of an off-lattice individual-based model of multicellular populations

NASA Astrophysics Data System (ADS)

Harvey, Daniel G.; Fletcher, Alexander G.; Osborne, James M.; Pitt-Francis, Joe

2015-07-01

As computational models of multicellular populations include ever more detailed descriptions of biophysical and biochemical processes, the computational cost of simulating such models limits their ability to generate novel scientific hypotheses and testable predictions. While developments in microchip technology continue to increase the power of individual processors, parallel computing offers an immediate increase in available processing power. To make full use of parallel computing technology, it is necessary to develop specialised algorithms. To this end, we present a parallel algorithm for a class of off-lattice individual-based models of multicellular populations. The algorithm divides the spatial domain between computing processes and comprises communication routines that ensure the model is correctly simulated on multiple processors. The parallel algorithm is shown to accurately reproduce the results of a deterministic simulation performed using a pre-existing serial implementation. We test the scaling of computation time, memory use and load balancing as more processes are used to simulate a cell population of fixed size. We find approximate linear scaling of both speed-up and memory consumption on up to 32 processor cores. Dynamic load balancing is shown to provide speed-up for non-regular spatial distributions of cells in the case of a growing population.
A distributed-memory approximation algorithm for maximum weight perfect bipartite matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azad, Ariful; Buluc, Aydin; Li, Xiaoye S.

We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e. the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we proposemore » a fully parallel distributed memory algorithm that first generates a perfect matching and then searches for weightaugmenting cycles of length four in parallel and iteratively augments the matching with a vertex disjoint set of such cycles. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.« less
Optimal Golomb Ruler Sequences Generation for Optical WDM Systems: A Novel Parallel Hybrid Multi-objective Bat Algorithm

NASA Astrophysics Data System (ADS)

Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena

2017-02-01

In real-life, multi-objective engineering design problems are very tough and time consuming optimization problems due to their high degree of nonlinearities, complexities and inhomogeneity. Nature-inspired based multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes original multi-objective Bat algorithm (MOBA) and its extended form, namely, novel parallel hybrid multi-objective Bat algorithm (PHMOBA) to generate shortest length Golomb ruler called optimal Golomb ruler (OGR) sequences at a reasonable computation time. The OGRs found their application in optical wavelength division multiplexing (WDM) systems as channel-allocation algorithm to reduce the four-wave mixing (FWM) crosstalk. The performances of both the proposed algorithms to generate OGRs as optical WDM channel-allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations conclude that the proposed parallel hybrid multi-objective Bat algorithm works efficiently as compared to original multi-objective Bat algorithm and other existing algorithms to generate OGRs for optical WDM systems. The algorithm PHMOBA to generate OGRs, has higher convergence and success rate than original MOBA. The efficiency improvement of proposed PHMOBA to generate OGRs up to 20-marks, in terms of ruler length and total optical channel bandwidth (TBW) is 100 %, whereas for original MOBA is 85 %. Finally the implications for further research are also discussed.
A real time microcomputer implementation of sensor failure detection for turbofan engines

NASA Technical Reports Server (NTRS)

Delaat, John C.; Merrill, Walter C.

1989-01-01

An algorithm was developed which detects, isolates, and accommodates sensor failures using analytical redundancy. The performance of this algorithm was demonstrated on a full-scale F100 turbofan engine. The algorithm was implemented in real-time on a microprocessor-based controls computer which includes parallel processing and high order language programming. Parallel processing was used to achieve the required computational power for the real-time implementation. High order language programming was used in order to reduce the programming and maintenance costs of the algorithm implementation software. The sensor failure algorithm was combined with an existing multivariable control algorithm to give a complete control implementation with sensor analytical redundancy. The real-time microprocessor implementation of the algorithm which resulted in the successful completion of the algorithm engine demonstration, is described.
Optical transmission testing based on asynchronous sampling techniques

NASA Astrophysics Data System (ADS)

Mrozek, T.; Perlicki, K.; Wilczewski, G.

2016-09-01

This paper presents a method of analysis of images obtained with the Asynchronous Delay Tap Sampling technique, which is used for simultaneous monitoring of a number of phenomena in the physical layer of an optical network. This method allows visualization of results in a form of an optical signal's waveform (characteristics depicting phase portraits). Depending on a specific phenomenon being observed (i.e.: chromatic dispersion, polarization mode dispersion and ASE noise), the shape of the waveform changes. Herein presented original waveforms were acquired utilizing the OptSim 4.0 simulation package. After specific simulation testing, the obtained numerical data was transformed into an image form, that was further subjected to the analysis using authors' custom algorithms. These algorithms utilize various pixel operations and creation of reports each image might be characterized with. Each individual report shows the number of black pixels being present in the specific image segment. Afterwards, generated reports are compared with each other, across the original-impaired relationship. The differential report is created which consists of a "binary key" that shows the increase in the number of pixels in each particular segment. The ultimate aim of this work is to find the correlation between the generated binary keys and the analyzed common phenomenon being observed, allowing identification of the type of interference occurring. In the further course of the work it is evitable to determine their respective values. The presented work delivers the first objective - the ability to recognize interference.
SPEEDES - A multiple-synchronization environment for parallel discrete-event simulation

NASA Technical Reports Server (NTRS)

Steinman, Jeff S.

1992-01-01

Synchronous Parallel Environment for Emulation and Discrete-Event Simulation (SPEEDES) is a unified parallel simulation environment. It supports multiple-synchronization protocols without requiring users to recompile their code. When a SPEEDES simulation runs on one node, all the extra parallel overhead is removed automatically at run time. When the same executable runs in parallel, the user preselects the synchronization algorithm from a list of options. SPEEDES currently runs on UNIX networks and on the California Institute of Technology/Jet Propulsion Laboratory Mark III Hypercube. SPEEDES also supports interactive simulations. Featured in the SPEEDES environment is a new parallel synchronization approach called Breathing Time Buckets. This algorithm uses some of the conservative techniques found in Time Bucket synchronization, along with the optimism that characterizes the Time Warp approach. A mathematical model derived from first principles predicts the performance of Breathing Time Buckets. Along with the Breathing Time Buckets algorithm, this paper discusses the rules for processing events in SPEEDES, describes the implementation of various other synchronization protocols supported by SPEEDES, describes some new ones for the future, discusses interactive simulations, and then gives some performance results.

Solving the multiple-set split equality common fixed-point problem of firmly quasi-nonexpansive operators.

PubMed

Zhao, Jing; Zong, Haili

2018-01-01

In this paper, we propose parallel and cyclic iterative algorithms for solving the multiple-set split equality common fixed-point problem of firmly quasi-nonexpansive operators. We also combine the process of cyclic and parallel iterative methods and propose two mixed iterative algorithms. Our several algorithms do not need any prior information about the operator norms. Under mild assumptions, we prove weak convergence of the proposed iterative sequences in Hilbert spaces. As applications, we obtain several iterative algorithms to solve the multiple-set split equality problem.
Relationship between cortical state and spiking activity in the lateral geniculate nucleus of marmosets

PubMed Central

Pietersen, Alexander N.J.; Cheong, Soon Keen; Munn, Brandon; Gong, Pulin; Solomon, Samuel G.

2017-01-01

Key points How parallel are the primate visual pathways? In the present study, we demonstrate that parallel visual pathways in the dorsal lateral geniculate nucleus (LGN) show distinct patterns of interaction with rhythmic activity in the primary visual cortex (V1).In the V1 of anaesthetized marmosets, the EEG frequency spectrum undergoes transient changes that are characterized by fluctuations in delta‐band EEG power.We show that, on multisecond timescales, spiking activity in an evolutionary primitive (koniocellular) LGN pathway is specifically linked to these slow EEG spectrum changes. By contrast, on subsecond (delta frequency) timescales, cortical oscillations can entrain spiking activity throughout the entire LGN.Our results are consistent with the hypothesis that, in waking animals, the koniocellular pathway selectively participates in brain circuits controlling vigilance and attention. Abstract The major afferent cortical pathway in the visual system passes through the dorsal lateral geniculate nucleus (LGN), where nerve signals originating in the eye can first interact with brain circuits regulating visual processing, vigilance and attention. In the present study, we investigated how ongoing and visually driven activity in magnocellular (M), parvocellular (P) and koniocellular (K) layers of the LGN are related to cortical state. We recorded extracellular spiking activity in the LGN simultaneously with local field potentials (LFP) in primary visual cortex, in sufentanil‐anaesthetized marmoset monkeys. We found that asynchronous cortical states (marked by low power in delta‐band LFPs) are linked to high spike rates in K cells (but not P cells or M cells), on multisecond timescales. Cortical asynchrony precedes the increases in K cell spike rates by 1–3 s, implying causality. At subsecond timescales, the spiking activity in many cells of all (M, P and K) classes is phase‐locked to delta waves in the cortical LFP, and more cells are phase‐locked during synchronous cortical states than during asynchronous cortical states. The switch from low‐to‐high spike rates in K cells does not degrade their visual signalling capacity. By contrast, during asynchronous cortical states, the fidelity of visual signals transmitted by K cells is improved, probably because K cell responses become less rectified. Overall, the data show that slow fluctuations in cortical state are selectively linked to K pathway spiking activity, whereas delta‐frequency cortical oscillations entrain spiking activity throughout the entire LGN, in anaesthetized marmosets. PMID:28116750
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation

NASA Technical Reports Server (NTRS)

Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.

1996-01-01

We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and, economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.
Convergence issues in domain decomposition parallel computation of hovering rotor

NASA Astrophysics Data System (ADS)

Xiao, Zhongyun; Liu, Gang; Mou, Bin; Jiang, Xiong

2018-05-01

Implicit LU-SGS time integration algorithm has been widely used in parallel computation in spite of its lack of information from adjacent domains. When applied to parallel computation of hovering rotor flows in a rotating frame, it brings about convergence issues. To remedy the problem, three LU factorization-based implicit schemes (consisting of LU-SGS, DP-LUR and HLU-SGS) are investigated comparatively. A test case of pure grid rotation is designed to verify these algorithms, which show that LU-SGS algorithm introduces errors on boundary cells. When partition boundaries are circumferential, errors arise in proportion to grid speed, accumulating along with the rotation, and leading to computational failure in the end. Meanwhile, DP-LUR and HLU-SGS methods show good convergence owing to boundary treatment which are desirable in domain decomposition parallel computations.
An efficient parallel algorithm for the calculation of canonical MP2 energies.

PubMed

Baker, Jon; Pulay, Peter

2002-09-01

We present the parallel version of a previous serial algorithm for the efficient calculation of canonical MP2 energies (Pulay, P.; Saebo, S.; Wolinski, K. Chem Phys Lett 2001, 344, 543). It is based on the Saebo-Almlöf direct-integral transformation, coupled with an efficient prescreening of the AO integrals. The parallel algorithm avoids synchronization delays by spawning a second set of slaves during the bin-sort prior to the second half-transformation. Results are presented for systems with up to 2000 basis functions. MP2 energies for molecules with 400-500 basis functions can be routinely calculated to microhartree accuracy on a small number of processors (6-8) in a matter of minutes with modern PC-based parallel computers. Copyright 2002 Wiley Periodicals, Inc. J Comput Chem 23: 1150-1156, 2002
The factorization of large composite numbers on the MPP

NASA Technical Reports Server (NTRS)

Mckurdy, Kathy J.; Wunderlich, Marvin C.

1987-01-01

The continued fraction method for factoring large integers (CFRAC) was an ideal algorithm to be implemented on a massively parallel computer such as the Massively Parallel Processor (MPP). After much effort, the first 60 digit number was factored on the MPP using about 6 1/2 hours of array time. Although this result added about 10 digits to the size number that could be factored using CFRAC on a serial machine, it was already badly beaten by the implementation of Davis and Holdridge on the CRAY-1 using the quadratic sieve, an algorithm which is clearly superior to CFRAC for large numbers. An algorithm is illustrated which is ideally suited to the single instruction multiple data (SIMD) massively parallel architecture and some of the modifications which were needed in order to make the parallel implementation effective and efficient are described.
Fast parallel molecular algorithms for DNA-based computation: factoring integers.

PubMed

Chang, Weng-Long; Guo, Minyi; Ho, Michael Shan-Hui

2005-06-01

The RSA public-key cryptosystem is an algorithm that converts input data to an unrecognizable encryption and converts the unrecognizable data back into its original decryption form. The security of the RSA public-key cryptosystem is based on the difficulty of factoring the product of two large prime numbers. This paper demonstrates to factor the product of two large prime numbers, and is a breakthrough in basic biological operations using a molecular computer. In order to achieve this, we propose three DNA-based algorithms for parallel subtractor, parallel comparator, and parallel modular arithmetic that formally verify our designed molecular solutions for factoring the product of two large prime numbers. Furthermore, this work indicates that the cryptosystems using public-key are perhaps insecure and also presents clear evidence of the ability of molecular computing to perform complicated mathematical operations.
Parallel algorithm of VLBI software correlator under multiprocessor environment

NASA Astrophysics Data System (ADS)

Zheng, Weimin; Zhang, Dong

2007-11-01

The correlator is the key signal processing equipment of a Very Lone Baseline Interferometry (VLBI) synthetic aperture telescope. It receives the mass data collected by the VLBI observatories and produces the visibility function of the target, which can be used to spacecraft position, baseline length measurement, synthesis imaging, and other scientific applications. VLBI data correlation is a task of data intensive and computation intensive. This paper presents the algorithms of two parallel software correlators under multiprocessor environments. A near real-time correlator for spacecraft tracking adopts the pipelining and thread-parallel technology, and runs on the SMP (Symmetric Multiple Processor) servers. Another high speed prototype correlator using the mixed Pthreads and MPI (Massage Passing Interface) parallel algorithm is realized on a small Beowulf cluster platform. Both correlators have the characteristic of flexible structure, scalability, and with 10-station data correlating abilities.
Exploring asynchronous brainstorming in large groups: a field comparison of serial and parallel subgroups.

PubMed

de Vreede, Gert-Jan; Briggs, Robert O; Reiter-Palmon, Roni

2010-04-01

The aim of this study was to compare the results of two different modes of using multiple groups (instead of one large group) to identify problems and develop solutions. Many of the complex problems facing organizations today require the use of very large groups or collaborations of groups from multiple organizations. There are many logistical problems associated with the use of such large groups, including the ability to bring everyone together at the same time and location. A field study involved two different organizations and compared productivity and satisfaction of group. The approaches included (a) multiple small groups, each completing the entire process from start to end and combining the results at the end (parallel mode); and (b) multiple subgroups, each building on the work provided by previous subgroups (serial mode). Groups using the serial mode produced more elaborations compared with parallel groups, whereas parallel groups produced more unique ideas compared with serial groups. No significant differences were found related to satisfaction with process and outcomes between the two modes. Preferred mode depends on the type of task facing the group. Parallel groups are more suited for tasks for which a variety of new ideas are needed, whereas serial groups are best suited when elaboration and in-depth thinking on the solution are required. Results of this research can guide the development of facilitated sessions of large groups or "teams of teams."
A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval.

ERIC Educational Resources Information Center

Lundquist, Carol; Frieder, Ophir; Holmes, David O.; Grossman, David

1999-01-01

Describes a scalable, parallel, relational database-drive information retrieval engine. To support portability across a wide range of execution environments, all algorithms adhere to the SQL-92 standard. By incorporating relevance feedback algorithms, accuracy is enhanced over prior database-driven information retrieval efforts. Presents…
Creating IRT-Based Parallel Test Forms Using the Genetic Algorithm Method

ERIC Educational Resources Information Center

Sun, Koun-Tem; Chen, Yu-Jen; Tsai, Shu-Yen; Cheng, Chien-Fen

2008-01-01

In educational measurement, the construction of parallel test forms is often a combinatorial optimization problem that involves the time-consuming selection of items to construct tests having approximately the same test information functions (TIFs) and constraints. This article proposes a novel method, genetic algorithm (GA), to construct parallel…
A Survey of Parallel Sorting Algorithms.

DTIC Science & Technology

1981-12-01

see that, in this algorithm, each Processor i, for 1 itp -2, interacts directly only with Processors i+l and i-l. Processor j 0 only interacts with...Chan76] Chandra, A.K., "Maximal Parallelism in Matrix Multiplication," IBM Report RC. 6193, Watson Research Center, Yorktown Heights, N.Y., October 1976
Parallel k-means++

DOE Office of Scientific and Technical Information (OSTI.GOV)

A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique.more » We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.« less
A Strapdown Interial Navigation System/Beidou/Doppler Velocity Log Integrated Navigation Algorithm Based on a Cubature Kalman Filter

PubMed Central

Gao, Wei; Zhang, Ya; Wang, Jianguo

2014-01-01

The integrated navigation system with strapdown inertial navigation system (SINS), Beidou (BD) receiver and Doppler velocity log (DVL) can be used in marine applications owing to the fact that the redundant and complementary information from different sensors can markedly improve the system accuracy. However, the existence of multisensor asynchrony will introduce errors into the system. In order to deal with the problem, conventionally the sampling interval is subdivided, which increases the computational complexity. In this paper, an innovative integrated navigation algorithm based on a Cubature Kalman filter (CKF) is proposed correspondingly. A nonlinear system model and observation model for the SINS/BD/DVL integrated system are established to more accurately describe the system. By taking multi-sensor asynchronization into account, a new sampling principle is proposed to make the best use of each sensor's information. Further, CKF is introduced in this new algorithm to enable the improvement of the filtering accuracy. The performance of this new algorithm has been examined through numerical simulations. The results have shown that the positional error can be effectively reduced with the new integrated navigation algorithm. Compared with the traditional algorithm based on EKF, the accuracy of the SINS/BD/DVL integrated navigation system is improved, making the proposed nonlinear integrated navigation algorithm feasible and efficient. PMID:24434842
DOE Office of Scientific and Technical Information (OSTI.GOV)

Deptuch, G. W.; Fahim, F.; Grybos, P.

An on-chip implementable algorithm for allocation of an X-ray photon imprint, called a hit, to a single pixel in the presence of charge sharing in a highly segmented pixel detector is described. Its proof-of-principle implementation is also given supported by the results of tests using a highly collimated X-ray photon beam from a synchrotron source. The algorithm handles asynchronous arrivals of X-ray photons. Activation of groups of pixels, comparisons of peak amplitudes of pulses within an active neighborhood and finally latching of the results of these comparisons constitute the three procedural steps of the algorithm. A grouping of pixels tomore » one virtual pixel that recovers composite signals and event driven strobes to control comparisons of fractional signals between neighboring pixels are the actuators of the algorithm. The circuitry necessary to implement the algorithm requires an extensive inter-pixel connection grid of analog and digital signals that are exchanged between pixels. A test-circuit implementation of the algorithm was achieved with a small array of 32×32 pixels and the device was exposed to an 8 keV highly collimated to a diameter of 3 μm X-ray beam. The results of these tests are given in the paper assessing physical implementation of the algorithm.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Deptuch, Grzegorz W.; Fahim, Farah; Grybos, Pawel

An on-chip implementable algorithm for allocation of an X-ray photon imprint, called a hit, to a single pixel in the presence of charge sharing in a highly segmented pixel detector is described. Its proof-of-principle implementation is also given supported by the results of tests using a highly collimated X-ray photon beam from a synchrotron source. The algorithm handles asynchronous arrivals of X-ray photons. Activation of groups of pixels, comparisons of peak amplitudes of pulses within an active neighborhood and finally latching of the results of these comparisons constitute the three procedural steps of the algorithm. A grouping of pixels tomore » one virtual pixel, that recovers composite signals and event driven strobes, to control comparisons of fractional signals between neighboring pixels are the actuators of the algorithm. The circuitry necessary to implement the algorithm requires an extensive inter-pixel connection grid of analog and digital signals, that are exchanged between pixels. A test-circuit implementation of the algorithm was achieved with a small array of 32 × 32 pixels and the device was exposed to an 8 keV highly collimated to a diameter of 3-μm X-ray beam. Furthermore, the results of these tests are given in this paper assessing physical implementation of the algorithm.« less
Parallel-vector unsymmetric Eigen-Solver on high performance computers

NASA Technical Reports Server (NTRS)

Nguyen, Duc T.; Jiangning, Qin

1993-01-01

The popular QR algorithm for solving all eigenvalues of an unsymmetric matrix is reviewed. Among the basic components in the QR algorithm, it was concluded from this study, that the reduction of an unsymmetric matrix to a Hessenberg form (before applying the QR algorithm itself) can be done effectively by exploiting the vector speed and multiple processors offered by modern high-performance computers. Numerical examples of several test cases have indicated that the proposed parallel-vector algorithm for converting a given unsymmetric matrix to a Hessenberg form offers computational advantages over the existing algorithm. The time saving obtained by the proposed methods is increased as the problem size increased.
Heinrich events simulated across the glacial

NASA Astrophysics Data System (ADS)

Ziemen, F. A.; Mikolajewicz, U.

2015-12-01

Heinrich events are among the most prominent climate change events recorded in proxies across the northern hemisphere. They are the archetype of ice sheet — climate interactions on millennial time scales. Nevertheless, the exact mechanisms that cause Heinrich events are still under discussion, and their climatic consequences are far from being fully understood. We contribute to answering the open questions by studying Heinrich events in a coupled ice sheet model (ISM) atmosphere-ocean-vegetation general circulation model (AOVGCM) framework, where this variability occurs as part of the model generated internal variability. The setup consists of a northern hemisphere setup of the modified Parallel Ice Sheet Model (mPISM) coupled to the global AOVGCM ECHAM5/MPIOM/LPJ. The simulations were performed fully coupled and with transient orbital and greenhouse gas forcing. They span from several millennia before the last glacial maximum into the deglaciation. We analyze simulations where the ISM is coupled asynchronously to the AOVGCM and simulations where the ISM and the ocean model are coupled synchronously and the atmosphere model is coupled asynchronously to them. The modeled Heinrich events show a marked influence of the ice discharge on the Atlantic circulation and heat transport.
Parallel simulation today

NASA Technical Reports Server (NTRS)

Nicol, David; Fujimoto, Richard

1992-01-01

This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.
Accelerating global optimization of aerodynamic shapes using a new surrogate-assisted parallel genetic algorithm

NASA Astrophysics Data System (ADS)

Ebrahimi, Mehdi; Jahangirian, Alireza

2017-12-01

An efficient strategy is presented for global shape optimization of wing sections with a parallel genetic algorithm. Several computational techniques are applied to increase the convergence rate and the efficiency of the method. A variable fidelity computational evaluation method is applied in which the expensive Navier-Stokes flow solver is complemented by an inexpensive multi-layer perceptron neural network for the objective function evaluations. A population dispersion method that consists of two phases, of exploration and refinement, is developed to improve the convergence rate and the robustness of the genetic algorithm. Owing to the nature of the optimization problem, a parallel framework based on the master/slave approach is used. The outcomes indicate that the method is able to find the global optimum with significantly lower computational time in comparison to the conventional genetic algorithm.

Field-Programmable Gate Array Computer in Structural Analysis: An Initial Exploration

NASA Technical Reports Server (NTRS)

Singleterry, Robert C., Jr.; Sobieszczanski-Sobieski, Jaroslaw; Brown, Samuel

2002-01-01

This paper reports on an initial assessment of using a Field-Programmable Gate Array (FPGA) computational device as a new tool for solving structural mechanics problems. A FPGA is an assemblage of binary gates arranged in logical blocks that are interconnected via software in a manner dependent on the algorithm being implemented and can be reprogrammed thousands of times per second. In effect, this creates a computer specialized for the problem that automatically exploits all the potential for parallel computing intrinsic in an algorithm. This inherent parallelism is the most important feature of the FPGA computational environment. It is therefore important that if a problem offers a choice of different solution algorithms, an algorithm of a higher degree of inherent parallelism should be selected. It is found that in structural analysis, an 'analog computer' style of programming, which solves problems by direct simulation of the terms in the governing differential equations, yields a more favorable solution algorithm than current solution methods. This style of programming is facilitated by a 'drag-and-drop' graphic programming language that is supplied with the particular type of FPGA computer reported in this paper. Simple examples in structural dynamics and statics illustrate the solution approach used. The FPGA system also allows linear scalability in computing capability. As the problem grows, the number of FPGA chips can be increased with no loss of computing efficiency due to data flow or algorithmic latency that occurs when a single problem is distributed among many conventional processors that operate in parallel. This initial assessment finds the FPGA hardware and software to be in their infancy in regard to the user conveniences; however, they have enormous potential for shrinking the elapsed time of structural analysis solutions if programmed with algorithms that exhibit inherent parallelism and linear scalability. This potential warrants further development of FPGA-tailored algorithms for structural analysis.
Reducing The Risk Of Fires In Conveyor Transport

NASA Astrophysics Data System (ADS)

Cheremushkina, M. S.; Poddubniy, D. A.

2017-01-01

The paper deals with the actual problem of increasing the safety of operation of belt conveyors in mines. Was developed the control algorithm that meets the technical requirements of the mine belt conveyors, reduces the risk of fires of conveyors belt, and enables energy and resource savings taking into account random sort of traffic. The most effective method of decision such tasks is the construction of control systems with the use of variable speed drives for asynchronous motors. Was designed the mathematical model of the system "variable speed multiengine drive - conveyor - control system of conveyors", that takes into account the dynamic processes occurring in the elements of the transport system, provides an assessment of the energy efficiency of application the developed algorithms, which allows to reduce the dynamic overload in the belt to (15-20)%.
Parallel algorithms for boundary value problems

NASA Technical Reports Server (NTRS)

Lin, Avi

1990-01-01

A general approach to solve boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step where all the P available processors work in parallel, and the global step where one processor solves a tridiagonal linear system of the order P. The main advantages of this approach are two fold. First, this suggested approach is very flexible, especially in the local step and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Secondly, the communication complexity is very small and thus can be used as easily with shared memory machines. Several examples for using this strategy are discussed.
Parallel k-means++ for Multiple Shared-Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mackey, Patrick S.; Lewis, Robert R.

2016-09-22

In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varyingmore » data sizes.« less
An O(log sup 2 N) parallel algorithm for computing the eigenvalues of a symmetric tridiagonal matrix

NASA Technical Reports Server (NTRS)

Swarztrauber, Paul N.

1989-01-01

An O(log sup 2 N) parallel algorithm is presented for computing the eigenvalues of a symmetric tridiagonal matrix using a parallel algorithm for computing the zeros of the characteristic polynomial. The method is based on a quadratic recurrence in which the characteristic polynomial is constructed on a binary tree from polynomials whose degree doubles at each level. Intervals that contain exactly one zero are determined by the zeros of polynomials at the previous level which ensures that different processors compute different zeros. The exact behavior of the polynomials at the interval endpoints is used to eliminate the usual problems induced by finite precision arithmetic.
Parallel matrix multiplication on the Connection Machine

NASA Technical Reports Server (NTRS)

Tichy, Walter F.

1988-01-01

Matrix multiplication is a computation and communication intensive problem. Six parallel algorithms for matrix multiplication on the Connection Machine are presented and compared with respect to their performance and processor usage. For n by n matrices, the algorithms have theoretical running times of O(n to the 2nd power log n), O(n log n), O(n), and O(log n), and require n, n to the 2nd power, n to the 2nd power, and n to the 3rd power processors, respectively. With careful attention to communication patterns, the theoretically predicted runtimes can indeed be achieved in practice. The parallel algorithms illustrate the tradeoffs between performance, communication cost, and processor usage.
High-speed prediction of crystal structures for organic molecules

NASA Astrophysics Data System (ADS)

Obata, Shigeaki; Goto, Hitoshi

2015-02-01

We developed a master-worker type parallel algorithm for allocating tasks of crystal structure optimizations to distributed compute nodes, in order to improve a performance of simulations for crystal structure predictions. The performance experiments were demonstrated on TUT-ADSIM supercomputer system (HITACHI HA8000-tc/HT210). The experimental results show that our parallel algorithm could achieve speed-ups of 214 and 179 times using 256 processor cores on crystal structure optimizations in predictions of crystal structures for 3-aza-bicyclo(3.3.1)nonane-2,4-dione and 2-diazo-3,5-cyclohexadiene-1-one, respectively. We expect that this parallel algorithm is always possible to reduce computational costs of any crystal structure predictions.
A Domain Decomposition Parallelization of the Fast Marching Method

NASA Technical Reports Server (NTRS)

Herrmann, M.

2003-01-01

In this paper, the first domain decomposition parallelization of the Fast Marching Method for level sets has been presented. Parallel speedup has been demonstrated in both the optimal and non-optimal domain decomposition case. The parallel performance of the proposed method is strongly dependent on load balancing separately the number of nodes on each side of the interface. A load imbalance of nodes on either side of the domain leads to an increase in communication and rollback operations. Furthermore, the amount of inter-domain communication can be reduced by aligning the inter-domain boundaries with the interface normal vectors. In the case of optimal load balancing and aligned inter-domain boundaries, the proposed parallel FMM algorithm is highly efficient, reaching efficiency factors of up to 0.98. Future work will focus on the extension of the proposed parallel algorithm to higher order accuracy. Also, to further enhance parallel performance, the coupling of the domain decomposition parallelization to the G(sub 0)-based parallelization will be investigated.
Hybrid optical CDMA-FSO communications network under spatially correlated gamma-gamma scintillation.

PubMed

Jurado-Navas, Antonio; Raddo, Thiago R; Garrido-Balsells, José María; Borges, Ben-Hur V; Olmos, Juan José Vegas; Monroy, Idelfonso Tafur

2016-07-25

In this paper, we propose a new hybrid network solution based on asynchronous optical code-division multiple-access (OCDMA) and free-space optical (FSO) technologies for last-mile access networks, where fiber deployment is impractical. The architecture of the proposed hybrid OCDMA-FSO network is thoroughly described. The users access the network in a fully asynchronous manner by means of assigned fast frequency hopping (FFH)-based codes. In the FSO receiver, an equal gain-combining technique is employed along with intensity modulation and direct detection. New analytical formalisms for evaluating the average bit error rate (ABER) performance are also proposed. These formalisms, based on the spatially correlated gamma-gamma statistical model, are derived considering three distinct scenarios, namely, uncorrelated, totally correlated, and partially correlated channels. Numerical results show that users can successfully achieve error-free ABER levels for the three scenarios considered as long as forward error correction (FEC) algorithms are employed. Therefore, OCDMA-FSO networks can be a prospective alternative to deliver high-speed communication services to access networks with deficient fiber infrastructure.
Alternative majority-voting methods for real-time computing systems

NASA Technical Reports Server (NTRS)

Shin, Kang G.; Dolter, James W.

1989-01-01

Two techniques that provide a compromise between the high time overhead in maintaining synchronous voting and the difficulty of combining results in asynchronous voting are proposed. These techniques are specifically suited for real-time applications with a single-source/single-sink structure that need instantaneous error masking. They provide a compromise between a tightly synchronized system in which the synchronization overhead can be quite high, and an asynchronous system which lacks suitable algorithms for combining the output data. Both quorum-majority voting (QMV) and compare-majority voting (CMV) are most applicable to distributed real-time systems with single-source/single-sink tasks. All real-time systems eventually have to resolve their outputs into a single action at some stage. The development of the advanced information processing system (AIPS) and other similar systems serve to emphasize the importance of these techniques. Time bounds suggest that it is possible to reduce the overhead for quorum-majority voting to below that for synchronous voting. All the bounds assume that the computation phase is nonpreemptive and that there is no multitasking.
Optical transmission testing based on asynchronous sampling techniques: images analysis containing chromatic dispersion using convolutional neural network

NASA Astrophysics Data System (ADS)

Mrozek, T.; Perlicki, K.; Tajmajer, T.; Wasilewski, P.

2017-08-01

The article presents an image analysis method, obtained from an asynchronous delay tap sampling (ADTS) technique, which is used for simultaneous monitoring of various impairments occurring in the physical layer of the optical network. The ADTS method enables the visualization of the optical signal in the form of characteristics (so called phase portraits) that change their shape under the influence of impairments such as chromatic dispersion, polarization mode dispersion and ASE noise. Using this method, a simulation model was built with OptSim 4.0. After the simulation study, data were obtained in the form of images that were further analyzed using the convolutional neural network algorithm. The main goal of the study was to train a convolutional neural network to recognize the selected impairment (distortion); then to test its accuracy and estimate the impairment for the selected set of test images. The input data consisted of processed binary images in the form of two-dimensional matrices, with the position of the pixel. This article focuses only on the analysis of images containing chromatic dispersion.
Sensor Fusion, Prognostics, Diagnostics and Failure Mode Control for Complex Aerospace Systems

DTIC Science & Technology

2010-10-01

algorithm and to then tune the candidates individually using known metaheuristics . As will be...parallel. The result of this arrangement is that the processing is a form that is analogous to standard parallel genetic algorithms , and as such...search algorithm then uses the hybrid of fitness data to rank the results. The ETRAS controller is developed using pre-selection, showing that a
Parallel AFSA algorithm accelerating based on MIC architecture

NASA Astrophysics Data System (ADS)

Zhou, Junhao; Xiao, Hong; Huang, Yifan; Li, Yongzhao; Xu, Yuanrui

2017-05-01

Analysis AFSA past for solving the traveling salesman problem, the algorithm efficiency is often a big problem, and the algorithm processing method, it does not fully responsive to the characteristics of the traveling salesman problem to deal with, and therefore proposes a parallel join improved AFSA process. The simulation with the current TSP known optimal solutions were analyzed, the results showed that the AFSA iterations improved less, on the MIC cards doubled operating efficiency, efficiency significantly.
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2014-10-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Real-time implementation of optimized maximum noise fraction transform for feature extraction of hyperspectral images

NASA Astrophysics Data System (ADS)

Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Zhao, Haina; Li, Jun

2014-01-01

We present a parallel implementation of the optimized maximum noise fraction (G-OMNF) transform algorithm for feature extraction of hyperspectral images on commodity graphics processing units (GPUs). The proposed approach explored the algorithm data-level concurrency and optimized the computing flow. We first defined a three-dimensional grid, in which each thread calculates a sub-block data to easily facilitate the spatial and spectral neighborhood data searches in noise estimation, which is one of the most important steps involved in OMNF. Then, we optimized the processing flow and computed the noise covariance matrix before computing the image covariance matrix to reduce the original hyperspectral image data transmission. These optimization strategies can greatly improve the computing efficiency and can be applied to other feature extraction algorithms. The proposed parallel feature extraction algorithm was implemented on an Nvidia Tesla GPU using the compute unified device architecture and basic linear algebra subroutines library. Through the experiments on several real hyperspectral images, our GPU parallel implementation provides a significant speedup of the algorithm compared with the CPU implementation, especially for highly data parallelizable and arithmetically intensive algorithm parts, such as noise estimation. In order to further evaluate the effectiveness of G-OMNF, we used two different applications: spectral unmixing and classification for evaluation. Considering the sensor scanning rate and the data acquisition time, the proposed parallel implementation met the on-board real-time feature extraction.
Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation

NASA Astrophysics Data System (ADS)

Wu, Baodong; Li, Shigang; Zhang, Yunquan; Nie, Ningming

2017-02-01

The parallel Kinetic Monte Carlo (KMC) algorithm based on domain decomposition has been widely used in large-scale physical simulations. However, the communication overhead of the parallel KMC algorithm is critical, and severely degrades the overall performance and scalability. In this paper, we present a hybrid optimization strategy to reduce the communication overhead for the parallel KMC simulations. We first propose a communication aggregation algorithm to reduce the total number of messages and eliminate the communication redundancy. Then, we utilize the shared memory to reduce the memory copy overhead of the intra-node communication. Finally, we optimize the communication scheduling using the neighborhood collective operations. We demonstrate the scalability and high performance of our hybrid optimization strategy by both theoretical and experimental analysis. Results show that the optimized KMC algorithm exhibits better performance and scalability than the well-known open-source library-SPPARKS. On 32-node Xeon E5-2680 cluster (total 640 cores), the optimized algorithm reduces the communication time by 24.8% compared with SPPARKS.
Parallelization of Nullspace Algorithm for the computation of metabolic pathways

PubMed Central

Jevremović, Dimitrije; Trinh, Cong T.; Srienc, Friedrich; Sosa, Carlos P.; Boley, Daniel

2011-01-01

Elementary mode analysis is a useful metabolic pathway analysis tool in understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady state conditions. However, computation of the elementary modes of a genome- scale metabolic network with 100–1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace Algorithm. In this work, we develop a distributed memory parallelization of the Nullspace Algorithm to handle efficiently the computation of the elementary modes of a large metabolic network. We give an implementation in C++ language with the support of MPI library functions for the parallel communication. Our proposed algorithm is accompanied with an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute-nodes and specific communication patterns to reduce the communication overhead and improve efficiency. PMID:22058581
An efficient parallel-processing method for transposing large matrices in place.

PubMed

Portnoff, M R

1999-01-01

We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Gate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.

Algorithms and programming tools for image processing on the MPP:3

NASA Technical Reports Server (NTRS)

Reeves, Anthony P.

1987-01-01

This is the third and final report on the work done for NASA Grant 5-403 on Algorithms and Programming Tools for Image Processing on the MPP:3. All the work done for this grant is summarized in the introduction. Work done since August 1986 is reported in detail. Research for this grant falls under the following headings: (1) fundamental algorithms for the MPP; (2) programming utilities for the MPP; (3) the Parallel Pascal Development System; and (4) performance analysis. In this report, the results of two efforts are reported: region growing, and performance analysis of important characteristic algorithms. In each case, timing results from MPP implementations are included. A paper is included in which parallel algorithms for region growing on the MPP is discussed. These algorithms permit different sized regions to be merged in parallel. Details on the implementation and peformance of several important MPP algorithms are given. These include a number of standard permutations, the FFT, convolution, arbitrary data mappings, image warping, and pyramid operations, all of which have been implemented on the MPP. The permutation and image warping functions have been included in the standard development system library.
Parallel computer vision

DOE Office of Scientific and Technical Information (OSTI.GOV)

Uhr, L.

1987-01-01

This book is written by research scientists involved in the development of massively parallel, but hierarchically structured, algorithms, architectures, and programs for image processing, pattern recognition, and computer vision. The book gives an integrated picture of the programs and algorithms that are being developed, and also of the multi-computer hardware architectures for which these systems are designed.
A three-dimensional spectral algorithm for simulations of transition and turbulence

NASA Technical Reports Server (NTRS)

Zang, T. A.; Hussaini, M. Y.

1985-01-01

A spectral algorithm for simulating three dimensional, incompressible, parallel shear flows is described. It applies to the channel, to the parallel boundary layer, and to other shear flows with one wall bounded and two periodic directions. Representative applications to the channel and to the heated boundary layer are presented.
Assignment Of Finite Elements To Parallel Processors

NASA Technical Reports Server (NTRS)

Salama, Moktar A.; Flower, Jon W.; Otto, Steve W.

1990-01-01

Elements assigned approximately optimally to subdomains. Mapping algorithm based on simulated-annealing concept used to minimize approximate time required to perform finite-element computation on hypercube computer or other network of parallel data processors. Mapping algorithm needed when shape of domain complicated or otherwise not obvious what allocation of elements to subdomains minimizes cost of computation.
Fast parallel molecular algorithms for DNA-based computation: solving the elliptic curve discrete logarithm problem over GF2.

PubMed

Li, Kenli; Zou, Shuting; Xv, Jin

2008-01-01

Elliptic curve cryptographic algorithms convert input data to unrecognizable encryption and the unrecognizable data back again into its original decrypted form. The security of this form of encryption hinges on the enormous difficulty that is required to solve the elliptic curve discrete logarithm problem (ECDLP), especially over GF(2(n)), n in Z+. This paper describes an effective method to find solutions to the ECDLP by means of a molecular computer. We propose that this research accomplishment would represent a breakthrough for applied biological computation and this paper demonstrates that in principle this is possible. Three DNA-based algorithms: a parallel adder, a parallel multiplier, and a parallel inverse over GF(2(n)) are described. The biological operation time of all of these algorithms is polynomial with respect to n. Considering this analysis, cryptography using a public key might be less secure. In this respect, a principal contribution of this paper is to provide enhanced evidence of the potential of molecular computing to tackle such ambitious computations.
Fast Parallel Molecular Algorithms for DNA-Based Computation: Solving the Elliptic Curve Discrete Logarithm Problem over GF(2n)

PubMed Central

Li, Kenli; Zou, Shuting; Xv, Jin

2008-01-01

Elliptic curve cryptographic algorithms convert input data to unrecognizable encryption and the unrecognizable data back again into its original decrypted form. The security of this form of encryption hinges on the enormous difficulty that is required to solve the elliptic curve discrete logarithm problem (ECDLP), especially over GF(2n), n ∈ Z+. This paper describes an effective method to find solutions to the ECDLP by means of a molecular computer. We propose that this research accomplishment would represent a breakthrough for applied biological computation and this paper demonstrates that in principle this is possible. Three DNA-based algorithms: a parallel adder, a parallel multiplier, and a parallel inverse over GF(2n) are described. The biological operation time of all of these algorithms is polynomial with respect to n. Considering this analysis, cryptography using a public key might be less secure. In this respect, a principal contribution of this paper is to provide enhanced evidence of the potential of molecular computing to tackle such ambitious computations. PMID:18431451
The Complexity of Parallel Algorithms,

DTIC Science & Technology

1985-11-01

programns have been written for se(luiential coiipn ters. Many p~eop~le want coimp ~ilers dihal. will c(nimpile t he, code for parallel machines, to avoid...between two vertices. We also rely on parallel algorithms for maintaining data structures and manipulating graphs. We do not go into the details of these...Jpatlis and maintain connected coimp ~onents. The routine is: - 35 .- ExtendPath(r, Q, V) begin P +-0; s 4- while there is a path in V - P from s to a vertex
Options for Parallelizing a Planning and Scheduling Algorithm

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.

2011-01-01

Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.
Parallel rendering

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.

1995-01-01

This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
Developing Information Power Grid Based Algorithms and Software

NASA Technical Reports Server (NTRS)

Dongarra, Jack

1998-01-01

This exploratory study initiated our effort to understand performance modeling on parallel systems. The basic goal of performance modeling is to understand and predict the performance of a computer program or set of programs on a computer system. Performance modeling has numerous applications, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems. Our work lays the basis for the construction of parallel libraries that allow for the reconstruction of application codes on several distinct architectures so as to assure performance portability. Following our strategy, once the requirements of applications are well understood, one can then construct a library in a layered fashion. The top level of this library will consist of architecture-independent geometric, numerical, and symbolic algorithms that are needed by the sample of applications. These routines should be written in a language that is portable across the targeted architectures.
Automated Vectorization of Decision-Based Algorithms

NASA Technical Reports Server (NTRS)

James, Mark

2006-01-01

Virtually all existing vectorization algorithms are designed to only analyze the numeric properties of an algorithm and distribute those elements across multiple processors. This advances the state of the practice because it is the only known system, at the time of this reporting, that takes high-level statements and analyzes them for their decision properties and converts them to a form that allows them to automatically be executed in parallel. The software takes a high-level source program that describes a complex decision- based condition and rewrites it as a disjunctive set of component Boolean relations that can then be executed in parallel. This is important because parallel architectures are becoming more commonplace in conventional systems and they have always been present in NASA flight systems. This technology allows one to take existing condition-based code and automatically vectorize it so it naturally decomposes across parallel architectures.
3D Data Denoising via Nonlocal Means Filter by Using Parallel GPU Strategies

PubMed Central

Cuomo, Salvatore; De Michele, Pasquale; Piccialli, Francesco

2014-01-01

Nonlocal Means (NLM) algorithm is widely considered as a state-of-the-art denoising filter in many research fields. Its high computational complexity leads researchers to the development of parallel programming approaches and the use of massively parallel architectures such as the GPUs. In the recent years, the GPU devices had led to achieving reasonable running times by filtering, slice-by-slice, and 3D datasets with a 2D NLM algorithm. In our approach we design and implement a fully 3D NonLocal Means parallel approach, adopting different algorithm mapping strategies on GPU architecture and multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the usability of our approach in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising. PMID:25045397
Asynchronous ice lobe retreat and glacial Lake Bascom: Deglaciation of the Hoosic and Vermont valleys, southwestern Vermont

DOE Office of Scientific and Technical Information (OSTI.GOV)

Small, E.; Desimone, D.

Deglaciation of the Hoosic River drainage basin in southwestern Vermont was more complex than previously described. Detailed surficial mapping, stratigraphic relationships, and terrace levels/delta elevations reveal new details in the chronology of glacial Lake Bascom: (1) a pre-Wisconsinan proglacial lake was present in a similar position to Lake Bascom as ice advanced: (2) the northern margin of 275m (900 ft) glacial Lake Bascom extended 10 km up the Vermont Valley; (3) the 215m (705 ft) Bascom level was stable and long lived; (4) intermediate water planes existed between 215m and 190m (625 ft) levels; and (5) a separate ice tonguemore » existed in Shaftsbury Hollow damming a small glacial lake, here named glacial Lake Emmons. This information is used to correlate ice margins to different lake levels. Distance of ice margin retreat during a lake level can be measured. Lake levels are then used as control points on a Lake Bascom relative time line to compare rate of retreat of different ice tongues. Correlation of ice margins to Bascom levels indicates ice retreat was asynchronous between nearby tongues in southwestern Vermont. The Vermont Valley ice tongue retreated between two and four times faster than the Hoosic Valley tongue during the Bascom 275m level. Rate of retreat of the Vermont Valley tongue slowed to one-half of the Hoosic tongue during the 215m--190m lake levels. Factors responsible for varying rates of retreat are subglacial bedrock gradient, proximity to the Hudson-Champlain lobe, and the presence of absence of a calving margins. Asynchronous retreat produced splayed ice margins in southwestern Vermont. Findings from this study do not support the model of parallel, synchronous retreat proposed by many workers for this region.« less
Scalable asynchronous execution of cellular automata

NASA Astrophysics Data System (ADS)

Folino, Gianluigi; Giordano, Andrea; Mastroianni, Carlo

2016-10-01

The performance and scalability of cellular automata, when executed on parallel/distributed machines, are limited by the necessity of synchronizing all the nodes at each time step, i.e., a node can execute only after the execution of the previous step at all the other nodes. However, these synchronization requirements can be relaxed: a node can execute one step after synchronizing only with the adjacent nodes. In this fashion, different nodes can execute different time steps. This can be a notable advantageous in many novel and increasingly popular applications of cellular automata, such as smart city applications, simulation of natural phenomena, etc., in which the execution times can be different and variable, due to the heterogeneity of machines and/or data and/or executed functions. Indeed, a longer execution time at a node does not slow down the execution at all the other nodes but only at the neighboring nodes. This is particularly advantageous when the nodes that act as bottlenecks vary during the application execution. The goal of the paper is to analyze the benefits that can be achieved with the described asynchronous implementation of cellular automata, when compared to the classical all-to-all synchronization pattern. The performance and scalability have been evaluated through a Petri net model, as this model is very useful to represent the synchronization barrier among nodes. We examined the usual case in which the territory is partitioned into a number of regions, and the computation associated with a region is assigned to a computing node. We considered both the cases of mono-dimensional and two-dimensional partitioning. The results show that the advantage obtained through the asynchronous execution, when compared to the all-to-all synchronous approach is notable, and it can be as large as 90% in terms of speedup.
ATM Quality of Service Tests for Digitized Video Using ATM Over Satellite: Laboratory Tests

NASA Technical Reports Server (NTRS)

Ivancic, William D.; Brooks, David E.; Frantz, Brian D.

1997-01-01

A digitized video application was used to help determine minimum quality of service parameters for asynchronous transfer mode (ATM) over satellite. For these tests, binomially distributed and other errors were digitally inserted in an intermediate frequency link via a satellite modem and a commercial gaussian noise generator. In this paper, the relation- ship between the ATM cell error and cell loss parameter specifications is discussed with regard to this application. In addition, the video-encoding algorithms, test configurations, and results are presented in detail.
The Data Transfer Kit: A geometric rendezvous-based tool for multiphysics data transfer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Slattery, S. R.; Wilson, P. P. H.; Pawlowski, R. P.

2013-07-01

The Data Transfer Kit (DTK) is a software library designed to provide parallel data transfer services for arbitrary physics components based on the concept of geometric rendezvous. The rendezvous algorithm provides a means to geometrically correlate two geometric domains that may be arbitrarily decomposed in a parallel simulation. By repartitioning both domains such that they have the same geometric domain on each parallel process, efficient and load balanced search operations and data transfer can be performed at a desirable algorithmic time complexity with low communication overhead relative to other types of mapping algorithms. With the increased development efforts in multiphysicsmore » simulation and other multiple mesh and geometry problems, generating parallel topology maps for transferring fields and other data between geometric domains is a common operation. The algorithms used to generate parallel topology maps based on the concept of geometric rendezvous as implemented in DTK are described with an example using a conjugate heat transfer calculation and thermal coupling with a neutronics code. In addition, we provide the results of initial scaling studies performed on the Jaguar Cray XK6 system at Oak Ridge National Laboratory for a worse-case-scenario problem in terms of algorithmic complexity that shows good scaling on 0(1 x 104) cores for topology map generation and excellent scaling on 0(1 x 105) cores for the data transfer operation with meshes of O(1 x 109) elements. (authors)« less
Parallel replica dynamics with a heterogeneous distribution of barriers: Application to n-hexadecane pyrolysis

NASA Astrophysics Data System (ADS)

Kum, Oyeon; Dickson, Brad M.; Stuart, Steven J.; Uberuaga, Blas P.; Voter, Arthur F.

2004-11-01

Parallel replica dynamics simulation methods appropriate for the simulation of chemical reactions in molecular systems with many conformational degrees of freedom have been developed and applied to study the microsecond-scale pyrolysis of n-hexadecane in the temperature range of 2100-2500 K. The algorithm uses a transition detection scheme that is based on molecular topology, rather than energetic basins. This algorithm allows efficient parallelization of small systems even when using more processors than particles (in contrast to more traditional parallelization algorithms), and even when there are frequent conformational transitions (in contrast to previous implementations of the parallel replica algorithm). The parallel efficiency for pyrolysis initiation reactions was over 90% on 61 processors for this 50-atom system. The parallel replica dynamics technique results in reaction probabilities that are statistically indistinguishable from those obtained from direct molecular dynamics, under conditions where both are feasible, but allows simulations at temperatures as much as 1000 K lower than direct molecular dynamics simulations. The rate of initiation displayed Arrhenius behavior over the entire temperature range, with an activation energy and frequency factor of Ea=79.7 kcal/mol and log A/s-1=14.8, respectively, in reasonable agreement with experiment and empirical kinetic models. Several interesting unimolecular reaction mechanisms were observed in simulations of the chain propagation reactions above 2000 K, which are not included in most coarse-grained kinetic models. More studies are needed in order to determine whether these mechanisms are experimentally relevant, or specific to the potential energy surface used.
Learning control system design based on 2-D theory - An application to parallel link manipulator

NASA Technical Reports Server (NTRS)

Geng, Z.; Carroll, R. L.; Lee, J. D.; Haynes, L. H.

1990-01-01

An approach to iterative learning control system design based on two-dimensional system theory is presented. A two-dimensional model for the iterative learning control system which reveals the connections between learning control systems and two-dimensional system theory is established. A learning control algorithm is proposed, and the convergence of learning using this algorithm is guaranteed by two-dimensional stability. The learning algorithm is applied successfully to the trajectory tracking control problem for a parallel link robot manipulator. The excellent performance of this learning algorithm is demonstrated by the computer simulation results.
Algorithm implementation on the Navier-Stokes computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Krist, S.E.; Zang, T.A.

1987-03-01

The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Algorithm implementation on the Navier-Stokes computer

NASA Technical Reports Server (NTRS)

Krist, Steven E.; Zang, Thomas A.

1987-01-01

The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.