asynchronous parallel iterative: Topics by Science.gov

Sample records for asynchronous parallel iterative

A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations

NASA Technical Reports Server (NTRS)

Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw

2005-01-01

A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel e ciency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Exploiting parallel computing with limited program changes using a network of microcomputers

NASA Technical Reports Server (NTRS)

Rogers, J. L., Jr.; Sobieszczanski-Sobieski, J.

1985-01-01

Network computing and multiprocessor computers are two discernible trends in parallel processing. The computational behavior of an iterative distributed process in which some subtasks are completed later than others because of an imbalance in computational requirements is of significant interest. The effects of asynchronus processing was studied. A small existing program was converted to perform finite element analysis by distributing substructure analysis over a network of four Apple IIe microcomputers connected to a shared disk, simulating a parallel computer. The substructure analysis uses an iterative, fully stressed, structural resizing procedure. A framework of beams divided into three substructures is used as the finite element model. The effects of asynchronous processing on the convergence of the design variables are determined by not resizing particular substructures on various iterations.
A wavelet approach to binary blackholes with asynchronous multitasking

NASA Astrophysics Data System (ADS)

Lim, Hyun; Hirschmann, Eric; Neilsen, David; Anderson, Matthew; Debuhr, Jackson; Zhang, Bo

2016-03-01

Highly accurate simulations of binary black holes and neutron stars are needed to address a variety of interesting problems in relativistic astrophysics. We present a new method for the solving the Einstein equations (BSSN formulation) using iterated interpolating wavelets. Wavelet coefficients provide a direct measure of the local approximation error for the solution and place collocation points that naturally adapt to features of the solution. Further, they exhibit exponential convergence on unevenly spaced collection points. The parallel implementation of the wavelet simulation framework presented here deviates from conventional practice in combining multi-threading with a form of message-driven computation sometimes referred to as asynchronous multitasking.
DOE Office of Scientific and Technical Information (OSTI.GOV)

D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning.
SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

1990-01-01

Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chow, Edmond

Solving sparse problems is at the core of many DOE computational science applications. We focus on the challenge of developing sparse algorithms that can fully exploit the parallelism in extreme-scale computing systems, in particular systems with massive numbers of cores per node. Our approach is to express a sparse matrix factorization as a large number of bilinear constraint equations, and then solving these equations via an asynchronous iterative method. The unknowns in these equations are the matrix entries of the factorization that is desired.
The fusion code XGC: Enabling kinetic study of multi-scale edge turbulent transport in ITER [Book Chapter

DOE Office of Scientific and Technical Information (OSTI.GOV)

D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas

The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning for balancing computational work in pushing particlesmore » and in grid related work, scalable and accurate discretization algorithms for non-linear Coulomb collisions, and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership class computer.« less
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shi, Xuanhua; Luo, Xuan; Liang, Junling

GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weightmore » asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of realworld graph coloring cases. We find that a majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP and CC). On all the tested applications and datasets, Frog is able to significantly outperform existing GPU-based graph processing systems except Gunrock and MapGraph. MapGraph gets better performance than Frog when running BFS on RoadNet-CA. The comparison between Gunrock and Frog is inconclusive. Frog can outperform Gunrock more than 1.04X when running PageRank and SSSP, while the advantage of Frog is not obvious when running BFS and CC on some datasets especially for RoadNet-CA.« less
Myria: Scalable Analytics as a Service

NASA Astrophysics Data System (ADS)

Howe, B.; Halperin, D.; Whitaker, A.

2014-12-01

At the UW eScience Institute, we're working to empower non-experts, especially in the sciences, to write and use data-parallel algorithms. To this end, we are building Myria, a web-based platform for scalable analytics and data-parallel programming. Myria's internal model of computation is the relational algebra extended with iteration, such that every program is inherently data-parallel, just as every query in a database is inherently data-parallel. But unlike databases, iteration is a first class concept, allowing us to express machine learning tasks, graph traversal tasks, and more. Programs can be expressed in a number of languages and can be executed on a number of execution environments, but we emphasize a particular language called MyriaL that supports both imperative and declarative styles and a particular execution engine called MyriaX that uses an in-memory column-oriented representation and asynchronous iteration. We deliver Myria over the web as a service, providing an editor, performance analysis tools, and catalog browsing features in a single environment. We find that this web-based "delivery vector" is critical in reaching non-experts: they are insulated from irrelevant effort technical work associated with installation, configuration, and resource management. The MyriaX backend, one of several execution runtimes we support, is a main-memory, column-oriented, RDBMS-on-the-worker system that supports cyclic data flows as a first-class citizen and has been shown to outperform competitive systems on 100-machine cluster sizes. I will describe the Myria system, give a demo, and present some new results in large-scale oceanographic microbiology.
A Parallel Fast Sweeping Method for the Eikonal Equation

NASA Astrophysics Data System (ADS)

Baker, B.

2017-12-01

Recently, there has been an exciting emergence of probabilistic methods for travel time tomography. Unlike gradient-based optimization strategies, probabilistic tomographic methods are resistant to becoming trapped in a local minimum and provide a much better quantification of parameter resolution than, say, appealing to ray density or performing checkerboard reconstruction tests. The benefits associated with random sampling methods however are only realized by successive computation of predicted travel times in, potentially, strongly heterogeneous media. To this end this abstract is concerned with expediting the solution of the Eikonal equation. While many Eikonal solvers use a fast marching method, the proposed solver will use the iterative fast sweeping method because the eight fixed sweep orderings in each iteration are natural targets for parallelization. To reduce the number of iterations and grid points required the high-accuracy finite difference stencil of Nobel et al., 2014 is implemented. A directed acyclic graph (DAG) is created with a priori knowledge of the sweep ordering and finite different stencil. By performing a topological sort of the DAG sets of independent nodes are identified as candidates for concurrent updating. Additionally, the proposed solver will also address scalability during earthquake relocation, a necessary step in local and regional earthquake tomography and a barrier to extending probabilistic methods from active source to passive source applications, by introducing an asynchronous parallel forward solve phase for all receivers in the network. Synthetic examples using the SEG over-thrust model will be presented.
Quasi-perfect FIFO: Synchronous or asynchronous with application in controller design for the UNICON laser memory. [digital memory and buffer storage

NASA Technical Reports Server (NTRS)

Lim, R. S.

1974-01-01

The first-in-first-out memory buffer (FIFO), is an elastic digital memory whose main application is in data buffering between devices operating at different rates. Data written into the top is moved autonomously down toward the bottom of the FIFO to the lowest unoccupied location, and data read from the bottom of the FIFO will cause data from the top to move autonomously down toward the bottom. The FIFO is available in MOS LSI asynchronous form with data rate in the 1 MHz region. The FIFO described yields a simple high-speed iterative implementation, either synchronous of asynchronous. Because of this simple iterative structure, the FIFO is expandable in both number of words and bits per word, and it is attractive from the viewpoint of integrated-circuit production. For the synchronous FIFO, a model was built and successfully used in the controller for the UNICON laser memory. For the asynchronous FIFO, a model was built and also successfully used in a high-performance magnetic tape controller.
Parallel, Asynchronous Executive (PAX): System concepts, facilities, and architecture

NASA Technical Reports Server (NTRS)

Jones, W. H.

1983-01-01

The Parallel, Asynchronous Executive (PAX) is a software operating system simulation that allows many computers to work on a single problem at the same time. PAX is currently implemented on a UNIVAC 1100/42 computer system. Independent UNIVAC runstreams are used to simulate independent computers. Data are shared among independent UNIVAC runstreams through shared mass-storage files. PAX has achieved the following: (1) applied several computing processes simultaneously to a single, logically unified problem; (2) resolved most parallel processor conflicts by careful work assignment; (3) resolved by means of worker requests to PAX all conflicts not resolved by work assignment; (4) provided fault isolation and recovery mechanisms to meet the problems of an actual parallel, asynchronous processing machine. Additionally, one real-life problem has been constructed for the PAX environment. This is CASPER, a collection of aerodynamic and structural dynamic problem simulation routines. CASPER is not discussed in this report except to provide examples of parallel-processing techniques.
Applications of New Surrogate Global Optimization Algorithms including Efficient Synchronous and Asynchronous Parallelism for Calibration of Expensive Nonlinear Geophysical Simulation Models.

NASA Astrophysics Data System (ADS)

Shoemaker, C. A.; Pang, M.; Akhtar, T.; Bindel, D.

2016-12-01

New parallel surrogate global optimization algorithms are developed and applied to objective functions that are expensive simulations (possibly with multiple local minima). The algorithms can be applied to most geophysical simulations, including those with nonlinear partial differential equations. The optimization does not require simulations be parallelized. Asynchronous (and synchronous) parallel execution is available in the optimization toolbox "pySOT". The parallel algorithms are modified from serial to eliminate fine grained parallelism. The optimization is computed with open source software pySOT, a Surrogate Global Optimization Toolbox that allows user to pick the type of surrogate (or ensembles), the search procedure on surrogate, and the type of parallelism (synchronous or asynchronous). pySOT also allows the user to develop new algorithms by modifying parts of the code. In the applications here, the objective function takes up to 30 minutes for one simulation, and serial optimization can take over 200 hours. Results from Yellowstone (NSF) and NCSS (Singapore) supercomputers are given for groundwater contaminant hydrology simulations with applications to model parameter estimation and decontamination management. All results are compared with alternatives. The first results are for optimization of pumping at many wells to reduce cost for decontamination of groundwater at a superfund site. The optimization runs with up to 128 processors. Superlinear speed up is obtained for up to 16 processors, and efficiency with 64 processors is over 80%. Each evaluation of the objective function requires the solution of nonlinear partial differential equations to describe the impact of spatially distributed pumping and model parameters on model predictions for the spatial and temporal distribution of groundwater contaminants. The second application uses an asynchronous parallel global optimization for groundwater quality model calibration. The time for a single objective function evaluation varies unpredictably, so efficiency is improved with asynchronous parallel calculations to improve load balancing. The third application (done at NCSS) incorporates new global surrogate multi-objective parallel search algorithms into pySOT and applies it to a large watershed calibration problem.
A massively asynchronous, parallel brain.

PubMed

Zeki, Semir

2015-05-19

Whether the visual brain uses a parallel or a serial, hierarchical, strategy to process visual signals, the end result appears to be that different attributes of the visual scene are perceived asynchronously--with colour leading form (orientation) by 40 ms and direction of motion by about 80 ms. Whatever the neural root of this asynchrony, it creates a problem that has not been properly addressed, namely how visual attributes that are perceived asynchronously over brief time windows after stimulus onset are bound together in the longer term to give us a unified experience of the visual world, in which all attributes are apparently seen in perfect registration. In this review, I suggest that there is no central neural clock in the (visual) brain that synchronizes the activity of different processing systems. More likely, activity in each of the parallel processing-perceptual systems of the visual brain is reset independently, making of the brain a massively asynchronous organ, just like the new generation of more efficient computers promise to be. Given the asynchronous operations of the brain, it is likely that the results of activities in the different processing-perceptual systems are not bound by physiological interactions between cells in the specialized visual areas, but post-perceptually, outside the visual brain.
An improved parallel fuzzy connected image segmentation method based on CUDA.

PubMed

Wang, Liansheng; Li, Dong; Huang, Shaohui

2016-05-12

Fuzzy connectedness method (FC) is an effective method for extracting fuzzy objects from medical images. However, when FC is applied to large medical image datasets, its running time will be greatly expensive. Therefore, a parallel CUDA version of FC (CUDA-kFOE) was proposed by Ying et al. to accelerate the original FC. Unfortunately, CUDA-kFOE does not consider the edges between GPU blocks, which causes miscalculation of edge points. In this paper, an improved algorithm is proposed by adding a correction step on the edge points. The improved algorithm can greatly enhance the calculation accuracy. In the improved method, an iterative manner is applied. In the first iteration, the affinity computation strategy is changed and a look up table is employed for memory reduction. In the second iteration, the error voxels because of asynchronism are updated again. Three different CT sequences of hepatic vascular with different sizes were used in the experiments with three different seeds. NVIDIA Tesla C2075 is used to evaluate our improved method over these three data sets. Experimental results show that the improved algorithm can achieve a faster segmentation compared to the CPU version and higher accuracy than CUDA-kFOE. The calculation results were consistent with the CPU version, which demonstrates that it corrects the edge point calculation error of the original CUDA-kFOE. The proposed method has a comparable time cost and has less errors compared to the original CUDA-kFOE as demonstrated in the experimental results. In the future, we will focus on automatic acquisition method and automatic processing.
A Binary Array Asynchronous Sorting Algorithm with Using Petri Nets

NASA Astrophysics Data System (ADS)

Voevoda, A. A.; Romannikov, D. O.

2017-01-01

Nowadays the tasks of computations speed-up and/or their optimization are actual. Among the approaches on how to solve these tasks, a method applying approaches of parallelization and asynchronization to a sorting algorithm is considered in the paper. The sorting methods are ones of elementary methods and they are used in a huge amount of different applications. In the paper, we offer a method of an array sorting that based on a division into a set of independent adjacent pairs of numbers and their parallel and asynchronous comparison. And this one distinguishes the offered method from the traditional sorting algorithms (like quick sorting, merge sorting, insertion sorting and others). The algorithm is implemented with the use of Petri nets, like the most suitable tool for an asynchronous systems description.
Multiple asynchronous stimulus- and task-dependent hierarchies (STDH) within the visual brain's parallel processing systems.

PubMed

Zeki, Semir

2016-10-01

Results from a variety of sources, some many years old, lead ineluctably to a re-appraisal of the twin strategies of hierarchical and parallel processing used by the brain to construct an image of the visual world. Contrary to common supposition, there are at least three 'feed-forward' anatomical hierarchies that reach the primary visual cortex (V1) and the specialized visual areas outside it, in parallel. These anatomical hierarchies do not conform to the temporal order with which visual signals reach the specialized visual areas through V1. Furthermore, neither the anatomical hierarchies nor the temporal order of activation through V1 predict the perceptual hierarchies. The latter shows that we see (and become aware of) different visual attributes at different times, with colour leading form (orientation) and directional visual motion, even though signals from fast-moving, high-contrast stimuli are among the earliest to reach the visual cortex (of area V5). Parallel processing, on the other hand, is much more ubiquitous than commonly supposed but is subject to a barely noticed but fundamental aspect of brain operations, namely that different parallel systems operate asynchronously with respect to each other and reach perceptual endpoints at different times. This re-assessment leads to the conclusion that the visual brain is constituted of multiple, parallel and asynchronously operating task- and stimulus-dependent hierarchies (STDH); which of these parallel anatomical hierarchies have temporal and perceptual precedence at any given moment is stimulus and task related, and dependent on the visual brain's ability to undertake multiple operations asynchronously. © 2016 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Global interrupt and barrier networks

DOEpatents

Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E; Heidelberger, Philip; Kopcsay, Gerard V.; Steinmacher-Burow, Burkhard D.; Takken, Todd E.

2008-10-28

A system and method for generating global asynchronous signals in a computing structure. Particularly, a global interrupt and barrier network is implemented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes of a computing structure in accordance with a processing algorithm; and includes the physical interconnecting of the processing nodes for communicating the global interrupt and barrier signals to the elements via low-latency paths. The global asynchronous signals respectively initiate interrupt and barrier operations at the processing nodes at times selected for optimizing performance of the processing algorithms. In one embodiment, the global interrupt and barrier network is implemented in a scalable, massively parallel supercomputing device structure comprising a plurality of processing nodes interconnected by multiple independent networks, with each node including one or more processing elements for performing computation or communication activity as required when performing parallel algorithm operations. One multiple independent network includes a global tree network for enabling high-speed global tree communications among global tree network nodes or sub-trees thereof. The global interrupt and barrier network may operate in parallel with the global tree network for providing global asynchronous sideband signals.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS

PubMed Central

Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.

2014-01-01

Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
Multiple grid problems on concurrent-processing computers

NASA Technical Reports Server (NTRS)

Eberhardt, D. S.; Baganoff, D.

1986-01-01

Three computer codes were studied which make use of concurrent processing computer architectures in computational fluid dynamics (CFD). The three parallel codes were tested on a two processor multiple-instruction/multiple-data (MIMD) facility at NASA Ames Research Center, and are suggested for efficient parallel computations. The first code is a well-known program which makes use of the Beam and Warming, implicit, approximate factored algorithm. This study demonstrates the parallelism found in a well-known scheme and it achieved speedups exceeding 1.9 on the two processor MIMD test facility. The second code studied made use of an embedded grid scheme which is used to solve problems having complex geometries. The particular application for this study considered an airfoil/flap geometry in an incompressible flow. The scheme eliminates some of the inherent difficulties found in adapting approximate factorization techniques onto MIMD machines and allows the use of chaotic relaxation and asynchronous iteration techniques. The third code studied is an application of overset grids to a supersonic blunt body problem. The code addresses the difficulties encountered when using embedded grids on a compressible, and therefore nonlinear, problem. The complex numerical boundary system associated with overset grids is discussed and several boundary schemes are suggested. A boundary scheme based on the method of characteristics achieved the best results.

A massively asynchronous, parallel brain

PubMed Central

Zeki, Semir

2015-01-01

Whether the visual brain uses a parallel or a serial, hierarchical, strategy to process visual signals, the end result appears to be that different attributes of the visual scene are perceived asynchronously—with colour leading form (orientation) by 40 ms and direction of motion by about 80 ms. Whatever the neural root of this asynchrony, it creates a problem that has not been properly addressed, namely how visual attributes that are perceived asynchronously over brief time windows after stimulus onset are bound together in the longer term to give us a unified experience of the visual world, in which all attributes are apparently seen in perfect registration. In this review, I suggest that there is no central neural clock in the (visual) brain that synchronizes the activity of different processing systems. More likely, activity in each of the parallel processing-perceptual systems of the visual brain is reset independently, making of the brain a massively asynchronous organ, just like the new generation of more efficient computers promise to be. Given the asynchronous operations of the brain, it is likely that the results of activities in the different processing-perceptual systems are not bound by physiological interactions between cells in the specialized visual areas, but post-perceptually, outside the visual brain. PMID:25823871
Advancing Underwater Acoustic Communication for Autonomous Distributed Networks via Sparse Channel Sensing, Coding, and Navigation Support

DTIC Science & Technology

2012-09-30

Estimation Methods for Underwater OFDM 5) Two Iterative Receivers for Distributed MIMO - OFDM with Large Doppler Deviations. 6) Asynchronous Multiuser...multi-input multi-output ( MIMO ) OFDM is also pursued, where it is shown that the proposed hybrid initialization enables drastically improved receiver...are investigated. 5) Two Iterative Receivers for Distributed MIMO - OFDM with Large Doppler Deviations. This work studies a distributed system with
Barrier-breaking performance for industrial problems on the CRAY C916

DOE Office of Scientific and Technical Information (OSTI.GOV)

Graffunder, S.K.

1993-12-31

Nine applications, including third-party codes, were submitted to the Gordon Bell Prize committee showing the CRAY C916 supercomputer providing record-breaking time to solution for industrial problems in several disciplines. Performance was obtained by balancing raw hardware speed; effective use of large, real, shared memory; compiler vectorization and autotasking; hand optimization; asynchronous I/O techniques; and new algorithms. The highest GFLOPS performance for the submissions was 11.1 GFLOPS out of a peak advertised performance of 16 GFLOPS for the CRAY C916 system. One program achieved a 15.45 speedup from the compiler with just two hand-inserted directives to scope variables properly for themore » mathematical library. New I/O techniques hide tens of gigabytes of I/O behind parallel computations. Finally, new iterative solver algorithms have demonstrated times to solution on 1 CPU as high as 70 times faster than the best direct solvers.« less
On the utility of threads for data parallel programming

NASA Technical Reports Server (NTRS)

Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush

1995-01-01

Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily through the use of overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.
Bilingual asynchronous online discussion groups: design and delivery of an eLearning distance study module for nurse academics in a developing country.

PubMed

Lewis, Peter A; Mai, Van Anh Thi; Gray, Genevieve

2012-04-01

The advent of eLearning has seen online discussion forums widely used in both undergraduate and postgraduate nursing education. This paper reports an Australian university experience of design, delivery and redevelopment of a distance education module developed for Vietnamese nurse academics. The teaching experience of Vietnamese nurse academics is mixed and frequently limited. It was decided that the distance module should attempt to utilise the experience of senior Vietnamese nurse academics - asynchronous online discussion groups were used to facilitate this. Online discussion occurred in both Vietnamese and English and was moderated by an Australian academic working alongside a Vietnamese translator. This paper will discuss the design of an online learning environment for foreign correspondents, the resources and translation required to maximise the success of asynchronous online discussion groups, as well as the rationale of delivering complex content in a foreign language. While specifically addressing the first iteration of the first distance module designed, this paper will also address subsequent changes made for the second iteration of the module and comment on their success. While a translator is clearly a key component of success, the elements of simplicity and clarity combined with supportive online moderation must not be overlooked. Copyright © 2011 Elsevier Ltd. All rights reserved.
Parallel and Preemptable Dynamically Dimensioned Search Algorithms for Single and Multi-objective Optimization in Water Resources

NASA Astrophysics Data System (ADS)

Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.

2015-12-01

We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption functions to terminate simulation model runs early, prior to completely simulating the model calibration period for example, when intermediate results indicate the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model preemption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations as these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich as well as MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration and in many cases linear or near linear speedups are observed.
An iterative method for systems of nonlinear hyperbolic equations

NASA Technical Reports Server (NTRS)

Scroggs, Jeffrey S.

1989-01-01

An iterative algorithm for the efficient solution of systems of nonlinear hyperbolic equations is presented. Parallelism is evident at several levels. In the formation of the iteration, the equations are decoupled, thereby providing large grain parallelism. Parallelism may also be exploited within the solves for each equation. Convergence of the interation is established via a bounding function argument. Experimental results in two-dimensions are presented.
Probabilistic Cellular Automata

PubMed Central

Agapie, Alexandru; Giuclea, Marius

2014-01-01

Abstract Cellular automata are binary lattices used for modeling complex dynamical systems. The automaton evolves iteratively from one configuration to another, using some local transition rule based on the number of ones in the neighborhood of each cell. With respect to the number of cells allowed to change per iteration, we speak of either synchronous or asynchronous automata. If randomness is involved to some degree in the transition rule, we speak of probabilistic automata, otherwise they are called deterministic. With either type of cellular automaton we are dealing with, the main theoretical challenge stays the same: starting from an arbitrary initial configuration, predict (with highest accuracy) the end configuration. If the automaton is deterministic, the outcome simplifies to one of two configurations, all zeros or all ones. If the automaton is probabilistic, the whole process is modeled by a finite homogeneous Markov chain, and the outcome is the corresponding stationary distribution. Based on our previous results for the asynchronous case—connecting the probability of a configuration in the stationary distribution to its number of zero-one borders—the article offers both numerical and theoretical insight into the long-term behavior of synchronous cellular automata. PMID:24999557
Probabilistic cellular automata.

PubMed

Agapie, Alexandru; Andreica, Anca; Giuclea, Marius

2014-09-01

Cellular automata are binary lattices used for modeling complex dynamical systems. The automaton evolves iteratively from one configuration to another, using some local transition rule based on the number of ones in the neighborhood of each cell. With respect to the number of cells allowed to change per iteration, we speak of either synchronous or asynchronous automata. If randomness is involved to some degree in the transition rule, we speak of probabilistic automata, otherwise they are called deterministic. With either type of cellular automaton we are dealing with, the main theoretical challenge stays the same: starting from an arbitrary initial configuration, predict (with highest accuracy) the end configuration. If the automaton is deterministic, the outcome simplifies to one of two configurations, all zeros or all ones. If the automaton is probabilistic, the whole process is modeled by a finite homogeneous Markov chain, and the outcome is the corresponding stationary distribution. Based on our previous results for the asynchronous case-connecting the probability of a configuration in the stationary distribution to its number of zero-one borders-the article offers both numerical and theoretical insight into the long-term behavior of synchronous cellular automata.
Reverse engineering a gene network using an asynchronous parallel evolution strategy

PubMed Central

2010-01-01

Background The use of reverse engineering methods to infer gene regulatory networks by fitting mathematical models to gene expression data is becoming increasingly popular and successful. However, increasing model complexity means that more powerful global optimisation techniques are required for model fitting. The parallel Lam Simulated Annealing (pLSA) algorithm has been used in such approaches, but recent research has shown that island Evolutionary Strategies can produce faster, more reliable results. However, no parallel island Evolutionary Strategy (piES) has yet been demonstrated to be effective for this task. Results Here, we present synchronous and asynchronous versions of the piES algorithm, and apply them to a real reverse engineering problem: inferring parameters in the gap gene network. We find that the asynchronous piES exhibits very little communication overhead, and shows significant speed-up for up to 50 nodes: the piES running on 50 nodes is nearly 10 times faster than the best serial algorithm. We compare the asynchronous piES to pLSA on the same test problem, measuring the time required to reach particular levels of residual error, and show that it shows much faster convergence than pLSA across all optimisation conditions tested. Conclusions Our results demonstrate that the piES is consistently faster and more reliable than the pLSA algorithm on this problem, and scales better with increasing numbers of nodes. In addition, the piES is especially well suited to further improvements and adaptations: Firstly, the algorithm's fast initial descent speed and high reliability make it a good candidate for being used as part of a global/local search hybrid algorithm. Secondly, it has the potential to be used as part of a hierarchical evolutionary algorithm, which takes advantage of modern multi-core computing architectures. PMID:20196855
Identification of unknown spatial load distributions in a vibrating Euler-Bernoulli beam from limited measured data

NASA Astrophysics Data System (ADS)

Hasanov, Alemdar; Kawano, Alexandre

2016-05-01

Two types of inverse source problems of identifying asynchronously distributed spatial loads governed by the Euler-Bernoulli beam equation ρ (x){w}{tt}+μ (x){w}t+{({EI}(x){w}{xx})}{xx}-{T}r{u}{xx}={\\sum }m=1M{g}m(t){f}m(x), (x,t)\\in {{{Ω }}}T := (0,l)× (0,T), with hinged-clamped ends (w(0,t)={w}{xx}(0,t)=0,w(l,t) = {w}x(l,t)=0,t\\in (0,T)), are studied. Here {g}m(t) are linearly independent functions, describing an asynchronous temporal loading, and {f}m(x) are the spatial load distributions. In the first identification problem the values {ν }k(t),k=\\bar{1,K}, of the deflection w(x,t), are assumed to be known, as measured output data, in a neighbourhood of the finite set of points P:= \\{{x}k\\in (0,l),k=\\bar{1,K}\\}\\subset (0,l), corresponding to the internal points of a continuous beam, for all t\\in ]0,T[. In the second identification problem the values {θ }k(t),k=\\bar{1,K}, of the slope {w}x(x,t), are assumed to be known, as measured output data in a neighbourhood of the same set of points P for all t\\in ]0,T[. These inverse source problems will be defined subsequently as the problems ISP1 and ISP2. The general purpose of this study is to develop mathematical concepts and tools that are capable of providing effective numerical algorithms for the numerical solution of the considered class of inverse problems. Note that both measured output data {ν }k(t) and {θ }k(t) contain random noise. In the first part of the study we prove that each measured output data {ν }k(t) and {θ }k(t),k=\\bar{1,K} can uniquely determine the unknown functions {f}m\\in {H}-1(]0,l[),m=\\bar{1,M}. In the second part of the study we will introduce the input-output operators {{ K }}d :{L}2(0,T)\\mapsto {L}2(0,T),({{ K }}df)(t):= w(x,t;f),x\\in P, f(x) := ({f}1(x),\\ldots ,{f}M(x)), and {{ K }}s :{L}2(0,T)\\mapsto {L}2(0,T), ({{ K }}sf)(t):= {w}x(x,t;f), x\\in P , corresponding to the problems ISP1 and ISP2, and then reformulate these problems as the operator equations: {{ K }}df=ν and {{ K }}sf=θ , where ν (t):= ({ν }1(t),\\ldots ,{ν }K(t)) and {θ }k(t):= ({θ }1(t),\\ldots ,{θ }K(t)). Since both measured output data contain random noise, we use the most prominent regularisation method, Tikhonov regularisation, introducing the regularised cost functionals {J}1α (f):= (1/2)\\parallel {{ K }}df-ν {\\parallel }{L2(0,T)}2+(1/2)α \\parallel f{\\parallel }{L2(0,T)}2 and {J}2α (f):= (1/2)\\parallel {{ K }}sf-θ {\\parallel }{L2(0,T)}2+(1/2)α \\parallel f{\\parallel }{L2(0,T)}2. Using a priori estimates for the weak solution of the direct problem and the Tikhonov regularisation method combined with the adjoint problem approach, we prove that the Fréchet gradients {J}1\\prime (f) and {J}2\\prime (f) of both cost functionals can explicitly be derived via the corresponding weak solutions of adjoint problems and the known temporal loads {g}m(t). Moreover, we show that these gradients are Lipschitz continuous, which allows the use of gradient type iteration convergent algorithms. Two applications of the proposed theory are presented. It is shown that solvability results for inverse source problems related to the synchronous loading case, with a single interior measured data, are special cases of the obtained results for asynchronously distributed spatial load cases.
Single-agent parallel window search

NASA Technical Reports Server (NTRS)

Powley, Curt; Korf, Richard E.

1991-01-01

Parallel window search is applied to single-agent problems by having different processes simultaneously perform iterations of Iterative-Deepening-A(asterisk) (IDA-asterisk) on the same problem but with different cost thresholds. This approach is limited by the time to perform the goal iteration. To overcome this disadvantage, the authors consider node ordering. They discuss how global node ordering by minimum h among nodes with equal f = g + h values can reduce the time complexity of serial IDA-asterisk by reducing the time to perform the iterations prior to the goal iteration. Finally, the two ideas of parallel window search and node ordering are combined to eliminate the weaknesses of each approach while retaining the strengths. The resulting approach, called simply parallel window search, can be used to find a near-optimal solution quickly, improve the solution until it is optimal, and then finally guarantee optimality, depending on the amount of time available.
Parallel tempering simulation of the three-dimensional Edwards-Anderson model with compact asynchronous multispin coding on GPU

NASA Astrophysics Data System (ADS)

Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark

2014-10-01

Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
Solving the multiple-set split equality common fixed-point problem of firmly quasi-nonexpansive operators.

PubMed

Zhao, Jing; Zong, Haili

2018-01-01

In this paper, we propose parallel and cyclic iterative algorithms for solving the multiple-set split equality common fixed-point problem of firmly quasi-nonexpansive operators. We also combine the process of cyclic and parallel iterative methods and propose two mixed iterative algorithms. Our several algorithms do not need any prior information about the operator norms. Under mild assumptions, we prove weak convergence of the proposed iterative sequences in Hilbert spaces. As applications, we obtain several iterative algorithms to solve the multiple-set split equality problem.
Parallelized implicit propagators for the finite-difference Schrödinger equation

NASA Astrophysics Data System (ADS)

Parker, Jonathan; Taylor, K. T.

1995-08-01

We describe the application of block Gauss-Seidel and block Jacobi iterative methods to the design of implicit propagators for finite-difference models of the time-dependent Schrödinger equation. The block-wise iterative methods discussed here are mixed direct-iterative methods for solving simultaneous equations, in the sense that direct methods (e.g. LU decomposition) are used to invert certain block sub-matrices, and iterative methods are used to complete the solution. We describe parallel variants of the basic algorithm that are well suited to the medium- to coarse-grained parallelism of work-station clusters, and MIMD supercomputers, and we show that under a wide range of conditions, fine-grained parallelism of the computation can be achieved. Numerical tests are conducted on a typical one-electron atom Hamiltonian. The methods converge robustly to machine precision (15 significant figures), in some cases in as few as 6 or 7 iterations. The rate of convergence is nearly independent of the finite-difference grid-point separations.
Iterative algorithms for large sparse linear systems on parallel computers

NASA Technical Reports Server (NTRS)

Adams, L. M.

1982-01-01

Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Dynamic grid refinement for partial differential equations on parallel computers

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.
Highly efficient and exact method for parallelization of grid-based algorithms and its implementation in DelPhi

PubMed Central

Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil

2012-01-01

The Gauss-Seidel method is a standard iterative numerical method widely used to solve a system of equations and, in general, is more efficient comparing to other iterative methods, such as the Jacobi method. However, standard implementation of the Gauss-Seidel method restricts its utilization in parallel computing due to its requirement of using updated neighboring values (i.e., in current iteration) as soon as they are available. Here we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of CPUs. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, which is a finite difference Poisson-Boltzmann equation solver to model electrostatics in molecular biology. This development makes the iterative procedure on obtaining the electrostatic potential distribution in the parallelized DelPhi several folds faster than that in the serial code. Further we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures. PMID:22674480
An Asynchronous Many-Task Implementation of In-Situ Statistical Analysis using Legion.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pebay, Philippe Pierre; Bennett, Janine Camille

2015-11-01

In this report, we propose a framework for the design and implementation of in-situ analy- ses using an asynchronous many-task (AMT) model, using the Legion programming model together with the MiniAero mini-application as a surrogate for full-scale parallel scientific computing applications. The bulk of this work consists of converting the Learn/Derive/Assess model which we had initially developed for parallel statistical analysis using MPI [PTBM11], from a SPMD to an AMT model. In this goal, we propose an original use of the concept of Legion logical regions as a replacement for the parallel communication schemes used for the only operation ofmore » the statistics engines that require explicit communication. We then evaluate this proposed scheme in a shared memory environment, using the Legion port of MiniAero as a proxy for a full-scale scientific application, as a means to provide input data sets of variable size for the in-situ statistical analyses in an AMT context. We demonstrate in particular that the approach has merit, and warrants further investigation, in collaboration with ongoing efforts to improve the overall parallel performance of the Legion system.« less
AP-IO: asynchronous pipeline I/O for hiding periodic output cost in CFD simulation.

PubMed

Xiaoguang, Ren; Xinhai, Xu

2014-01-01

Computational fluid dynamics (CFD) simulation often needs to periodically output intermediate results to files in the form of snapshots for visualization or restart, which seriously impacts the performance. In this paper, we present asynchronous pipeline I/O (AP-IO) optimization scheme for the periodically snapshot output on the basis of asynchronous I/O and CFD application characteristics. In AP-IO, dedicated background I/O processes or threads are in charge of handling the file write in pipeline mode, therefore the write overhead can be hidden with more calculation than classic asynchronous I/O. We design the framework of AP-IO and implement it in OpenFOAM, providing CFD users with a user-friendly interface. Experimental results on the Tianhe-2 supercomputer demonstrate that AP-IO can achieve a good optimization effect for the periodical snapshot output in CFD application, and the effect is especially better for massively parallel CFD simulations, which can reduce the total execution time up to about 40%.

AP-IO: Asynchronous Pipeline I/O for Hiding Periodic Output Cost in CFD Simulation

PubMed Central

Xiaoguang, Ren; Xinhai, Xu

2014-01-01

Computational fluid dynamics (CFD) simulation often needs to periodically output intermediate results to files in the form of snapshots for visualization or restart, which seriously impacts the performance. In this paper, we present asynchronous pipeline I/O (AP-IO) optimization scheme for the periodically snapshot output on the basis of asynchronous I/O and CFD application characteristics. In AP-IO, dedicated background I/O processes or threads are in charge of handling the file write in pipeline mode, therefore the write overhead can be hidden with more calculation than classic asynchronous I/O. We design the framework of AP-IO and implement it in OpenFOAM, providing CFD users with a user-friendly interface. Experimental results on the Tianhe-2 supercomputer demonstrate that AP-IO can achieve a good optimization effect for the periodical snapshot output in CFD application, and the effect is especially better for massively parallel CFD simulations, which can reduce the total execution time up to about 40%. PMID:24955390
Reliability models for dataflow computer systems

NASA Technical Reports Server (NTRS)

Kavi, K. M.; Buckles, B. P.

1985-01-01

The demands for concurrent operation within a computer system and the representation of parallelism in programming languages have yielded a new form of program representation known as data flow (DENN 74, DENN 75, TREL 82a). A new model based on data flow principles for parallel computations and parallel computer systems is presented. Necessary conditions for liveness and deadlock freeness in data flow graphs are derived. The data flow graph is used as a model to represent asynchronous concurrent computer architectures including data flow computers.
Adaptive parallel logic networks

NASA Technical Reports Server (NTRS)

Martinez, Tony R.; Vidal, Jacques J.

1988-01-01

Adaptive, self-organizing concurrent systems (ASOCS) that combine self-organization with massive parallelism for such applications as adaptive logic devices, robotics, process control, and system malfunction management, are presently discussed. In ASOCS, an adaptive network composed of many simple computing elements operating in combinational and asynchronous fashion is used and problems are specified by presenting if-then rules to the system in the form of Boolean conjunctions. During data processing, which is a different operational phase from adaptation, the network acts as a parallel hardware circuit.
Twostep-by-twostep PIRK-type PC methods with continuous output formulas

NASA Astrophysics Data System (ADS)

Cong, Nguyen Huu; Xuan, Le Ngoc

2008-11-01

This paper deals with parallel predictor-corrector (PC) iteration methods based on collocation Runge-Kutta (RK) corrector methods with continuous output formulas for solving nonstiff initial-value problems (IVPs) for systems of first-order differential equations. At nth step, the continuous output formulas are used not only for predicting the stage values in the PC iteration methods but also for calculating the step values at (n+2)th step. In this case, the integration processes can be proceeded twostep-by-twostep. The resulting twostep-by-twostep (TBT) parallel-iterated RK-type (PIRK-type) methods with continuous output formulas (twostep-by-twostep PIRKC methods or TBTPIRKC methods) give us a faster integration process. Fixed stepsize applications of these TBTPIRKC methods to a few widely-used test problems reveal that the new PC methods are much more efficient when compared with the well-known parallel-iterated RK methods (PIRK methods), parallel-iterated RK-type PC methods with continuous output formulas (PIRKC methods) and sequential explicit RK codes DOPRI5 and DOP853 available from the literature.
UPC++ Programmer’s Guide (v1.0 2017.9)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bachan, J.; Baden, S.; Bonachea, D.

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, allmore » operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.« less
Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors - Performance results

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids (global and local) to provide adaptive resolution and fast solution of PDEs. Like all such methods, it offers parallelism by using possibly many disconnected patches per level, but is hindered by the need to handle these levels sequentially. The finest levels must therefore wait for processing to be essentially completed on all the coarser ones. A recently developed asynchronous version of FAC, called AFAC, completely eliminates this bottleneck to parallelism. This paper describes timing results for AFAC, coupled with a simple load balancing scheme, applied to the solution of elliptic PDEs on an Intel iPSC hypercube. These tests include performance of certain processes necessary in adaptive methods, including moving grids and changing refinement. A companion paper reports on numerical and analytical results for estimating convergence factors of AFAC applied to very large scale examples.
Saturation: An efficient iteration strategy for symbolic state-space generation

NASA Technical Reports Server (NTRS)

Ciardo, Gianfranco; Luettgen, Gerald; Siminiceanu, Radu; Bushnell, Dennis M. (Technical Monitor)

2001-01-01

This paper presents a novel algorithm for generating state spaces of asynchronous systems using Multi-valued Decision Diagrams. In contrast to related work, the next-state function of a system is not encoded as a single Boolean function, but as cross-products of integer functions. This permits the application of various iteration strategies to build a system's state space. In particular, this paper introduces a new elegant strategy, called saturation, and implements it in the tool SMART. On top of usually performing several orders of magnitude faster than existing BDD-based state-space generators, the algorithm's required peak memory is often close to the nal memory needed for storing the overall state spaces.
Multi-Objective and Multidisciplinary Design Optimisation (MDO) of UAV Systems using Hierarchical Asynchronous Parallel Evolutionary Algorithms

DTIC Science & Technology

2007-09-17

been proposed; these include a combination of variable fidelity models, parallelisation strategies and hybridisation techniques (Coello, Veldhuizen et...Coello et al (Coello, Veldhuizen et al. 2002). 4.4.2 HIERARCHICAL POPULATION TOPOLOGY A hierarchical population topology, when integrated into...to hybrid parallel Multi-Objective Evolutionary Algorithms (pMOEA) (Cantu-Paz 2000; Veldhuizen , Zydallis et al. 2003); it uses a master slave
RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

NASA Technical Reports Server (NTRS)

Dunn, W. R.

1980-01-01

RAMP consists of distributed sets of parallel computers partioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.
Data-Driven Based Asynchronous Motor Control for Printing Servo Systems

NASA Astrophysics Data System (ADS)

Bian, Min; Guo, Qingyun

Modern digital printing equipment aims to the environmental-friendly industry with high dynamic performances and control precision and low vibration and abrasion. High performance motion control system of printing servo systems was required. Control system of asynchronous motor based on data acquisition was proposed. Iterative learning control (ILC) algorithm was studied. PID control was widely used in the motion control. However, it was sensitive to the disturbances and model parameters variation. The ILC applied the history error data and present control signals to approximate the control signal directly in order to fully track the expect trajectory without the system models and structures. The motor control algorithm based on the ILC and PID was constructed and simulation results were given. The results show that data-driven control method is effective dealing with bounded disturbances for the motion control of printing servo systems.
An Environment for Incremental Development of Distributed Extensible Asynchronous Real-time Systems

NASA Technical Reports Server (NTRS)

Ames, Charles K.; Burleigh, Scott; Briggs, Hugh C.; Auernheimer, Brent

1996-01-01

Incremental parallel development of distributed real-time systems is difficult. Architectural techniques and software tools developed at the Jet Propulsion Laboratory's (JPL's) Flight System Testbed make feasible the integration of complex systems in various stages of development.
Parallel conjugate gradient algorithms for manipulator dynamic simulation

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheld, Robert E.

1989-01-01

Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n sq) on a serial processor. A conjugate gradient algorithms is presented that provide greater efficiency using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log sub 2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log sub 2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
Programs as Polypeptides.

PubMed

Williams, Lance R

2016-01-01

Object-oriented combinator chemistry (OOCC) is an artificial chemistry with composition devices borrowed from object-oriented and functional programming languages. Actors in OOCC are embedded in space and subject to diffusion; since they are neither created nor destroyed, their mass is conserved. Actors use programs constructed from combinators to asynchronously update their own states and the states of other actors in their neighborhoods. The fact that programs and combinators are themselves reified as actors makes it possible to build programs that build programs from combinators of a few primitive types using asynchronous spatial processes that resemble chemistry as much as computation. To demonstrate this, OOCC is used to define a parallel, asynchronous, spatially distributed self-replicating system modeled in part on the living cell. Since interactions among its parts result in the construction of more of these same parts, the system is strongly constructive. The system's high normalized complexity is contrasted with that of a simple composome.
AZTEC: A parallel iterative package for the solving linear systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.

1996-12-31

We describe a parallel linear system package, AZTEC. The package incorporates a number of parallel iterative methods (e.g. GMRES, biCGSTAB, CGS, TFQMR) and preconditioners (e.g. Jacobi, Gauss-Seidel, polynomial, domain decomposition with LU or ILU within subdomains). Additionally, AZTEC allows for the reuse of previous preconditioning factorizations within Newton schemes for nonlinear methods. Currently, a number of different users are using this package to solve a variety of PDE applications.
Design, development and use of the finite element machine

NASA Technical Reports Server (NTRS)

Adams, L. M.; Voigt, R. C.

1983-01-01

Some of the considerations that went into the design of the Finite Element Machine, a research asynchronous parallel computer are described. The present status of the system is also discussed along with some indication of the type of results that were obtained.
A Verification System for Distributed Objects with Asynchronous Method Calls

NASA Astrophysics Data System (ADS)

Ahrendt, Wolfgang; Dylla, Maximilian

We present a verification system for Creol, an object-oriented modeling language for concurrent distributed applications. The system is an instance of KeY, a framework for object-oriented software verification, which has so far been applied foremost to sequential Java. Building on KeY characteristic concepts, like dynamic logic, sequent calculus, explicit substitutions, and the taclet rule language, the system presented in this paper addresses functional correctness of Creol models featuring local cooperative thread parallelism and global communication via asynchronous method calls. The calculus heavily operates on communication histories which describe the interfaces of Creol units. Two example scenarios demonstrate the usage of the system.
Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer

Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leafs of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the rootmore » processor to each processor in a communicator but instead is replicated on several intermediate nodes which then replicated the message to nodes in lower leafs. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.« less
Exploring Asynchronous Many-Task Runtime Systems toward Extreme Scales

DOE Office of Scientific and Technical Information (OSTI.GOV)

Knight, Samuel; Baker, Gavin Matthew; Gamell, Marc

2015-10-01

Major exascale computing reports indicate a number of software challenges to meet the dramatic change of system architectures in near future. While several-orders-of-magnitude increase in parallelism is the most commonly cited of those, hurdles also include performance heterogeneity of compute nodes across the system, increased imbalance between computational capacity and I/O capabilities, frequent system interrupts, and complex hardware architectures. Asynchronous task-parallel programming models show a great promise in addressing these issues, but are not yet fully understood nor developed su ciently for computational science and engineering application codes. We address these knowledge gaps through quantitative and qualitative exploration of leadingmore » candidate solutions in the context of engineering applications at Sandia. In this poster, we evaluate MiniAero code ported to three leading candidate programming models (Charm++, Legion and UINTAH) to examine the feasibility of these models that permits insertion of new programming model elements into an existing code base.« less
Optimization by nonhierarchical asynchronous decomposition

NASA Technical Reports Server (NTRS)

Shankar, Jayashree; Ribbens, Calvin J.; Haftka, Raphael T.; Watson, Layne T.

1992-01-01

Large scale optimization problems are tractable only if they are somehow decomposed. Hierarchical decompositions are inappropriate for some types of problems and do not parallelize well. Sobieszczanski-Sobieski has proposed a nonhierarchical decomposition strategy for nonlinear constrained optimization that is naturally parallel. Despite some successes on engineering problems, the algorithm as originally proposed fails on simple two dimensional quadratic programs. The algorithm is carefully analyzed for quadratic programs, and a number of modifications are suggested to improve its robustness.
Portable parallel portfolio optimization in the Aurora Financial Management System

NASA Astrophysics Data System (ADS)

Laure, Erwin; Moritsch, Hans

2001-07-01

Financial planning problems are formulated as large scale, stochastic, multiperiod, tree structured optimization problems. An efficient technique for solving this kind of problems is the nested Benders decomposition method. In this paper we present a parallel, portable, asynchronous implementation of this technique. To achieve our portability goals we elected the programming language Java for our implementation and used a high level Java based framework, called OpusJava, for expressing the parallelism potential as well as synchronization constraints. Our implementation is embedded within a modular decision support tool for portfolio and asset liability management, the Aurora Financial Management System.

Matching pursuit parallel decomposition of seismic data

NASA Astrophysics Data System (ADS)

Li, Chuanhui; Zhang, Fanchang

2017-07-01

In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.
Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression

NASA Technical Reports Server (NTRS)

Klimesh, Matthew A.; Kiely, Aaron B.

2014-01-01

family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.
Fast l₁-SPIRiT compressed sensing parallel imaging MRI: scalable parallel implementation and clinically feasible runtime.

PubMed

Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

2012-06-01

We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

PubMed Central

Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

2012-01-01

We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation.

DTIC Science & Technology

1986-03-01

the proposed approaches 16, 16, 40 . 451. The conclusion most often reached is that the best scheme to use in a particular design depends highly upon...76. 40 . Siegel, H. J., McMillen. R. J., and Mueller. P. T.. Jr. A survey of interconnection methods for reconligurable parallel processing systems...addressing meehaanm distributed in the network area rimonication% tit reach gigabit./second speeds je g.. PoCoS83 .’ i.V--i the lirO! lk i nitronment is
Solution of the within-group multidimensional discrete ordinates transport equations on massively parallel architectures

NASA Astrophysics Data System (ADS)

Zerr, Robert Joseph

2011-12-01

The integral transport matrix method (ITMM) has been used as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells and between the cells and boundary surfaces. The main goals of this work were to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and performance of the developed methods for increasing number of processes. This project compares the effectiveness of the ITMM with the SI scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. The primary parallel solution method involves a decomposition of the domain into smaller spatial sub-domains, each with their own transport matrices, and coupled together via interface boundary angular fluxes. Each sub-domain has its own set of ITMM operators and represents an independent transport problem. Multiple iterative parallel solution methods have investigated, including parallel block Jacobi (PBJ), parallel red/black Gauss-Seidel (PGS), and parallel GMRES (PGMRES). The fastest observed parallel solution method, PGS, was used in a weak scaling comparison with the PARTISN code. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method without acceleration/preconditioning is not competitive for any problem parameters considered. The best comparisons occur for problems that are difficult for SI DSA, namely highly scattering and optically thick. SI DSA execution time curves are generally steeper than the PGS ones. However, until further testing is performed it cannot be concluded that SI DSA does not outperform the ITMM with PGS even on several thousand or tens of thousands of processors. The PGS method does outperform SI DSA for the periodic heterogeneous layers (PHL) configuration problems. Although this demonstrates a relative strength/weakness between the two methods, the practicality of these problems is much less, further limiting instances where it would be beneficial to select ITMM over SI DSA. The results strongly indicate a need for a robust, stable, and efficient acceleration method (or preconditioner for PGMRES). The spatial multigrid (SMG) method is currently incomplete in that it does not work for all cases considered and does not effectively improve the convergence rate for all values of scattering ratio c or cell dimension h. Nevertheless, it does display the desired trend for highly scattering, optically thin problems. That is, it tends to lower the rate of growth of number of iterations with increasing number of processes, P, while not increasing the number of additional operations per iteration to the extent that the total execution time of the rapidly converging accelerated iterations exceeds that of the slower unaccelerated iterations. A predictive parallel performance model has been developed for the PBJ method. Timing tests were performed such that trend lines could be fitted to the data for the different components and used to estimate the execution times. Applied to the weak scaling results, the model notably underestimates construction time, but combined with a slight overestimation in iterative solution time, the model predicts total execution time very well for large P. It also does a decent job with the strong scaling results, closely predicting the construction time and time per iteration, especially as P increases. Although not shown to be competitive up to 1,024 processing elements with the current state of the art, the parallelized ITMM exhibits promising scaling trends. Ultimately, compared to the KBA method, the parallelized ITMM may be found to be a very attractive option for transport calculations spatially decomposed over several tens of thousands of processes. Acceleration/preconditioning of the parallelized ITMM once developed will improve the convergence rate and improve its competitiveness. (Abstract shortened by UMI.)
Computational aspects of helicopter trim analysis and damping levels from Floquet theory

NASA Technical Reports Server (NTRS)

Gaonkar, Gopal H.; Achar, N. S.

1992-01-01

Helicopter trim settings of periodic initial state and control inputs are investigated for convergence of Newton iteration in computing the settings sequentially and in parallel. The trim analysis uses a shooting method and a weak version of two temporal finite element methods with displacement formulation and with mixed formulation of displacements and momenta. These three methods broadly represent two main approaches of trim analysis: adaptation of initial-value and finite element boundary-value codes to periodic boundary conditions, particularly for unstable and marginally stable systems. In each method, both the sequential and in-parallel schemes are used and the resulting nonlinear algebraic equations are solved by damped Newton iteration with an optimally selected damping parameter. The impact of damped Newton iteration, including earlier-observed divergence problems in trim analysis, is demonstrated by the maximum condition number of the Jacobian matrices of the iterative scheme and by virtual elimination of divergence. The advantages of the in-parallel scheme over the conventional sequential scheme are also demonstrated.
Run-time parallelization and scheduling of loops

NASA Technical Reports Server (NTRS)

Saltz, Joel H.; Mirchandaney, Ravi; Crowley, Kay

1991-01-01

Run-time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run-time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing, and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indexes can have a significant impact on performance.
A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
An iterative reduced field-of-view reconstruction for periodically rotated overlapping parallel lines with enhanced reconstruction (PROPELLER) MRI.

PubMed

Lin, Jyh-Miin; Patterson, Andrew J; Chang, Hing-Chiu; Gillard, Jonathan H; Graves, Martin J

2015-10-01

To propose a new reduced field-of-view (rFOV) strategy for iterative reconstructions in a clinical environment. Iterative reconstructions can incorporate regularization terms to improve the image quality of periodically rotated overlapping parallel lines with enhanced reconstruction (PROPELLER) MRI. However, the large amount of calculations required for full FOV iterative reconstructions has posed a huge computational challenge for clinical usage. By subdividing the entire problem into smaller rFOVs, the iterative reconstruction can be accelerated on a desktop with a single graphic processing unit (GPU). This rFOV strategy divides the iterative reconstruction into blocks, based on the block-diagonal dominant structure. A near real-time reconstruction system was developed for the clinical MR unit, and parallel computing was implemented using the object-oriented model. In addition, the Toeplitz method was implemented on the GPU to reduce the time required for full interpolation. Using the data acquired from the PROPELLER MRI, the reconstructed images were then saved in the digital imaging and communications in medicine format. The proposed rFOV reconstruction reduced the gridding time by 97%, as the total iteration time was 3 s even with multiple processes running. A phantom study showed that the structure similarity index for rFOV reconstruction was statistically superior to conventional density compensation (p < 0.001). In vivo study validated the increased signal-to-noise ratio, which is over four times higher than with density compensation. Image sharpness index was improved using the regularized reconstruction implemented. The rFOV strategy permits near real-time iterative reconstruction to improve the image quality of PROPELLER images. Substantial improvements in image quality metrics were validated in the experiments. The concept of rFOV reconstruction may potentially be applied to other kinds of iterative reconstructions for shortened reconstruction duration.
Vectorized and multitasked solution of the few-group neutron diffusion equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zee, S.K.; Turinsky, P.J.; Shayer, Z.

1989-03-01

A numerical algorithm with parallelism was used to solve the two-group, multidimensional neutron diffusion equations on computers characterized by shared memory, vector pipeline, and multi-CPU architecture features. Specifically, solutions were obtained on the Cray X/MP-48, the IBM-3090 with vector facilities, and the FPS-164. The material-centered mesh finite difference method approximation and outer-inner iteration method were employed. Parallelism was introduced in the inner iterations using the cyclic line successive overrelaxation iterative method and solving in parallel across lines. The outer iterations were completed using the Chebyshev semi-iterative method that allows parallelism to be introduced in both space and energy groups. Formore » the three-dimensional model, power, soluble boron, and transient fission product feedbacks were included. Concentrating on the pressurized water reactor (PWR), the thermal-hydraulic calculation of moderator density assumed single-phase flow and a closed flow channel, allowing parallelism to be introduced in the solution across the radial plane. Using a pinwise detail, quarter-core model of a typical PWR in cycle 1, for the two-dimensional model without feedback the measured million floating point operations per second (MFLOPS)/vector speedups were 83/11.7. 18/2.2, and 2.4/5.6 on the Cray, IBM, and FPS without multitasking, respectively. Lower performance was observed with a coarser mesh, i.e., shorter vector length, due to vector pipeline start-up. For an 18 x 18 x 30 (x-y-z) three-dimensional model with feedback of the same core, MFLOPS/vector speedups of --61/6.7 and an execution time of 0.8 CPU seconds on the Cray without multitasking were measured. Finally, using two CPUs and the vector pipelines of the Cray, a multitasking efficiency of 81% was noted for the three-dimensional model.« less
An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

2015-07-01

This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for current class of multicore architecture. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism however, when it comes to 3D data migration, as the data size increases the resource requirement of the algorithm also increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of Kirchhoff depth migration algorithm is the calculation of traveltime tables due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
An asynchronous traversal engine for graph-based rich metadata management

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dai, Dong; Carns, Philip; Ross, Robert B.

Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenancemore » data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.« less
An asynchronous traversal engine for graph-based rich metadata management

DOE PAGES

Dai, Dong; Carns, Philip; Ross, Robert B.; ...

2016-06-23

Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenancemore » data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format. We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.« less
Using parallel banded linear system solvers in generalized eigenvalue problems

NASA Technical Reports Server (NTRS)

Zhang, Hong; Moss, William F.

1993-01-01

Subspace iteration is a reliable and cost effective method for solving positive definite banded symmetric generalized eigenproblems, especially in the case of large scale problems. This paper discusses an algorithm that makes use of two parallel banded solvers in subspace iteration. A shift is introduced to decompose the banded linear systems into relatively independent subsystems and to accelerate the iterations. With this shift, an eigenproblem is mapped efficiently into the memories of a multiprocessor and a high speed-up is obtained for parallel implementations. An optimal shift is a shift that balances total computation and communication costs. Under certain conditions, we show how to estimate an optimal shift analytically using the decay rate for the inverse of a banded matrix, and how to improve this estimate. Computational results on iPSC/2 and iPSC/860 multiprocessors are presented.
Parallel solution of the symmetric tridiagonal eigenproblem. Research report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jessup, E.R.

1989-10-01

This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed-memory Multiple Instruction, Multiple Data multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speed up, and accuracy. Experiments on an IPSC hypercube multiprocessor reveal that Cuppen's method ismore » the most accurate approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effect of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptions of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Parallel solution of the symmetric tridiagonal eigenproblem

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jessup, E.R.

1989-01-01

This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed memory MIMD multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues, and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speedup, and accuracy. Experiments on an iPSC hyper-cube multiprocessor reveal that Cuppen's method is the most accuratemore » approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effects of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptations of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Parallel Preconditioning for CFD Problems on the CM-5

NASA Technical Reports Server (NTRS)

Simon, Horst D.; Kremenetsky, Mark D.; Richardson, John; Lasinski, T. A. (Technical Monitor)

1994-01-01

Up to today, preconditioning methods on massively parallel systems have faced a major difficulty. The most successful preconditioning methods in terms of accelerating the convergence of the iterative solver such as incomplete LU factorizations are notoriously difficult to implement on parallel machines for two reasons: (1) the actual computation of the preconditioner is not very floating-point intensive, but requires a large amount of unstructured communication, and (2) the application of the preconditioning matrix in the iteration phase (i.e. triangular solves) are difficult to parallelize because of the recursive nature of the computation. Here we present a new approach to preconditioning for very large, sparse, unsymmetric, linear systems, which avoids both difficulties. We explicitly compute an approximate inverse to our original matrix. This new preconditioning matrix can be applied most efficiently for iterative methods on massively parallel machines, since the preconditioning phase involves only a matrix-vector multiplication, with possibly a dense matrix. Furthermore the actual computation of the preconditioning matrix has natural parallelism. For a problem of size n, the preconditioning matrix can be computed by solving n independent small least squares problems. The algorithm and its implementation on the Connection Machine CM-5 are discussed in detail and supported by extensive timings obtained from real problem data.
Parallel iterative solution for h and p approximations of the shallow water equations

USGS Publications Warehouse

Barragy, E.J.; Walters, R.A.

1998-01-01

A p finite element scheme and parallel iterative solver are introduced for a modified form of the shallow water equations. The governing equations are the three-dimensional shallow water equations. After a harmonic decomposition in time and rearrangement, the resulting equations are a complex Helmholz problem for surface elevation, and a complex momentum equation for the horizontal velocity. Both equations are nonlinear and the resulting system is solved using the Picard iteration combined with a preconditioned biconjugate gradient (PBCG) method for the linearized subproblems. A subdomain-based parallel preconditioner is developed which uses incomplete LU factorization with thresholding (ILUT) methods within subdomains, overlapping ILUT factorizations for subdomain boundaries and under-relaxed iteration for the resulting block system. The method builds on techniques successfully applied to linear elements by introducing ordering and condensation techniques to handle uniform p refinement. The combined methods show good performance for a range of p (element order), h (element size), and N (number of processors). Performance and scalability results are presented for a field scale problem where up to 512 processors are used. ?? 1998 Elsevier Science Ltd. All rights reserved.
Simplified Parallel Domain Traversal

DOE Office of Scientific and Technical Information (OSTI.GOV)

Erickson III, David J

2011-01-01

Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep bymore » performing teleconnection analysis across ensemble runs of terascale atmospheric CO{sub 2} and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.« less

Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation

PubMed Central

Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho

2014-01-01

The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting. PMID:27081299
Handling Big Data in Medical Imaging: Iterative Reconstruction with Large-Scale Automated Parallel Computation.

PubMed

Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho

2014-11-01

The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting.
Ultrawideband asynchronous tracking system and method

NASA Technical Reports Server (NTRS)

Arndt, G. Dickey (Inventor); Ngo, Phong H. (Inventor); Phan, Chau T. (Inventor); Gross, Julia A. (Inventor); Ni, Jianjun (Inventor); Dusl, John (Inventor)

2012-01-01

A passive tracking system is provided with a plurality of ultrawideband (UWB) receivers that is asynchronous with respect to a UWB transmitter. A geometry of the tracking system may utilize a plurality of clusters with each cluster comprising a plurality of antennas. Time Difference of Arrival (TDOA) may be determined for the antennas in each cluster and utilized to determine Angle of Arrival (AOA) based on a far field assumption regarding the geometry. Parallel software communication sockets may be established with each of the plurality of UWB receivers. Transfer of waveform data may be processed by alternately receiving packets of waveform data from each UWB receiver. Cross Correlation Peak Detection (CCPD) is utilized to estimate TDOA information to reduce errors in a noisy, multipath environment.
Run-time parallelization and scheduling of loops

NASA Technical Reports Server (NTRS)

Saltz, Joel H.; Mirchandaney, Ravi; Crowley, Kay

1990-01-01

Run time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases, where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run time, wave fronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure.
Multi-petascale highly efficient parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time andmore » supports DMA functionality allowing for parallel processing message-passing.« less
Data Elevator

DOE Office of Scientific and Technical Information (OSTI.GOV)

BYNA, SUNRENDRA; DONG, BIN; WU, KESHENG

Data Elevator: Efficient Asynchronous Data Movement in Hierarchical Storage Systems Multi-layer storage subsystems, including SSD-based burst buffers and disk-based parallel file systems (PFS), are becoming part of HPC systems. However, software for this storage hierarchy is still in its infancy. Applications may have to explicitly move data among the storage layers. We propose Data Elevator for transparently and efficiently moving data between a burst buffer and a PFS. Users specify the final destination for their data, typically on PFS, Data Elevator intercepts the I/O calls, stages data on burst buffer, and then asynchronously transfers the data to their final destinationmore » in the background. This system allows extensive optimizations, such as overlapping read and write operations, choosing I/O modes, and aligning buffer boundaries. In tests with large-scale scientific applications, Data Elevator is as much as 4.2X faster than Cray DataWarp, the start-of-art software for burst buffer, and 4X faster than directly writing to PFS. The Data Elevator library uses HDF5's Virtual Object Layer (VOL) for intercepting parallel I/O calls that write data to PFS. The intercepted calls are redirected to the Data Elevator, which provides a handle to write the file in a faster and intermediate burst buffer system. Once the application finishes writing the data to the burst buffer, the Data Elevator job uses HDF5 to move the data to final destination in an asynchronous manner. Hence, using the Data Elevator library is currently useful for applications that call HDF5 for writing data files. Also, the Data Elevator depends on the HDF5 VOL functionality.« less
Parallel asynchronous systems and image processing algorithms

NASA Technical Reports Server (NTRS)

Coon, D. D.; Perera, A. G. U.

1989-01-01

A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to evey pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.
Domain decomposition methods for the parallel computation of reacting flows

NASA Technical Reports Server (NTRS)

Keyes, David E.

1988-01-01

Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate for it approximately 10-fold speedup on 16 processors.
Superconducting magnetic energy storage for asynchronous electrical systems

DOEpatents

Boenig, Heinrich J.

1986-01-01

A superconducting magnetic energy storage coil connected in parallel between converters of two or more ac power systems provides load leveling and stability improvement to any or all of the ac systems. Control is provided to direct the charging and independently the discharging of the superconducting coil to at least a selected one of the ac power systems.
Project Integration Architecture: Distributed Lock Management, Deadlock Detection, and Set Iteration

NASA Technical Reports Server (NTRS)

Jones, William Henry

2005-01-01

The migration of the Project Integration Architecture (PIA) to the distributed object environment of the Common Object Request Broker Architecture (CORBA) brings with it the nearly unavoidable requirements of multiaccessor, asynchronous operations. In order to maintain the integrity of data structures in such an environment, it is necessary to provide a locking mechanism capable of protecting the complex operations typical of the PIA architecture. This paper reports on the implementation of a locking mechanism to treat that need. Additionally, the ancillary features necessary to make the distributed lock mechanism work are discussed.
Image segmentation by iterative parallel region growing with application to data compression and image analysis

NASA Technical Reports Server (NTRS)

Tilton, James C.

1988-01-01

Image segmentation can be a key step in data compression and image analysis. However, the segmentation results produced by most previous approaches to region growing are suspect because they depend on the order in which portions of the image are processed. An iterative parallel segmentation algorithm avoids this problem by performing globally best merges first. Such a segmentation approach, and two implementations of the approach on NASA's Massively Parallel Processor (MPP) are described. Application of the segmentation approach to data compression and image analysis is then described, and results of such application are given for a LANDSAT Thematic Mapper image.
Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

NASA Astrophysics Data System (ADS)

Qin, Cheng-Zhi; Zhan, Lijun

2012-06-01

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
Parallel computation of multigroup reactivity coefficient using iterative method

NASA Astrophysics Data System (ADS)

Susmikanti, Mike; Dewayatna, Winter

2013-09-01

One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.
A software architecture for multidisciplinary applications: Integrating task and data parallelism

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Mehrotra, Piyush; Vanrosendale, John; Zima, Hans

1994-01-01

Data parallel languages such as Vienna Fortran and HPF can be successfully applied to a wide range of numerical applications. However, many advanced scientific and engineering applications are of a multidisciplinary and heterogeneous nature and thus do not fit well into the data parallel paradigm. In this paper we present new Fortran 90 language extensions to fill this gap. Tasks can be spawned as asynchronous activities in a homogeneous or heterogeneous computing environment; they interact by sharing access to Shared Data Abstractions (SDA's). SDA's are an extension of Fortran 90 modules, representing a pool of common data, together with a set of Methods for controlled access to these data and a mechanism for providing persistent storage. Our language supports the integration of data and task parallelism as well as nested task parallelism and thus can be used to express multidisciplinary applications in a natural and efficient way.
Data preprocessing for determining outer/inner parallelization in the nested loop problem using OpenMP

NASA Astrophysics Data System (ADS)

Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.

2017-07-01

Multi-thread programming using OpenMP on the shared-memory architecture with hyperthreading technology allows the resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, its speedup depends on the ability of the processor to execute threads in limited quantities, especially the sequential algorithm which contains a nested loop. The number of the outer loop iterations is greater than the maximum number of threads that can be executed by a processor. The thread distribution technique that had been found previously only be applied by the high-level programmer. This paper generates a parallelization procedure for low-level programmer in dealing with 2-level nested loop problems with the maximum number of threads that can be executed by a processor is smaller than the number of the outer loop iterations. Data preprocessing which is related to the number of the outer loop and the inner loop iterations, the computational time required to execute each iteration and the maximum number of threads that can be executed by a processor are used as a strategy to determine which parallel region that will produce optimal speedup.
Efficient Iterative Methods Applied to the Solution of Transonic Flows

NASA Astrophysics Data System (ADS)

Wissink, Andrew M.; Lyrintzis, Anastasios S.; Chronopoulos, Anthony T.

1996-02-01

We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested; a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel thinking machine CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems.
A Parallel Numerical Algorithm To Solve Linear Systems Of Equations Emerging From 3D Radiative Transfer

NASA Astrophysics Data System (ADS)

Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.

2016-10-01

Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.
Multidisciplinary systems optimization by linear decomposition

NASA Technical Reports Server (NTRS)

Sobieski, J.

1984-01-01

In a typical design process major decisions are made sequentially. An illustrated example is given for an aircraft design in which the aerodynamic shape is usually decided first, then the airframe is sized for strength and so forth. An analogous sequence could be laid out for any other major industrial product, for instance, a ship. The loops in the discipline boxes symbolize iterative design improvements carried out within the confines of a single engineering discipline, or subsystem. The loops spanning several boxes depict multidisciplinary design improvement iterations. Omitted for graphical simplicity is parallelism of the disciplinary subtasks. The parallelism is important in order to develop a broad workfront necessary to shorten the design time. If all the intradisciplinary and interdisciplinary iterations were carried out to convergence, the process could yield a numerically optimal design. However, it usually stops short of that because of time and money limitations. This is especially true for the interdisciplinary iterations.
Distributed Computing for Signal Processing: Modeling of Asynchronous Parallel Computation. Appendix G. On the Design and Modeling of Special Purpose Parallel Processing Systems.

DTIC Science & Technology

1985-05-01

unit in the data base, with knowing one generic assembly language. °-’--a 139 The 5-tuple describing single operation execution time of the operations...TSi-- generate , random eventi ( ,.0-15 tieit tmls - ((floa egus ()16 274 r Ispt imet imel I at :EVE’JS- II ktime=0.0; /0 present time 0/ rrs ptime=0.0...computing machinery capable of performing these tasks within a given time constraint. Because the majority of the available computing machinery is general
Proxy-equation paradigm: A strategy for massively parallel asynchronous computations

NASA Astrophysics Data System (ADS)

Mittal, Ankita; Girimaji, Sharath

2017-09-01

Massively parallel simulations of transport equation systems call for a paradigm change in algorithm development to achieve efficient scalability. Traditional approaches require time synchronization of processing elements (PEs), which severely restricts scalability. Relaxing synchronization requirement introduces error and slows down convergence. In this paper, we propose and develop a novel "proxy equation" concept for a general transport equation that (i) tolerates asynchrony with minimal added error, (ii) preserves convergence order and thus, (iii) expected to scale efficiently on massively parallel machines. The central idea is to modify a priori the transport equation at the PE boundaries to offset asynchrony errors. Proof-of-concept computations are performed using a one-dimensional advection (convection) diffusion equation. The results demonstrate the promise and advantages of the present strategy.

Ropes: Support for collective opertions among distributed threads

NASA Technical Reports Server (NTRS)

Haines, Matthew; Mehrotra, Piyush; Cronk, David

1995-01-01

Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. Recently, systems have been designed and implemented to support interprocessor communication between lightweight threads so that threads can be exploited in a distributed memory system. Their use, in this setting, has been largely restricted to supporting latency hiding techniques and functional parallelism within a single application. However, to execute data parallel codes independent of other threads in the system, collective operations and relative indexing among threads are required. This paper describes the design of ropes: a scoping mechanism for collective operations and relative indexing among threads. We present the design of ropes in the context of the Chant system, and provide performance results evaluating our initial design decisions.
Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda A [Rochester, MN; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

2012-01-10

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda E [Cambridge, MA; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

2012-04-17

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Towards Cloud-based Asynchronous Elasticity for Iterative HPC Applications

NASA Astrophysics Data System (ADS)

da Rosa Righi, Rodrigo; Facco Rodrigues, Vinicius; André da Costa, Cristiano; Kreutz, Diego; Heiss, Hans-Ulrich

2015-10-01

Elasticity is one of the key features of cloud computing. It allows applications to dynamically scale computing and storage resources, avoiding over- and under-provisioning. In high performance computing (HPC), initiatives are normally modeled to handle bag-of-tasks or key-value applications through a load balancer and a loosely-coupled set of virtual machine (VM) instances. In the joint-field of Message Passing Interface (MPI) and tightly-coupled HPC applications, we observe the need of rewriting source codes, previous knowledge of the application and/or stop-reconfigure-and-go approaches to address cloud elasticity. Besides, there are problems related to how profit this new feature in the HPC scope, since in MPI 2.0 applications the programmers need to handle communicators by themselves, and a sudden consolidation of a VM, together with a process, can compromise the entire execution. To address these issues, we propose a PaaS-based elasticity model, named AutoElastic. It acts as a middleware that allows iterative HPC applications to take advantage of dynamic resource provisioning of cloud infrastructures without any major modification. AutoElastic provides a new concept denoted here as asynchronous elasticity, i.e., it provides a framework to allow applications to either increase or decrease their computing resources without blocking the current execution. The feasibility of AutoElastic is demonstrated through a prototype that runs a CPU-bound numerical integration application on top of the OpenNebula middleware. The results showed the saving of about 3 min at each scaling out operations, emphasizing the contribution of the new concept on contexts where seconds are precious.
Opus: A Coordination Language for Multidisciplinary Applications

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Haines, Matthew; Mehrotra, Piyush; Zima, Hans; vanRosendale, John

1997-01-01

Data parallel languages, such as High Performance fortran, can be successfully applied to a wide range of numerical applications. However, many advanced scientific and engineering applications are multidisciplinary and heterogeneous in nature, and thus do not fit well into the data parallel paradigm. In this paper we present Opus, a language designed to fill this gap. The central concept of Opus is a mechanism called ShareD Abstractions (SDA). An SDA can be used as a computation server, i.e., a locus of computational activity, or as a data repository for sharing data between asynchronous tasks. SDAs can be internally data parallel, providing support for the integration of data and task parallelism as well as nested task parallelism. They can thus be used to express multidisciplinary applications in a natural and efficient way. In this paper we describe the features of the language through a series of examples and give an overview of the runtime support required to implement these concepts in parallel and distributed environments.
Analysis and Design of ITER 1 MV Core Snubber

NASA Astrophysics Data System (ADS)

Wang, Haitian; Li, Ge

2012-11-01

The core snubber, as a passive protection device, can suppress arc current and absorb stored energy in stray capacitance during the electrical breakdown in accelerating electrodes of ITER NBI. In order to design the core snubber of ITER, the control parameters of the arc peak current have been firstly analyzed by the Fink-Baker-Owren (FBO) method, which are used for designing the DIIID 100 kV snubber. The B-H curve can be derived from the measured voltage and current waveforms, and the hysteresis loss of the core snubber can be derived using the revised parallelogram method. The core snubber can be a simplified representation as an equivalent parallel resistance and inductance, which has been neglected by the FBO method. A simulation code including the parallel equivalent resistance and inductance has been set up. The simulation and experiments result in dramatically large arc shorting currents due to the parallel inductance effect. The case shows that the core snubber utilizing the FBO method gives more compact design.
Towards a Better Distributed Framework for Learning Big Data

DTIC Science & Technology

2017-06-14

UNLIMITED: PB Public Release 13. SUPPLEMENTARY NOTES 14. ABSTRACT This work aimed at solving issues in distributed machine learning. The PI’s team proposed...communication load. Finally, the team proposed the parallel least-squares policy iteration (parallel LSPI) to parallelize a reinforcement policy learning. 15
Parallel/Vector Integration Methods for Dynamical Astronomy

NASA Astrophysics Data System (ADS)

Fukushima, Toshio

1999-01-01

This paper reviews three recent works on the numerical methods to integrate ordinary differential equations (ODE), which are specially designed for parallel, vector, and/or multi-processor-unit(PU) computers. The first is the Picard-Chebyshev method (Fukushima, 1997a). It obtains a global solution of ODE in the form of Chebyshev polynomial of large (> 1000) degree by applying the Picard iteration repeatedly. The iteration converges for smooth problems and/or perturbed dynamics. The method runs around 100-1000 times faster in the vector mode than in the scalar mode of a certain computer with vector processors (Fukushima, 1997b). The second is a parallelization of a symplectic integrator (Saha et al., 1997). It regards the implicit midpoint rules covering thousands of timesteps as large-scale nonlinear equations and solves them by the fixed-point iteration. The method is applicable to Hamiltonian systems and is expected to lead an acceleration factor of around 50 in parallel computers with more than 1000 PUs. The last is a parallelization of the extrapolation method (Ito and Fukushima, 1997). It performs trial integrations in parallel. Also the trial integrations are further accelerated by balancing computational load among PUs by the technique of folding. The method is all-purpose and achieves an acceleration factor of around 3.5 by using several PUs. Finally, we give a perspective on the parallelization of some implicit integrators which require multiple corrections in solving implicit formulas like the implicit Hermitian integrators (Makino and Aarseth, 1992), (Hut et al., 1995) or the implicit symmetric multistep methods (Fukushima, 1998), (Fukushima, 1999).
Highly Scalable Asynchronous Computing Method for Partial Differential Equations: A Path Towards Exascale

NASA Astrophysics Data System (ADS)

Konduri, Aditya

Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which result in a multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand for extreme levels of parallelism. At realistic conditions, simulations are being carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs as well as their synchronization at these extreme scales take up a significant portion of the total simulation time and result in poor scalability of codes. This issue is likely to pose a bottleneck in scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is conserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors drop always to first-order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend, not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extend this method in solving complex multi-scale problems on Exascale machines.
On a model of three-dimensional bursting and its parallel implementation

NASA Astrophysics Data System (ADS)

Tabik, S.; Romero, L. F.; Garzón, E. M.; Ramos, J. I.

2008-04-01

A mathematical model for the simulation of three-dimensional bursting phenomena and its parallel implementation are presented. The model consists of four nonlinearly coupled partial differential equations that include fast and slow variables, and exhibits bursting in the absence of diffusion. The differential equations have been discretized by means of a second-order accurate in both space and time, linearly-implicit finite difference method in equally-spaced grids. The resulting system of linear algebraic equations at each time level has been solved by means of the Preconditioned Conjugate Gradient (PCG) method. Three different parallel implementations of the proposed mathematical model have been developed; two of these implementations, i.e., the MPI and the PETSc codes, are based on a message passing paradigm, while the third one, i.e., the OpenMP code, is based on a shared space address paradigm. These three implementations are evaluated on two current high performance parallel architectures, i.e., a dual-processor cluster and a Shared Distributed Memory (SDM) system. A novel representation of the results that emphasizes the most relevant factors that affect the performance of the paralled implementations, is proposed. The comparative analysis of the computational results shows that the MPI and the OpenMP implementations are about twice more efficient than the PETSc code on the SDM system. It is also shown that, for the conditions reported here, the nonlinear dynamics of the three-dimensional bursting phenomena exhibits three stages characterized by asynchronous, synchronous and then asynchronous oscillations, before a quiescent state is reached. It is also shown that the fast system reaches steady state in much less time than the slow variables.
Efficient iterative methods applied to the solution of transonic flows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wissink, A.M.; Lyrintzis, A.S.; Chronopoulos, A.T.

1996-02-01

We investigate the use of an inexact Newton`s method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton`s method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested; a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GIVIRES method. The preconditionermore » is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton- GIVIRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel thinking machine CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems. 38 refs., 14 figs., 7 tabs.« less
Visual tracking using neuromorphic asynchronous event-based cameras.

PubMed

Ni, Zhenjiang; Ieng, Sio-Hoi; Posch, Christoph; Régnier, Stéphane; Benosman, Ryad

2015-04-01

This letter presents a novel computationally efficient and robust pattern tracking method based on a time-encoded, frame-free visual data. Recent interdisciplinary developments, combining inputs from engineering and biology, have yielded a novel type of camera that encodes visual information into a continuous stream of asynchronous, temporal events. These events encode temporal contrast and intensity locally in space and time. We show that the sparse yet accurately timed information is well suited as a computational input for object tracking. In this letter, visual data processing is performed for each incoming event at the time it arrives. The method provides a continuous and iterative estimation of the geometric transformation between the model and the events representing the tracked object. It can handle isometry, similarities, and affine distortions and allows for unprecedented real-time performance at equivalent frame rates in the kilohertz range on a standard PC. Furthermore, by using the dimension of time that is currently underexploited by most artificial vision systems, the method we present is able to solve ambiguous cases of object occlusions that classical frame-based techniques handle poorly.
Highly Asynchronous VisitOr Queue Graph Toolkit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pearce, R.

2012-10-01

HAVOQGT is a C++ framework that can be used to create highly parallel graph traversal algorithms. The framework stores the graph and algorithmic data structures on external memory that is typically mapped to high performance locally attached NAND FLASH arrays. The framework supports a vertex-centered visitor programming model. The frameworkd has been used to implement breadth first search, connected components, and single source shortest path.
Block iterative restoration of astronomical images with the massively parallel processor

NASA Technical Reports Server (NTRS)

Heap, Sara R.; Lindler, Don J.

1987-01-01

A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.
Increasing processor utilization during parallel computation rundown

NASA Technical Reports Server (NTRS)

Jones, W. H.

1986-01-01

Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.
Parallel adaptive wavelet collocation method for PDEs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nejadmalayeri, Alireza, E-mail: Alireza.Nejadmalayeri@gmail.com; Vezolainen, Alexei, E-mail: Alexei.Vezolainen@Colorado.edu; Brown-Dymkoski, Eric, E-mail: Eric.Browndymkoski@Colorado.edu

2015-10-01

A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allowsmore » fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048{sup 3} using as many as 2048 CPU cores.« less
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram

2013-04-09

A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithmmore » is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.« less
A Two Colorable Fourth Order Compact Difference Scheme and Parallel Iterative Solution of the 3D Convection Diffusion Equation

NASA Technical Reports Server (NTRS)

Zhang, Jun; Ge, Lixin; Kouatchou, Jules

2000-01-01

A new fourth order compact difference scheme for the three dimensional convection diffusion equation with variable coefficients is presented. The novelty of this new difference scheme is that it Only requires 15 grid points and that it can be decoupled with two colors. The entire computational grid can be updated in two parallel subsweeps with the Gauss-Seidel type iterative method. This is compared with the known 19 point fourth order compact differenCe scheme which requires four colors to decouple the computational grid. Numerical results, with multigrid methods implemented on a shared memory parallel computer, are presented to compare the 15 point and the 19 point fourth order compact schemes.
Parallel Reconstruction Using Null Operations (PRUNO)

PubMed Central

Zhang, Jian; Liu, Chunlei; Moseley, Michael E.

2011-01-01

A novel iterative k-space data-driven technique, namely Parallel Reconstruction Using Null Operations (PRUNO), is presented for parallel imaging reconstruction. In PRUNO, both data calibration and image reconstruction are formulated into linear algebra problems based on a generalized system model. An optimal data calibration strategy is demonstrated by using Singular Value Decomposition (SVD). And an iterative conjugate- gradient approach is proposed to efficiently solve missing k-space samples during reconstruction. With its generalized formulation and precise mathematical model, PRUNO reconstruction yields good accuracy, flexibility, stability. Both computer simulation and in vivo studies have shown that PRUNO produces much better reconstruction quality than autocalibrating partially parallel acquisition (GRAPPA), especially under high accelerating rates. With the aid of PRUO reconstruction, ultra high accelerating parallel imaging can be performed with decent image quality. For example, we have done successful PRUNO reconstruction at a reduction factor of 6 (effective factor of 4.44) with 8 coils and only a few autocalibration signal (ACS) lines. PMID:21604290
Ultrascalable petaflop parallel supercomputer

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY

2010-07-20

A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.

Xyce

DOE Office of Scientific and Technical Information (OSTI.GOV)

Thomquist, Heidi K.; Fixel, Deborah A.; Fett, David Brian

The Xyce Parallel Electronic Simulator simulates electronic circuit behavior in DC, AC, HB, MPDE and transient mode using standard analog (DAE) and/or device (PDE) device models including several age and radiation aware devices. It supports a variety of computing platforms (both serial and parallel) computers. Lastly, it uses a variety of modern solution algorithms dynamic parallel load-balancing and iterative solvers.
On nonlinear finite element analysis in single-, multi- and parallel-processors

NASA Technical Reports Server (NTRS)

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
Parallelization of implicit finite difference schemes in computational fluid dynamics

NASA Technical Reports Server (NTRS)

Decker, Naomi H.; Naik, Vijay K.; Nicoules, Michel

1990-01-01

Implicit finite difference schemes are often the preferred numerical schemes in computational fluid dynamics, requiring less stringent stability bounds than the explicit schemes. Each iteration in an implicit scheme involves global data dependencies in the form of second and higher order recurrences. Efficient parallel implementations of such iterative methods are considerably more difficult and non-intuitive. The parallelization of the implicit schemes that are used for solving the Euler and the thin layer Navier-Stokes equations and that require inversions of large linear systems in the form of block tri-diagonal and/or block penta-diagonal matrices is discussed. Three-dimensional cases are emphasized and schemes that minimize the total execution time are presented. Partitioning and scheduling schemes for alleviating the effects of the global data dependencies are described. An analysis of the communication and the computation aspects of these methods is presented. The effect of the boundary conditions on the parallel schemes is also discussed.
A survey of packages for large linear systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu, Kesheng; Milne, Brent

2000-02-11

This paper evaluates portable software packages for the iterative solution of very large sparse linear systems on parallel architectures. While we cannot hope to tell individual users which package will best suit their needs, we do hope that our systematic evaluation provides essential unbiased information about the packages and the evaluation process may serve as an example on how to evaluate these packages. The information contained here include feature comparisons, usability evaluations and performance characterizations. This review is primarily focused on self-contained packages that can be easily integrated into an existing program and are capable of computing solutions to verymore » large sparse linear systems of equations. More specifically, it concentrates on portable parallel linear system solution packages that provide iterative solution schemes and related preconditioning schemes because iterative methods are more frequently used than competing schemes such as direct methods. The eight packages evaluated are: Aztec, BlockSolve,ISIS++, LINSOL, P-SPARSLIB, PARASOL, PETSc, and PINEAPL. Among the eight portable parallel iterative linear system solvers reviewed, we recommend PETSc and Aztec for most application programmers because they have well designed user interface, extensive documentation and very responsive user support. Both PETSc and Aztec are written in the C language and are callable from Fortran. For those users interested in using Fortran 90, PARASOL is a good alternative. ISIS++is a good alternative for those who prefer the C++ language. Both PARASOL and ISIS++ are relatively new and are continuously evolving. Thus their user interface may change. In general, those packages written in Fortran 77 are more cumbersome to use because the user may need to directly deal with a number of arrays of varying sizes. Languages like C++ and Fortran 90 offer more convenient data encapsulation mechanisms which make it easier to implement a clean and intuitive user interface. In addition to reviewing these portable parallel iterative solver packages, we also provide a more cursory assessment of a range of related packages, from specialized parallel preconditioners to direct methods for sparse linear systems.« less
Communication library for run-time visualization of distributed, asynchronous data

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rowlan, J.; Wightman, B.T.

1994-04-01

In this paper we present a method for collecting and visualizing data generated by a parallel computational simulation during run time. Data distributed across multiple processes is sent across parallel communication lines to a remote workstation, which sorts and queues the data for visualization. We have implemented our method in a set of tools called PORTAL (for Parallel aRchitecture data-TrAnsfer Library). The tools comprise generic routines for sending data from a parallel program (callable from either C or FORTRAN), a semi-parallel communication scheme currently built upon Unix Sockets, and a real-time connection to the scientific visualization program AVS. Our methodmore » is most valuable when used to examine large datasets that can be efficiently generated and do not need to be stored on disk. The PORTAL source libraries, detailed documentation, and a working example can be obtained by anonymous ftp from info.mcs.anl.gov from the file portal.tar.Z from the directory pub/portal.« less
Parallel Implementation of 3-D Iterative Reconstruction With Intra-Thread Update for the jPET-D4

NASA Astrophysics Data System (ADS)

Lam, Chih Fung; Yamaya, Taiga; Obi, Takashi; Yoshida, Eiji; Inadama, Naoko; Shibuya, Kengo; Nishikido, Fumihiko; Murayama, Hideo

2009-02-01

One way to speed-up iterative image reconstruction is by parallel computing with a computer cluster. However, as the number of computing threads increases, parallel efficiency decreases due to network transfer delay. In this paper, we proposed a method to reduce data transfer between computing threads by introducing an intra-thread update. The update factor is collected from each slave thread and a global image is updated as usual in the first K sub-iteration. In the rest of the sub-iterations, the global image is only updated at an interval which is controlled by a parameter L. In between that interval, the intra-thread update is carried out whereby an image update is performed in each slave thread locally. We investigated combinations of K and L parameters based on parallel implementation of RAMLA for the jPET-D4 scanner. Our evaluation used four workstations with a total of 16 slave threads. Each slave thread calculated a different set of LORs which are divided according to ring difference numbers. We assessed image quality of the proposed method with a hotspot simulation phantom. The figure of merit was the full-width-half-maximum of hotspots and the background normalized standard deviation. At an optimum K and L setting, we did not find significant change in the output images. We also applied the proposed method to a Hoffman phantom experiment and found the difference due to intra-thread update was negligible. With the intra-thread update, computation time could be reduced by about 23%.
New Bandwidth Efficient Parallel Concatenated Coding Schemes

NASA Technical Reports Server (NTRS)

Denedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.

1996-01-01

We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for throughputs 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.
Fiber optic cable-based high-resolution, long-distance VGA extenders

NASA Astrophysics Data System (ADS)

Rhee, Jin-Geun; Lee, Iksoo; Kim, Heejoon; Kim, Sungjoon; Koh, Yeon-Wan; Kim, Hoik; Lim, Jiseok; Kim, Chur; Kim, Jungwon

2013-02-01

Remote transfer of high-resolution video information finds more applications in detached display applications for large facilities such as theaters, sports complex, airports, and security facilities. Active optical cables (AOCs) provide a promising approach for enhancing both the transmittable resolution and distance that standard copper-based cables cannot reach. In addition to the standard digital formats such as HDMI, the high-resolution, long-distance transfer of VGA format signals is important for applications where high-resolution analog video ports should be also supported, such as military/defense applications and high-resolution video camera links. In this presentation we present the development of a compressionless, high-resolution (up to WUXGA, 1920x1200), long-distance (up to 2 km) VGA extenders based on serialized technique. We employed asynchronous serial transmission and clock regeneration techniques, which enables lower cost implementation of VGA extenders by removing the necessity for clock transmission and large memory at the receiver. Two 3.125-Gbps transceivers are used in parallel to meet the required maximum video data rate of 6.25 Gbps. As the data are transmitted asynchronously, 24-bit pixel clock time stamp is employed to regenerate video pixel clock accurately at the receiver side. In parallel to the video information, stereo audio and RS-232 control signals are transmitted as well.
NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism Correctness

DTIC Science & Technology

2011-12-16

other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a...Lab affiliates National Instruments, NEC, Nokia , NVIDIA, and Samsung. NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism...concurrently update x, some of these CAS’s will fail and those parallel loop iterations will recompute their updates to x and try again. Consider the parallel
Learning control system design based on 2-D theory - An application to parallel link manipulator

NASA Technical Reports Server (NTRS)

Geng, Z.; Carroll, R. L.; Lee, J. D.; Haynes, L. H.

1990-01-01

An approach to iterative learning control system design based on two-dimensional system theory is presented. A two-dimensional model for the iterative learning control system which reveals the connections between learning control systems and two-dimensional system theory is established. A learning control algorithm is proposed, and the convergence of learning using this algorithm is guaranteed by two-dimensional stability. The learning algorithm is applied successfully to the trajectory tracking control problem for a parallel link robot manipulator. The excellent performance of this learning algorithm is demonstrated by the computer simulation results.
Adaptive implicit-explicit and parallel element-by-element iteration schemes

NASA Technical Reports Server (NTRS)

Tezduyar, T. E.; Liou, J.; Nguyen, T.; Poole, S.

1989-01-01

Adaptive implicit-explicit (AIE) and grouped element-by-element (GEBE) iteration schemes are presented for the finite element solution of large-scale problems in computational mechanics and physics. The AIE approach is based on the dynamic arrangement of the elements into differently treated groups. The GEBE procedure, which is a way of rewriting the EBE formulation to make its parallel processing potential and implementation more clear, is based on the static arrangement of the elements into groups with no inter-element coupling within each group. Various numerical tests performed demonstrate the savings in the CPU time and memory.
Soft-output decoding algorithms in iterative decoding of turbo codes

NASA Technical Reports Server (NTRS)

Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.

1996-01-01

In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.
A Primer for Telemetry Interfacing in Accordance with NASA Standards Using Low Cost FPGAs

NASA Astrophysics Data System (ADS)

McCoy, Jake; Schultz, Ted; Tutt, James; Rogers, Thomas; Miles, Drew; McEntaffer, Randall

2016-03-01

Photon counting detector systems on sounding rocket payloads often require interfacing asynchronous outputs with a synchronously clocked telemetry (TM) stream. Though this can be handled with an on-board computer, there are several low cost alternatives including custom hardware, microcontrollers and field-programmable gate arrays (FPGAs). This paper outlines how a TM interface (TMIF) for detectors on a sounding rocket with asynchronous parallel digital output can be implemented using low cost FPGAs and minimal custom hardware. Low power consumption and high speed FPGAs are available as commercial off-the-shelf (COTS) products and can be used to develop the main component of the TMIF. Then, only a small amount of additional hardware is required for signal buffering and level translating. This paper also discusses how this system can be tested with a simulated TM chain in the small laboratory setting using FPGAs and COTS specialized data acquisition products.
Architecture and method for a burst buffer using flash technology

DOEpatents

Tzelnic, Percy; Faibish, Sorin; Gupta, Uday K.; Bent, John; Grider, Gary Alan; Chen, Hsing-bung

2016-03-15

A parallel supercomputing cluster includes compute nodes interconnected in a mesh of data links for executing an MPI job, and solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage. Each solid-state storage node presents a file system interface to the MPI job, and multiple MPI processes of the MPI job write the checkpoint data to a shared file in the solid-state storage in a strided fashion, and the solid-state storage node asynchronously migrates the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writes the checkpoint data to the magnetic disk storage in a sequential fashion.
Comparison of multihardware parallel implementations for a phase unwrapping algorithm

NASA Astrophysics Data System (ADS)

Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

2018-04-01

Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
Quinoa - Adaptive Computational Fluid Dynamics, 0.2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bakosi, Jozsef; Gonzalez, Francisco; Rogers, Brandon

Quinoa is a set of computational tools that enables research and numerical analysis in fluid dynamics. At this time it remains a test-bed to experiment with various algorithms using fully asynchronous runtime systems. Currently, Quinoa consists of the following tools: (1) Walker, a numerical integrator for systems of stochastic differential equations in time. It is a mathematical tool to analyze and design the behavior of stochastic differential equations. It allows the estimation of arbitrary coupled statistics and probability density functions and is currently used for the design of statistical moment approximations for multiple mixing materials in variable-density turbulence. (2) Inciter,more » an overdecomposition-aware finite element field solver for partial differential equations using 3D unstructured grids. Inciter is used to research asynchronous mesh-based algorithms and to experiment with coupling asynchronous to bulk-synchronous parallel code. Two planned new features of Inciter, compared to the previous release (LA-CC-16-015), to be implemented in 2017, are (a) a simple Navier-Stokes solver for ideal single-material compressible gases, and (b) solution-adaptive mesh refinement (AMR), which enables dynamically concentrating compute resources to regions with interesting physics. Using the NS-AMR problem we plan to explore how to scale such high-load-imbalance simulations, representative of large production multiphysics codes, to very large problems on very large computers using an asynchronous runtime system. (3) RNGTest, a test harness to subject random number generators to stringent statistical tests enabling quantitative ranking with respect to their quality and computational cost. (4) UnitTest, a unit test harness, running hundreds of tests per second, capable of testing serial, synchronous, and asynchronous functions. (5) MeshConv, a mesh file converter that can be used to convert 3D tetrahedron meshes from and to either of the following formats: Gmsh, (http://www.geuz.org/gmsh), Netgen, (http://sourceforge.net/apps/mediawiki/netgen-mesher), ExodusII, (http://sourceforge.net/projects/exodusii), HyperMesh, (http://www.altairhyperworks.com/product/HyperMesh).« less
Produsage in a/synchronous learner-led e-learning

NASA Astrophysics Data System (ADS)

Kazmer, Michelle M.

2011-04-01

Creating a successful produsage environment for a required course taught via e-learning requires analyzing various factors: the learning context, learner-led education in required classes, the structure of the class, and reflections and evaluations of each semester's iteration of the course. Taking a produsage perspective, this paper analyzes the long-term development of a required graduate-level course in information organization. The course is examined closely to show how its materials, assignments, technology, instruction, and culture contribute to a learner-led produsage environment and lasting knowledge creation. The analysis leads to implications for course design and working with learners to create knowledge that may be applied in multiple settings.
COMP Superscalar, an interoperable programming framework

NASA Astrophysics Data System (ADS)

Badia, Rosa M.; Conejero, Javier; Diaz, Carlos; Ejarque, Jorge; Lezzi, Daniele; Lordan, Francesc; Ramon-Cortes, Cristian; Sirvent, Raul

2015-12-01

COMPSs is a programming framework that aims to facilitate the parallelization of existing applications written in Java, C/C++ and Python scripts. For that purpose, it offers a simple programming model based on sequential development in which the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks and (ii) annotating them with annotations or standard Python decorators. A runtime system is in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources, which can be nodes in a cluster, clouds or grids. In cloud environments, COMPSs provides scalability and elasticity features allowing the dynamic provision of resources.
GraphReduce: Processing Large-Scale Graphs on Accelerator-Based Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Song, Shuaiwen; Agarwal, Kapil

2015-11-15

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host andmore » device.« less
Collective network for computer structures

DOEpatents

Blumrich, Matthias A; Coteus, Paul W; Chen, Dong; Gara, Alan; Giampapa, Mark E; Heidelberger, Philip; Hoenicke, Dirk; Takken, Todd E; Steinmacher-Burow, Burkhard D; Vranas, Pavlos M

2014-01-07

A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to the needs of a processing algorithm.

Collective network for computer structures

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Coteus, Paul W [Yorktown Heights, NY; Chen, Dong [Croton On Hudson, NY; Gara, Alan [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Takken, Todd E [Brewster, NY; Steinmacher-Burow, Burkhard D [Wernau, DE; Vranas, Pavlos M [Bedford Hills, NY

2011-08-16

A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices ate included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.
Asynchronous Replica Exchange Software for Grid and Heterogeneous Computing.

PubMed

Gallicchio, Emilio; Xia, Junchao; Flynn, William F; Zhang, Baofeng; Samlalsingh, Sade; Mentes, Ahmet; Levy, Ronald M

2015-11-01

Parallel replica exchange sampling is an extended ensemble technique often used to accelerate the exploration of the conformational ensemble of atomistic molecular simulations of chemical systems. Inter-process communication and coordination requirements have historically discouraged the deployment of replica exchange on distributed and heterogeneous resources. Here we describe the architecture of a software (named ASyncRE) for performing asynchronous replica exchange molecular simulations on volunteered computing grids and heterogeneous high performance clusters. The asynchronous replica exchange algorithm on which the software is based avoids centralized synchronization steps and the need for direct communication between remote processes. It allows molecular dynamics threads to progress at different rates and enables parameter exchanges among arbitrary sets of replicas independently from other replicas. ASyncRE is written in Python following a modular design conducive to extensions to various replica exchange schemes and molecular dynamics engines. Applications of the software for the modeling of association equilibria of supramolecular and macromolecular complexes on BOINC campus computational grids and on the CPU/MIC heterogeneous hardware of the XSEDE Stampede supercomputer are illustrated. They show the ability of ASyncRE to utilize large grids of desktop computers running the Windows, MacOS, and/or Linux operating systems as well as collections of high performance heterogeneous hardware devices.
Throughput analysis of the IEEE 802.4 token bus standard under heavy load

NASA Technical Reports Server (NTRS)

Pang, Joseph; Tobagi, Fouad

1987-01-01

It has become clear in the last few years that there is a trend towards integrated digital services. Parallel to the development of public Integrated Services Digital Network (ISDN) is service integration in the local area (e.g., a campus, a building, an aircraft). The types of services to be integrated depend very much on the specific local environment. However, applications tend to generate data traffic belonging to one of two classes. According to IEEE 802.4 terminology, the first major class of traffic is termed synchronous, such as packetized voice and data generated from other applications with real-time constraints, and the second class is called asynchronous which includes most computer data traffic such as file transfer or facsimile. The IEEE 802.4 token bus protocol which was designed to support both synchronous and asynchronous traffic is examined. The protocol is basically a timer-controlled token bus access scheme. By a suitable choice of the design parameters, it can be shown that access delay is bounded for synchronous traffic. As well, the bandwidth allocated to asynchronous traffic can be controlled. A throughput analysis of the protocol under heavy load with constant channel occupation of synchronous traffic and constant token-passing times is presented.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

NASA Astrophysics Data System (ADS)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.

2018-01-01

We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
Fusion of Asynchronous, Parallel, Unreliable Data Streams

DTIC Science & Technology

2010-09-01

channels that might be used. The two channels chosen for this study, galvanic skin response (GSR) and pulse rate, are convenient and reasonably well...vector as NA. The MDS software tool, PERMAP, uses this same abbreviation. The impact of the lack of information may vary depending on the situation...of how PERMAP (and MDS in general) functions when the input parameters are varied. That is outlined in this section; the impact of those choices is
Solution of a tridiagonal system of equations on the finite element machine

NASA Technical Reports Server (NTRS)

Bostic, S. W.

1984-01-01

Two parallel algorithms for the solution of tridiagonal systems of equations were implemented on the Finite Element Machine. The Accelerated Parallel Gauss method, an iterative method, and the Buneman algorithm, a direct method, are discussed and execution statistics are presented.
Fast Time and Space Parallel Algorithms for Solution of Parabolic Partial Differential Equations

NASA Technical Reports Server (NTRS)

Fijany, Amir

1993-01-01

In this paper, fast time- and Space -Parallel agorithms for solution of linear parabolic PDEs are developed. It is shown that the seemingly strictly serial iterations of the time-stepping procedure for solution of the problem can be completed decoupled.
Hybrid parallelization of the XTOR-2F code for the simulation of two-fluid MHD instabilities in tokamaks

NASA Astrophysics Data System (ADS)

Marx, Alain; Lütjens, Hinrich

2017-03-01

A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows to perform simulations with higher resolutions, previously forbidden because of memory limitations.
Modified Chebyshev Picard Iteration for Efficient Numerical Integration of Ordinary Differential Equations

NASA Astrophysics Data System (ADS)

Macomber, B.; Woollands, R. M.; Probe, A.; Younes, A.; Bai, X.; Junkins, J.

2013-09-01

Modified Chebyshev Picard Iteration (MCPI) is an iterative numerical method for approximating solutions of linear or non-linear Ordinary Differential Equations (ODEs) to obtain time histories of system state trajectories. Unlike other step-by-step differential equation solvers, the Runge-Kutta family of numerical integrators for example, MCPI approximates long arcs of the state trajectory with an iterative path approximation approach, and is ideally suited to parallel computation. Orthogonal Chebyshev Polynomials are used as basis functions during each path iteration; the integrations of the Picard iteration are then done analytically. Due to the orthogonality of the Chebyshev basis functions, the least square approximations are computed without matrix inversion; the coefficients are computed robustly from discrete inner products. As a consequence of discrete sampling and weighting adopted for the inner product definition, Runge phenomena errors are minimized near the ends of the approximation intervals. The MCPI algorithm utilizes a vector-matrix framework for computational efficiency. Additionally, all Chebyshev coefficients and integrand function evaluations are independent, meaning they can be simultaneously computed in parallel for further decreased computational cost. Over an order of magnitude speedup from traditional methods is achieved in serial processing, and an additional order of magnitude is achievable in parallel architectures. This paper presents a new MCPI library, a modular toolset designed to allow MCPI to be easily applied to a wide variety of ODE systems. Library users will not have to concern themselves with the underlying mathematics behind the MCPI method. Inputs are the boundary conditions of the dynamical system, the integrand function governing system behavior, and the desired time interval of integration, and the output is a time history of the system states over the interval of interest. Examples from the field of astrodynamics are presented to compare the output from the MCPI library to current state-of-practice numerical integration methods. It is shown that MCPI is capable of out-performing the state-of-practice in terms of computational cost and accuracy.
Parallel fast multipole boundary element method applied to computational homogenization

NASA Astrophysics Data System (ADS)

Ptaszny, Jacek

2018-01-01

In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problem is developed and applied to the computational homogenization of a solid containing spherical voids. The system of equation is solved by using the GMRES iterative solver. The boundary of the body is dicretized by using the quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of number of threads is examined.
Parallel iterative methods for sparse linear and nonlinear equations

NASA Technical Reports Server (NTRS)

Saad, Youcef

1989-01-01

As three-dimensional models are gaining importance, iterative methods will become almost mandatory. Among these, preconditioned Krylov subspace methods have been viewed as the most efficient and reliable, when solving linear as well as nonlinear systems of equations. There has been several different approaches taken to adapt iterative methods for supercomputers. Some of these approaches are discussed and the methods that deal more specifically with general unstructured sparse matrices, such as those arising from finite element methods, are emphasized.
The Battlefield Environment Division Modeling Framework (BMF). Part 1: Optimizing the Atmospheric Boundary Layer Environment Model for Cluster Computing

DTIC Science & Technology

2014-02-01

idle waiting for the wavefront to reach it. To overcome this, Reeve et al. (2001) 3 developed a scheme in analogy to the red-black Gauss - Seidel iterative ...understandable procedure calls. Parallelization of the SIMPLE iterative scheme with SIP used a red-black scheme similar to the red-black Gauss - Seidel ...scheme, the SIMPLE method, for pressure-velocity coupling. The result is a slowing convergence of the outer iterations . The red-black scheme excites a 2
Performance Evaluation in Network-Based Parallel Computing

NASA Technical Reports Server (NTRS)

Dezhgosha, Kamyar

1996-01-01

Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.
Scalable domain decomposition solvers for stochastic PDEs in high performance computing

DOE PAGES

Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...

2017-09-21

Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Scalable domain decomposition solvers for stochastic PDEs in high performance computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Desai, Ajit; Khalil, Mohammad; Pettit, Chris

Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Development of parallel algorithms for electrical power management in space applications

NASA Technical Reports Server (NTRS)

Berry, Frederick C.

1989-01-01

The application of parallel techniques for electrical power system analysis is discussed. The Newton-Raphson method of load flow analysis was used along with the decomposition-coordination technique to perform load flow analysis. The decomposition-coordination technique enables tasks to be performed in parallel by partitioning the electrical power system into independent local problems. Each independent local problem represents a portion of the total electrical power system on which a loan flow analysis can be performed. The load flow analysis is performed on these partitioned elements by using the Newton-Raphson load flow method. These independent local problems will produce results for voltage and power which can then be passed to the coordinator portion of the solution procedure. The coordinator problem uses the results of the local problems to determine if any correction is needed on the local problems. The coordinator problem is also solved by an iterative method much like the local problem. The iterative method for the coordination problem will also be the Newton-Raphson method. Therefore, each iteration at the coordination level will result in new values for the local problems. The local problems will have to be solved again along with the coordinator problem until some convergence conditions are met.
Solving Partial Differential Equations in a data-driven multiprocessor environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gaudiot, J.L.; Lin, C.M.; Hosseiniyar, M.

1988-12-31

Partial differential equations can be found in a host of engineering and scientific problems. The emergence of new parallel architectures has spurred research in the definition of parallel PDE solvers. Concurrently, highly programmable systems such as data-how architectures have been proposed for the exploitation of large scale parallelism. The implementation of some Partial Differential Equation solvers (such as the Jacobi method) on a tagged token data-flow graph is demonstrated here. Asynchronous methods (chaotic relaxation) are studied and new scheduling approaches (the Token No-Labeling scheme) are introduced in order to support the implementation of the asychronous methods in a data-driven environment.more » New high-level data-flow language program constructs are introduced in order to handle chaotic operations. Finally, the performance of the program graphs is demonstrated by a deterministic simulation of a message passing data-flow multiprocessor. An analysis of the overhead in the data-flow graphs is undertaken to demonstrate the limits of parallel operations in dataflow PDE program graphs.« less
Research in computer science

NASA Technical Reports Server (NTRS)

Ortega, J. M.

1986-01-01

Various graduate research activities in the field of computer science are reported. Among the topics discussed are: (1) failure probabilities in multi-version software; (2) Gaussian Elimination on parallel computers; (3) three dimensional Poisson solvers on parallel/vector computers; (4) automated task decomposition for multiple robot arms; (5) multi-color incomplete cholesky conjugate gradient methods on the Cyber 205; and (6) parallel implementation of iterative methods for solving linear equations.
Hierarchical Parallelism in Finite Difference Analysis of Heat Conduction

NASA Technical Reports Server (NTRS)

Padovan, Joseph; Krishna, Lala; Gute, Douglas

1997-01-01

Based on the concept of hierarchical parallelism, this research effort resulted in highly efficient parallel solution strategies for very large scale heat conduction problems. Overall, the method of hierarchical parallelism involves the partitioning of thermal models into several substructured levels wherein an optimal balance into various associated bandwidths is achieved. The details are described in this report. Overall, the report is organized into two parts. Part 1 describes the parallel modelling methodology and associated multilevel direct, iterative and mixed solution schemes. Part 2 establishes both the formal and computational properties of the scheme.
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware

PubMed Central

Zheng, Da; Burns, Randal; Szalay, Alexander S.

2013-01-01

We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads. PMID:24402052

Toward Millions of File System IOPS on Low-Cost, Commodity Hardware.

PubMed

Zheng, Da; Burns, Randal; Szalay, Alexander S

2013-01-01

We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads.
Cascaded VLSI neural network architecture for on-line learning

NASA Technical Reports Server (NTRS)

Thakoor, Anilkumar P. (Inventor); Duong, Tuan A. (Inventor); Daud, Taher (Inventor)

1992-01-01

High-speed, analog, fully-parallel, and asynchronous building blocks are cascaded for larger sizes and enhanced resolution. A hardware compatible algorithm permits hardware-in-the-loop learning despite limited weight resolution. A computation intensive feature classification application was demonstrated with this flexible hardware and new algorithm at high speed. This result indicates that these building block chips can be embedded as an application specific coprocessor for solving real world problems at extremely high data rates.
Data General Corporation Advanced Operating System/Virtual Storage (AOS/ VS). Revision 7.60

DTIC Science & Technology

1989-02-22

control list for each directory and data file. An access control list includes the users who can and cannot access files as well as the access...and any required data, it can -5- February 22, 1989 Final Evaluation Report Data General AOS/VS SYSTEM OVERVIEW operate asynchronously and in parallel...memory. The IOC can perform the data transfer without further interventiin from the CPU. The I/O channels interface with the processor or system
Cascaded VLSI neural network architecture for on-line learning

NASA Technical Reports Server (NTRS)

Duong, Tuan A. (Inventor); Daud, Taher (Inventor); Thakoor, Anilkumar P. (Inventor)

1995-01-01

High-speed, analog, fully-parallel and asynchronous building blocks are cascaded for larger sizes and enhanced resolution. A hardware-compatible algorithm permits hardware-in-the-loop learning despite limited weight resolution. A comparison-intensive feature classification application has been demonstrated with this flexible hardware and new algorithm at high speed. This result indicates that these building block chips can be embedded as application-specific-coprocessors for solving real-world problems at extremely high data rates.
Adapting high-level language programs for parallel processing using data flow

NASA Technical Reports Server (NTRS)

Standley, Hilda M.

1988-01-01

EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.
Computational Issues in Damping Identification for Large Scale Problems

NASA Technical Reports Server (NTRS)

Pilkey, Deborah L.; Roe, Kevin P.; Inman, Daniel J.

1997-01-01

Two damping identification methods are tested for efficiency in large-scale applications. One is an iterative routine, and the other a least squares method. Numerical simulations have been performed on multiple degree-of-freedom models to test the effectiveness of the algorithm and the usefulness of parallel computation for the problems. High Performance Fortran is used to parallelize the algorithm. Tests were performed using the IBM-SP2 at NASA Ames Research Center. The least squares method tested incurs high communication costs, which reduces the benefit of high performance computing. This method's memory requirement grows at a very rapid rate meaning that larger problems can quickly exceed available computer memory. The iterative method's memory requirement grows at a much slower pace and is able to handle problems with 500+ degrees of freedom on a single processor. This method benefits from parallelization, and significant speedup can he seen for problems of 100+ degrees-of-freedom.
GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Agarwal, Kapil; Song, Shuaiwen

2015-09-30

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the hostmore » and the device.« less
Parallel processing of real-time dynamic systems simulation on OSCAR (Optimally SCheduled Advanced multiprocessoR)

NASA Technical Reports Server (NTRS)

Kasahara, Hironori; Honda, Hiroki; Narita, Seinosuke

1989-01-01

Parallel processing of real-time dynamic systems simulation on a multiprocessor system named OSCAR is presented. In the simulation of dynamic systems, generally, the same calculation are repeated every time step. However, we cannot apply to Do-all or the Do-across techniques for parallel processing of the simulation since there exist data dependencies from the end of an iteration to the beginning of the next iteration and furthermore data-input and data-output are required every sampling time period. Therefore, parallelism inside the calculation required for a single time step, or a large basic block which consists of arithmetic assignment statements, must be used. In the proposed method, near fine grain tasks, each of which consists of one or more floating point operations, are generated to extract the parallelism from the calculation and assigned to processors by using optimal static scheduling at compile time in order to reduce large run time overhead caused by the use of near fine grain tasks. The practicality of the scheme is demonstrated on OSCAR (Optimally SCheduled Advanced multiprocessoR) which has been developed to extract advantageous features of static scheduling algorithms to the maximum extent.
Parallel heuristics for scalable community detection

DOE PAGES

Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth

2015-08-14

Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method ismore » also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing real speedups of up to 16x using 32 threads.« less
Scalable Evaluation of Polarization Energy and Associated Forces in Polarizable Molecular Dynamics: II.Towards Massively Parallel Computations using Smooth Particle Mesh Ewald.

PubMed

Lagardère, Louis; Lipparini, Filippo; Polack, Étienne; Stamm, Benjamin; Cancès, Éric; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip

2014-02-28

In this paper, we present a scalable and efficient implementation of point dipole-based polarizable force fields for molecular dynamics (MD) simulations with periodic boundary conditions (PBC). The Smooth Particle-Mesh Ewald technique is combined with two optimal iterative strategies, namely, a preconditioned conjugate gradient solver and a Jacobi solver in conjunction with the Direct Inversion in the Iterative Subspace for convergence acceleration, to solve the polarization equations. We show that both solvers exhibit very good parallel performances and overall very competitive timings in an energy-force computation needed to perform a MD step. Various tests on large systems are provided in the context of the polarizable AMOEBA force field as implemented in the newly developed Tinker-HP package which is the first implementation for a polarizable model making large scale experiments for massively parallel PBC point dipole models possible. We show that using a large number of cores offers a significant acceleration of the overall process involving the iterative methods within the context of spme and a noticeable improvement of the memory management giving access to very large systems (hundreds of thousands of atoms) as the algorithm naturally distributes the data on different cores. Coupled with advanced MD techniques, gains ranging from 2 to 3 orders of magnitude in time are now possible compared to non-optimized, sequential implementations giving new directions for polarizable molecular dynamics in periodic boundary conditions using massively parallel implementations.
Scalable Evaluation of Polarization Energy and Associated Forces in Polarizable Molecular Dynamics: II.Towards Massively Parallel Computations using Smooth Particle Mesh Ewald

PubMed Central

Lagardère, Louis; Lipparini, Filippo; Polack, Étienne; Stamm, Benjamin; Cancès, Éric; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip

2015-01-01

In this paper, we present a scalable and efficient implementation of point dipole-based polarizable force fields for molecular dynamics (MD) simulations with periodic boundary conditions (PBC). The Smooth Particle-Mesh Ewald technique is combined with two optimal iterative strategies, namely, a preconditioned conjugate gradient solver and a Jacobi solver in conjunction with the Direct Inversion in the Iterative Subspace for convergence acceleration, to solve the polarization equations. We show that both solvers exhibit very good parallel performances and overall very competitive timings in an energy-force computation needed to perform a MD step. Various tests on large systems are provided in the context of the polarizable AMOEBA force field as implemented in the newly developed Tinker-HP package which is the first implementation for a polarizable model making large scale experiments for massively parallel PBC point dipole models possible. We show that using a large number of cores offers a significant acceleration of the overall process involving the iterative methods within the context of spme and a noticeable improvement of the memory management giving access to very large systems (hundreds of thousands of atoms) as the algorithm naturally distributes the data on different cores. Coupled with advanced MD techniques, gains ranging from 2 to 3 orders of magnitude in time are now possible compared to non-optimized, sequential implementations giving new directions for polarizable molecular dynamics in periodic boundary conditions using massively parallel implementations. PMID:26512230
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE PAGES

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; ...

2017-09-14

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
Iterative load-balancing method with multigrid level relaxation for particle simulation with short-range interactions

NASA Astrophysics Data System (ADS)

Furuichi, Mikito; Nishiura, Daisuke

2017-10-01

We developed dynamic load-balancing algorithms for Particle Simulation Methods (PSM) involving short-range interactions, such as Smoothed Particle Hydrodynamics (SPH), Moving Particle Semi-implicit method (MPS), and Discrete Element method (DEM). These are needed to handle billions of particles modeled in large distributed-memory computer systems. Our method utilizes flexible orthogonal domain decomposition, allowing the sub-domain boundaries in the column to be different for each row. The imbalances in the execution time between parallel logical processes are treated as a nonlinear residual. Load-balancing is achieved by minimizing the residual within the framework of an iterative nonlinear solver, combined with a multigrid technique in the local smoother. Our iterative method is suitable for adjusting the sub-domain frequently by monitoring the performance of each computational process because it is computationally cheaper in terms of communication and memory costs than non-iterative methods. Numerical tests demonstrated the ability of our approach to handle workload imbalances arising from a non-uniform particle distribution, differences in particle types, or heterogeneous computer architecture which was difficult with previously proposed methods. We analyzed the parallel efficiency and scalability of our method using Earth simulator and K-computer supercomputer systems.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
Performance Analysis of Iterative Channel Estimation and Multiuser Detection in Multipath DS-CDMA Channels

NASA Astrophysics Data System (ADS)

Li, Husheng; Betz, Sharon M.; Poor, H. Vincent

2007-05-01

This paper examines the performance of decision feedback based iterative channel estimation and multiuser detection in channel coded aperiodic DS-CDMA systems operating over multipath fading channels. First, explicit expressions describing the performance of channel estimation and parallel interference cancellation based multiuser detection are developed. These results are then combined to characterize the evolution of the performance of a system that iterates among channel estimation, multiuser detection and channel decoding. Sufficient conditions for convergence of this system to a unique fixed point are developed.
Eliciting design patterns for e-learning systems

NASA Astrophysics Data System (ADS)

Retalis, Symeon; Georgiakakis, Petros; Dimitriadis, Yannis

2006-06-01

Design pattern creation, especially in the e-learning domain, is a highly complex process that has not been sufficiently studied and formalized. In this paper, we propose a systematic pattern development cycle, whose most important aspects focus on reverse engineering of existing systems in order to elicit features that are cross-validated through the use of appropriate, authentic scenarios. However, an iterative pattern process is proposed that takes advantage of multiple data sources, thus emphasizing a holistic view of the teaching learning processes. The proposed schema of pattern mining has been extensively validated for Asynchronous Network Supported Collaborative Learning (ANSCL) systems, as well as for other types of tools in a variety of scenarios, with promising results.
Efficiency Analysis of the Parallel Implementation of the SIMPLE Algorithm on Multiprocessor Computers

NASA Astrophysics Data System (ADS)

Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.

2017-12-01

This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
Simple and Flexible Self-Reproducing Structures in Asynchronous Cellular Automata and Their Dynamics

NASA Astrophysics Data System (ADS)

Huang, Xin; Lee, Jia; Yang, Rui-Long; Zhu, Qing-Sheng

2013-03-01

Self-reproduction on asynchronous cellular automata (ACAs) has attracted wide attention due to the evident artifacts induced by synchronous updating. Asynchronous updating, which allows cells to undergo transitions independently at random times, might be more compatible with the natural processes occurring at micro-scale, but the dark side of the coin is the increment in the complexity of an ACA in order to accomplish stable self-reproduction. This paper proposes a novel model of self-timed cellular automata (STCAs), a special type of ACAs, where unsheathed loops are able to duplicate themselves reliably in parallel. The removal of sheath cannot only allow various loops with more flexible and compact structures to replicate themselves, but also reduce the number of cell states of the STCA as compared to the previous model adopting sheathed loops [Y. Takada, T. Isokawa, F. Peper and N. Matsui, Physica D227, 26 (2007)]. The lack of sheath, on the other hand, often tends to cause much more complicated interactions among loops, when all of them struggle independently to stretch out their constructing arms at the same time. In particular, such intense collisions may even cause the emergence of a mess of twisted constructing arms in the cellular space. By using a simple and natural method, our self-reproducing loops (SRLs) are able to retract their arms successively, thereby disentangling from the mess successfully.
High performance interconnection between high data rate networks

NASA Technical Reports Server (NTRS)

Foudriat, E. C.; Maly, K.; Overstreet, C. M.; Zhang, L.; Sun, W.

1992-01-01

The bridge/gateway system needed to interconnect a wide range of computer networks to support a wide range of user quality-of-service requirements is discussed. The bridge/gateway must handle a wide range of message types including synchronous and asynchronous traffic, large, bursty messages, short, self-contained messages, time critical messages, etc. It is shown that messages can be classified into three basic classes, synchronous and large and small asynchronous messages. The first two require call setup so that packet identification, buffer handling, etc. can be supported in the bridge/gateway. Identification enables resequences in packet size. The third class is for messages which do not require call setup. Resequencing hardware based to handle two types of resequencing problems is presented. The first is for a virtual parallel circuit which can scramble channel bytes. The second system is effective in handling both synchronous and asynchronous traffic between networks with highly differing packet sizes and data rates. The two other major needs for the bridge/gateway are congestion and error control. A dynamic, lossless congestion control scheme which can easily support effective error correction is presented. Results indicate that the congestion control scheme provides close to optimal capacity under congested conditions. Under conditions where error may develop due to intervening networks which are not lossless, intermediate error recovery and correction takes 1/3 less time than equivalent end-to-end error correction under similar conditions.
Asynchronous parallel status comparator

DOEpatents

Arnold, Jeffrey W.; Hart, Mark M.

1992-01-01

Apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition.

Asynchronous parallel status comparator

DOEpatents

Arnold, J.W.; Hart, M.M.

1992-12-15

Disclosed is an apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition. 4 figs.
Parallel Finite Element Domain Decomposition for Structural/Acoustic Analysis

NASA Technical Reports Server (NTRS)

Nguyen, Duc T.; Tungkahotara, Siroj; Watson, Willie R.; Rajan, Subramaniam D.

2005-01-01

A domain decomposition (DD) formulation for solving sparse linear systems of equations resulting from finite element analysis is presented. The formulation incorporates mixed direct and iterative equation solving strategics and other novel algorithmic ideas that are optimized to take advantage of sparsity and exploit modern computer architecture, such as memory and parallel computing. The most time consuming part of the formulation is identified and the critical roles of direct sparse and iterative solvers within the framework of the formulation are discussed. Experiments on several computer platforms using several complex test matrices are conducted using software based on the formulation. Small-scale structural examples are used to validate thc steps in the formulation and large-scale (l,000,000+ unknowns) duct acoustic examples are used to evaluate the ORIGIN 2000 processors, and a duster of 6 PCs (running under the Windows environment). Statistics show that the formulation is efficient in both sequential and parallel computing environmental and that the formulation is significantly faster and consumes less memory than that based on one of the best available commercialized parallel sparse solvers.
Microwave-based reaction screening: tandem retro-Diels-Alder/Diels-Alder cycloadditions of o-quinol dimers.

PubMed

Dong, Suwei; Cahill, Katharine J; Kang, Moon-Il; Colburn, Nancy H; Henrich, Curtis J; Wilson, Jennifer A; Beutler, John A; Johnson, Richard P; Porco, John A

2011-11-04

We have accomplished a parallel screen of cycloaddition partners for o-quinols utilizing a plate-based microwave system. Microwave irradiation improves the efficiency of retro-Diels-Alder/Diels-Alder cascades of o-quinol dimers which generally proceed in a diastereoselective fashion. Computational studies indicate that asynchronous transition states are favored in Diels-Alder cycloadditions of o-quinols. Subsequent biological evaluation of a collection of cycloadducts has identified an inhibitor of activator protein-1 (AP-1), an oncogenic transcription factor.
Workshop on Solid State Switches for Pulsed Power, held January 12-14, 1983 at Tamarron, Colorado

DTIC Science & Technology

1983-05-31

of its anticipated scalabil- ity. However, the projected performance of other types of dis- crete switches made their continued exploration and...linking of "asynchronous AC power grids. Some present installations arid projected increases are showr. in Table 2. A new commercial power application...Average Power 62.5 KW 160 KW Device RBDT (RSR) T60R SCR 2N3873 Arra , 6 Series 10 Parallel-20 Series Table 18. Applications of solid state pulse
Extended estimator approach for 2×2 games and its mapping to the Ising Hamiltonian

NASA Astrophysics Data System (ADS)

Ariosa, D.; Fort, H.

2005-01-01

We consider a system of adaptive self-interested agents interacting by playing an iterated pairwise prisoner’s dilemma (PD) game. Each player has two options: either cooperate (C) or defect (D). Agents have no (long term) memory to reciprocate nor identifying tags to distinguish C from D. We show how their 16 possible elementary Markovian (one-step memory) strategies can be cast in a simple general formalism in terms of an estimator of expected utilities Δ* . This formalism is helpful to map a subset of these strategies into an Ising Hamiltonian in a straightforward way. This connection in turn serves to shed light on the evolution of the iterated games played by agents, which can represent a broad variety of individuals from firms of a market to species coexisting in an ecosystem. Additionally, this magnetic description may be useful to introduce noise in a natural and simple way. The equilibrium states reached by the system depend strongly on whether the dynamics are synchronous or asynchronous and also on the system connectivity.
Performance and capacity analysis of Poisson photon-counting based Iter-PIC OCDMA systems.

PubMed

Li, Lingbin; Zhou, Xiaolin; Zhang, Rong; Zhang, Dingchen; Hanzo, Lajos

2013-11-04

In this paper, an iterative parallel interference cancellation (Iter-PIC) technique is developed for optical code-division multiple-access (OCDMA) systems relying on shot-noise limited Poisson photon-counting reception. The novel semi-analytical tool of extrinsic information transfer (EXIT) charts is used for analysing both the bit error rate (BER) performance as well as the channel capacity of these systems and the results are verified by Monte Carlo simulations. The proposed Iter-PIC OCDMA system is capable of achieving two orders of magnitude BER improvements and a 0.1 nats of capacity improvement over the conventional chip-level OCDMA systems at a coding rate of 1/10.
Analysis of Monte Carlo accelerated iterative methods for sparse linear systems: Analysis of Monte Carlo accelerated iterative methods for sparse linear systems

DOE PAGES

Benzi, Michele; Evans, Thomas M.; Hamilton, Steven P.; ...

2017-03-05

Here, we consider hybrid deterministic-stochastic iterative algorithms for the solution of large, sparse linear systems. Starting from a convergent splitting of the coefficient matrix, we analyze various types of Monte Carlo acceleration schemes applied to the original preconditioned Richardson (stationary) iteration. We expect that these methods will have considerable potential for resiliency to faults when implemented on massively parallel machines. We also establish sufficient conditions for the convergence of the hybrid schemes, and we investigate different types of preconditioners including sparse approximate inverses. Numerical experiments on linear systems arising from the discretization of partial differential equations are presented.
Extending substructure based iterative solvers to multiple load and repeated analyses

NASA Technical Reports Server (NTRS)

Farhat, Charbel

1993-01-01

Direct solvers currently dominate commercial finite element structural software, but do not scale well in the fine granularity regime targeted by emerging parallel processors. Substructure based iterative solvers--often called also domain decomposition algorithms--lend themselves better to parallel processing, but must overcome several obstacles before earning their place in general purpose structural analysis programs. One such obstacle is the solution of systems with many or repeated right hand sides. Such systems arise, for example, in multiple load static analyses and in implicit linear dynamics computations. Direct solvers are well-suited for these problems because after the system matrix has been factored, the multiple or repeated solutions can be obtained through relatively inexpensive forward and backward substitutions. On the other hand, iterative solvers in general are ill-suited for these problems because they often must restart from scratch for every different right hand side. In this paper, we present a methodology for extending the range of applications of domain decomposition methods to problems with multiple or repeated right hand sides. Basically, we formulate the overall problem as a series of minimization problems over K-orthogonal and supplementary subspaces, and tailor the preconditioned conjugate gradient algorithm to solve them efficiently. The resulting solution method is scalable, whereas direct factorization schemes and forward and backward substitution algorithms are not. We illustrate the proposed methodology with the solution of static and dynamic structural problems, and highlight its potential to outperform forward and backward substitutions on parallel computers. As an example, we show that for a linear structural dynamics problem with 11640 degrees of freedom, every time-step beyond time-step 15 is solved in a single iteration and consumes 1.0 second on a 32 processor iPSC-860 system; for the same problem and the same parallel processor, a pair of forward/backward substitutions at each step consumes 15.0 seconds.
Domain decomposition preconditioners for the spectral collocation method

NASA Technical Reports Server (NTRS)

Quarteroni, Alfio; Sacchilandriani, Giovanni

1988-01-01

Several block iteration preconditioners are proposed and analyzed for the solution of elliptic problems by spectral collocation methods in a region partitioned into several rectangles. It is shown that convergence is achieved with a rate which does not depend on the polynomial degree of the spectral solution. The iterative methods here presented can be effectively implemented on multiprocessor systems due to their high degree of parallelism.
Self-Consistent Scheme for Spike-Train Power Spectra in Heterogeneous Sparse Networks.

PubMed

Pena, Rodrigo F O; Vellmer, Sebastian; Bernardi, Davide; Roque, Antonio C; Lindner, Benjamin

2018-01-01

Recurrent networks of spiking neurons can be in an asynchronous state characterized by low or absent cross-correlations and spike statistics which resemble those of cortical neurons. Although spatial correlations are negligible in this state, neurons can show pronounced temporal correlations in their spike trains that can be quantified by the autocorrelation function or the spike-train power spectrum. Depending on cellular and network parameters, correlations display diverse patterns (ranging from simple refractory-period effects and stochastic oscillations to slow fluctuations) and it is generally not well-understood how these dependencies come about. Previous work has explored how the single-cell correlations in a homogeneous network (excitatory and inhibitory integrate-and-fire neurons with nearly balanced mean recurrent input) can be determined numerically from an iterative single-neuron simulation. Such a scheme is based on the fact that every neuron is driven by the network noise (i.e., the input currents from all its presynaptic partners) but also contributes to the network noise, leading to a self-consistency condition for the input and output spectra. Here we first extend this scheme to homogeneous networks with strong recurrent inhibition and a synaptic filter, in which instabilities of the previous scheme are avoided by an averaging procedure. We then extend the scheme to heterogeneous networks in which (i) different neural subpopulations (e.g., excitatory and inhibitory neurons) have different cellular or connectivity parameters; (ii) the number and strength of the input connections are random (Erdős-Rényi topology) and thus different among neurons. In all heterogeneous cases, neurons are lumped in different classes each of which is represented by a single neuron in the iterative scheme; in addition, we make a Gaussian approximation of the input current to the neuron. These approximations seem to be justified over a broad range of parameters as indicated by comparison with simulation results of large recurrent networks. Our method can help to elucidate how network heterogeneity shapes the asynchronous state in recurrent neural networks.
Parallel programming of gradient-based iterative image reconstruction schemes for optical tomography.

PubMed

Hielscher, Andreas H; Bartel, Sebastian

2004-02-01

Optical tomography (OT) is a fast developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computational-intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.
Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme

DOE PAGES

Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.; ...

2016-11-07

Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less
Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme

DOE Office of Scientific and Technical Information (OSTI.GOV)

Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.

Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less
Dissipation, Voltage Profile and Levy Dragon in a Special Ladder Network

ERIC Educational Resources Information Center

Ucak, C.

2009-01-01

A ladder network constructed by an elementary two-terminal network consisting of a parallel resistor-inductor block in series with a parallel resistor-capacitor block sometimes is said to have a non-dispersive dissipative response. This special ladder network is created iteratively by replacing the elementary two-terminal network in place of the…
Tuning iteration space slicing based tiled multi-core code implementing Nussinov's RNA folding.

PubMed

Palkowski, Marek; Bielecki, Wlodzimierz

2018-01-15

RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithms is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov's recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov's RNA folding. Such techniques are within the iteration space slicing framework - the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem at generating parallel tiled code is defining a proper tile size and tile dimension which impact parallelism degree and code locality. To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (parameters are variables defining tile size). With this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure and then derive a general affine model, which describes all integer factors available in expressions of those codes. Using this model and known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find unknown integers in this model for each integer factor available in the same fixed tiled code position and replace in this code expressions, including integer factors, with those including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover in a given search space the best tile size and tile dimension maximizing target code performance. For a given search space, the presented approach allows us to choose the best tile size and tile dimension in parallel tiled code implementing Nussinov's RNA folding. Experimental results, received on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is bigger than 2500.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*

PubMed Central

Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.

2012-01-01

This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
Tempest: Accelerated MS/MS database search software for heterogeneous computing platforms

PubMed Central

Adamo, Mark E.; Gerber, Scott A.

2017-01-01

MS/MS database search algorithms derive a set of candidate peptide sequences from in-silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU generates peptide candidates that are asynchronously sent to a discrete GPU to be scored against experimental spectra in parallel (Milloy et al., 2012). The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. PMID:27603022
Pteros 2.0: Evolution of the fast parallel molecular analysis library for C++ and python.

PubMed

Yesylevskyy, Semen O

2015-07-15

Pteros is the high-performance open-source library for molecular modeling and analysis of molecular dynamics trajectories. Starting from version 2.0 Pteros is available for C++ and Python programming languages with very similar interfaces. This makes it suitable for writing complex reusable programs in C++ and simple interactive scripts in Python alike. New version improves the facilities for asynchronous trajectory reading and parallel execution of analysis tasks by introducing analysis plugins which could be written in either C++ or Python in completely uniform way. The high level of abstraction provided by analysis plugins greatly simplifies prototyping and implementation of complex analysis algorithms. Pteros is available for free under Artistic License from http://sourceforge.net/projects/pteros/. © 2015 Wiley Periodicals, Inc.
The multigrid preconditioned conjugate gradient method

NASA Technical Reports Server (NTRS)

Tatebe, Osamu

1993-01-01

A multigrid preconditioned conjugate gradient method (MGCG method), which uses the multigrid method as a preconditioner of the PCG method, is proposed. The multigrid method has inherent high parallelism and improves convergence of long wavelength components, which is important in iterative methods. By using this method as a preconditioner of the PCG method, an efficient method with high parallelism and fast convergence is obtained. First, it is considered a necessary condition of the multigrid preconditioner in order to satisfy requirements of a preconditioner of the PCG method. Next numerical experiments show a behavior of the MGCG method and that the MGCG method is superior to both the ICCG method and the multigrid method in point of fast convergence and high parallelism. This fast convergence is understood in terms of the eigenvalue analysis of the preconditioned matrix. From this observation of the multigrid preconditioner, it is realized that the MGCG method converges in very few iterations and the multigrid preconditioner is a desirable preconditioner of the conjugate gradient method.
Development of iterative techniques for the solution of unsteady compressible viscous flows

NASA Technical Reports Server (NTRS)

Sankar, Lakshmi N.; Hixon, Duane

1991-01-01

Efficient iterative solution methods are being developed for the numerical solution of two- and three-dimensional compressible Navier-Stokes equations. Iterative time marching methods have several advantages over classical multi-step explicit time marching schemes, and non-iterative implicit time marching schemes. Iterative schemes have better stability characteristics than non-iterative explicit and implicit schemes. Thus, the extra work required by iterative schemes can also be designed to perform efficiently on current and future generation scalable, missively parallel machines. An obvious candidate for iteratively solving the system of coupled nonlinear algebraic equations arising in CFD applications is the Newton method. Newton's method was implemented in existing finite difference and finite volume methods. Depending on the complexity of the problem, the number of Newton iterations needed per step to solve the discretized system of equations can, however, vary dramatically from a few to several hundred. Another popular approach based on the classical conjugate gradient method, known as the GMRES (Generalized Minimum Residual) algorithm is investigated. The GMRES algorithm was used in the past by a number of researchers for solving steady viscous and inviscid flow problems with considerable success. Here, the suitability of this algorithm is investigated for solving the system of nonlinear equations that arise in unsteady Navier-Stokes solvers at each time step. Unlike the Newton method which attempts to drive the error in the solution at each and every node down to zero, the GMRES algorithm only seeks to minimize the L2 norm of the error. In the GMRES algorithm the changes in the flow properties from one time step to the next are assumed to be the sum of a set of orthogonal vectors. By choosing the number of vectors to a reasonably small value N (between 5 and 20) the work required for advancing the solution from one time step to the next may be kept to (N+1) times that of a noniterative scheme. Many of the operations required by the GMRES algorithm such as matrix-vector multiplies, matrix additions and subtractions can all be vectorized and parallelized efficiently.

LSRN: A PARALLEL ITERATIVE SOLVER FOR STRONGLY OVER- OR UNDERDETERMINED SYSTEMS*

PubMed Central

Meng, Xiangrui; Saunders, Michael A.; Mahoney, Michael W.

2014-01-01

We describe a parallel iterative least squares solver named LSRN that is based on random normal projection. LSRN computes the min-length solution to minx∈ℝn ‖Ax − b‖2, where A ∈ ℝm × n with m ≫ n or m ≪ n, and where A may be rank-deficient. Tikhonov regularization may also be included. Since A is involved only in matrix-matrix and matrix-vector multiplications, it can be a dense or sparse matrix or a linear operator, and LSRN automatically speeds up when A is sparse or a fast linear operator. The preconditioning phase consists of a random normal projection, which is embarrassingly parallel, and a singular value decomposition of size ⌈γ min(m, n)⌉ × min(m, n), where γ is moderately larger than 1, e.g., γ = 2. We prove that the preconditioned system is well-conditioned, with a strong concentration result on the extreme singular values, and hence that the number of iterations is fully predictable when we apply LSQR or the Chebyshev semi-iterative method. As we demonstrate, the Chebyshev method is particularly efficient for solving large problems on clusters with high communication cost. Numerical results show that on a shared-memory machine, LSRN is very competitive with LAPACK’s DGELSD and a fast randomized least squares solver called Blendenpik on large dense problems, and it outperforms the least squares solver from SuiteSparseQR on sparse problems without sparsity patterns that can be exploited to reduce fill-in. Further experiments show that LSRN scales well on an Amazon Elastic Compute Cloud cluster. PMID:25419094
Performance Implications of Synchronization Support for Parallel FORTRAN Programs

DTIC Science & Technology

1991-06-17

applications we used in this study are BDNA and FLO52. BDNA is a molecular dy- I namics simulator for biomolecules in water and it uses ordinary...parallelism structures and loop granularity. In the BDNA program, most of the parallel loops are not nested and the iterations are 200-1000 instructions long...are of concern. The BDNA curve in Figure 21 shows that for this program only 17% of all 32 I I 100 BDNA -4 FLO52 -I 80 3 CumuilatQe percentage of3
A highly parallel multigrid-like method for the solution of the Euler equations

NASA Technical Reports Server (NTRS)

Tuminaro, Ray S.

1989-01-01

We consider a highly parallel multigrid-like method for the solution of the two dimensional steady Euler equations. The new method, introduced as filtering multigrid, is similar to a standard multigrid scheme in that convergence on the finest grid is accelerated by iterations on coarser grids. In the filtering method, however, additional fine grid subproblems are processed concurrently with coarse grid computations to further accelerate convergence. These additional problems are obtained by splitting the residual into a smooth and an oscillatory component. The smooth component is then used to form a coarse grid problem (similar to standard multigrid) while the oscillatory component is used for a fine grid subproblem. The primary advantage in the filtering approach is that fewer iterations are required and that most of the additional work per iteration can be performed in parallel with the standard coarse grid computations. We generalize the filtering algorithm to a version suitable for nonlinear problems. We emphasize that this generalization is conceptually straight-forward and relatively easy to implement. In particular, no explicit linearization (e.g., formation of Jacobians) needs to be performed (similar to the FAS multigrid approach). We illustrate the nonlinear version by applying it to the Euler equations, and presenting numerical results. Finally, a performance evaluation is made based on execution time models and convergence information obtained from numerical experiments.
Microwave-Based Reaction Screening: Tandem Retro-Diels-Alder/Diels-Alder Cycloadditions of ortho-Quinol Dimers

PubMed Central

Dong, Suwei; Cahill, Kath arine J.; Kang, Moon -Il; Colburn, Nancy H.; Henrich, Curtis J.; Wilson, Jennifer A.; Beutler, John A.; Johnson, Richard P.; Porco, John A.

2011-01-01

We have accomplished a parallel screen of cycloaddition partners for ortho-quinols utilizing a plate-based microwave system. Microwave irradiation improves the efficiency of retro-Diels-Alder/Diels-Alder cascades of ortho-quinol dimers which generally proceed in a diastereoselective fashion. Computational studies indicate that asynchronous transition states are favored in Diels-Alder cycloadditions of ortho-quinols. Subsequent biological evaluation of a collection of cycloadducts has identified an inhibitor of activator protein-1 (AP-1), an oncogenic transcription factor. PMID:21942286
Nonlinear and parallel algorithms for finite element discretizations of the incompressible Navier-Stokes equations

NASA Astrophysics Data System (ADS)

Arteaga, Santiago Egido

1998-12-01

The steady-state Navier-Stokes equations are of considerable interest because they are used to model numerous common physical phenomena. The applications encountered in practice often involve small viscosities and complicated domain geometries, and they result in challenging problems in spite of the vast attention that has been dedicated to them. In this thesis we examine methods for computing the numerical solution of the primitive variable formulation of the incompressible equations on distributed memory parallel computers. We use the Galerkin method to discretize the differential equations, although most results are stated so that they apply also to stabilized methods. We also reformulate some classical results in a single framework and discuss some issues frequently dismissed in the literature, such as the implementation of pressure space basis and non- homogeneous boundary values. We consider three nonlinear methods: Newton's method, Oseen's (or Picard) iteration, and sequences of Stokes problems. All these iterative nonlinear methods require solving a linear system at every step. Newton's method has quadratic convergence while that of the others is only linear; however, we obtain theoretical bounds showing that Oseen's iteration is more robust, and we confirm it experimentally. In addition, although Oseen's iteration usually requires more iterations than Newton's method, the linear systems it generates tend to be simpler and its overall costs (in CPU time) are lower. The Stokes problems result in linear systems which are easier to solve, but its convergence is much slower, so that it is competitive only for large viscosities. Inexact versions of these methods are studied, and we explain why the best timings are obtained using relatively modest error tolerances in solving the corresponding linear systems. We also present a new damping optimization strategy based on the quadratic nature of the Navier-Stokes equations, which improves the robustness of all the linearization strategies considered and whose computational cost is negligible. The algebraic properties of these systems depend on both the discretization and nonlinear method used. We study in detail the positive definiteness and skewsymmetry of the advection submatrices (essentially, convection-diffusion problems). We propose a discretization based on a new trilinear form for Newton's method. We solve the linear systems using three Krylov subspace methods, GMRES, QMR and TFQMR, and compare the advantages of each. Our emphasis is on parallel algorithms, and so we consider preconditioners suitable for parallel computers such as line variants of the Jacobi and Gauss- Seidel methods, alternating direction implicit methods, and Chebyshev and least squares polynomial preconditioners. These work well for moderate viscosities (moderate Reynolds number). For small viscosities we show that effective parallel solution of the advection subproblem is a critical factor to improve performance. Implementation details on a CM-5 are presented.
What can neuromorphic event-driven precise timing add to spike-based pattern recognition?

PubMed

Akolkar, Himanshu; Meyer, Cedric; Clady, Zavier; Marre, Olivier; Bartolozzi, Chiara; Panzeri, Stefano; Benosman, Ryad

2015-03-01

This letter introduces a study to precisely measure what an increase in spike timing precision can add to spike-driven pattern recognition algorithms. The concept of generating spikes from images by converting gray levels into spike timings is currently at the basis of almost every spike-based modeling of biological visual systems. The use of images naturally leads to generating incorrect artificial and redundant spike timings and, more important, also contradicts biological findings indicating that visual processing is massively parallel, asynchronous with high temporal resolution. A new concept for acquiring visual information through pixel-individual asynchronous level-crossing sampling has been proposed in a recent generation of asynchronous neuromorphic visual sensors. Unlike conventional cameras, these sensors acquire data not at fixed points in time for the entire array but at fixed amplitude changes of their input, resulting optimally sparse in space and time-pixel individually and precisely timed only if new, (previously unknown) information is available (event based). This letter uses the high temporal resolution spiking output of neuromorphic event-based visual sensors to show that lowering time precision degrades performance on several recognition tasks specifically when reaching the conventional range of machine vision acquisition frequencies (30-60 Hz). The use of information theory to characterize separability between classes for each temporal resolution shows that high temporal acquisition provides up to 70% more information that conventional spikes generated from frame-based acquisition as used in standard artificial vision, thus drastically increasing the separability between classes of objects. Experiments on real data show that the amount of information loss is correlated with temporal precision. Our information-theoretic study highlights the potentials of neuromorphic asynchronous visual sensors for both practical applications and theoretical investigations. Moreover, it suggests that representing visual information as a precise sequence of spike times as reported in the retina offers considerable advantages for neuro-inspired visual computations.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Turner, A.; Davis, A.; University of Wisconsin-Madison, Madison, WI 53706

CCFE perform Monte-Carlo transport simulations on large and complex tokamak models such as ITER. Such simulations are challenging since streaming and deep penetration effects are equally important. In order to make such simulations tractable, both variance reduction (VR) techniques and parallel computing are used. It has been found that the application of VR techniques in such models significantly reduces the efficiency of parallel computation due to 'long histories'. VR in MCNP can be accomplished using energy-dependent weight windows. The weight window represents an 'average behaviour' of particles, and large deviations in the arriving weight of a particle give rise tomore » extreme amounts of splitting being performed and a long history. When running on parallel clusters, a long history can have a detrimental effect on the parallel efficiency - if one process is computing the long history, the other CPUs complete their batch of histories and wait idle. Furthermore some long histories have been found to be effectively intractable. To combat this effect, CCFE has developed an adaptation of MCNP which dynamically adjusts the WW where a large weight deviation is encountered. The method effectively 'de-optimises' the WW, reducing the VR performance but this is offset by a significant increase in parallel efficiency. Testing with a simple geometry has shown the method does not bias the result. This 'long history method' has enabled CCFE to significantly improve the performance of MCNP calculations for ITER on parallel clusters, and will be beneficial for any geometry combining streaming and deep penetration effects. (authors)« less
Fast in-memory elastic full-waveform inversion using consumer-grade GPUs

NASA Astrophysics Data System (ADS)

Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge

2017-04-01

Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.
Parallel processing in finite element structural analysis

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1987-01-01

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

PubMed Central

De Sa, Christopher; Feldman, Matthew; Ré, Christopher; Olukotun, Kunle

2018-01-01

Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems. PMID:29391770
Poster — Thur Eve — 74: Distributed, asynchronous, reactive dosimetric and outcomes analysis using DICOMautomaton

DOE Office of Scientific and Technical Information (OSTI.GOV)

Clark, Haley; BC Cancer Agency, Surrey, B.C.; BC Cancer Agency, Vancouver, B.C.

2014-08-15

Many have speculated about the future of computational technology in clinical radiation oncology. It has been advocated that the next generation of computational infrastructure will improve on the current generation by incorporating richer aspects of automation, more heavily and seamlessly featuring distributed and parallel computation, and providing more flexibility toward aggregate data analysis. In this report we describe how a recently created — but currently existing — analysis framework (DICOMautomaton) incorporates these aspects. DICOMautomaton supports a variety of use cases but is especially suited for dosimetric outcomes correlation analysis, investigation and comparison of radiotherapy treatment efficacy, and dose-volume computation. Wemore » describe: how it overcomes computational bottlenecks by distributing workload across a network of machines; how modern, asynchronous computational techniques are used to reduce blocking and avoid unnecessary computation; and how issues of out-of-date data are addressed using reactive programming techniques and data dependency chains. We describe internal architecture of the software and give a detailed demonstration of how DICOMautomaton could be used to search for correlations between dosimetric and outcomes data.« less
3D Hybrid Simulations of Interactions of High-Velocity Plasmoids with Obstacles

NASA Astrophysics Data System (ADS)

Omelchenko, Y. A.; Weber, T. E.; Smith, R. J.

2015-11-01

Interactions of fast plasma streams and objects with magnetic obstacles (dipoles, mirrors, etc) lie at the core of many space and laboratory plasma phenomena ranging from magnetoshells and solar wind interactions with planetary magnetospheres to compact fusion plasmas (spheromaks and FRCs) to astrophysics-in-lab experiments. Properly modeling ion kinetic, finite-Larmor radius and Hall effects is essential for describing large-scale plasma dynamics, turbulence and heating in complex magnetic field geometries. Using an asynchronous parallel hybrid code, HYPERS, we conduct 3D hybrid (particle-in-cell ion, fluid electron) simulations of such interactions under realistic conditions that include magnetic flux coils, ion-ion collisions and the Chodura resistivity. HYPERS does not step simulation variables synchronously in time but instead performs time integration by executing asynchronous discrete events: updates of particles and fields carried out as frequently as dictated by local physical time scales. Simulations are compared with data from the MSX experiment which studies the physics of magnetized collisionless shocks through the acceleration and subsequent stagnation of FRC plasmoids against a strong magnetic mirror and flux-conserving boundary.
Evolution of a minimal parallel programming model

DOE PAGES

Lusk, Ewing; Butler, Ralph; Pieper, Steven C.

2017-04-30

Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generalitymore » and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.« less
An Analysis of Performance Enhancement Techniques for Overset Grid Applications

NASA Technical Reports Server (NTRS)

Djomehri, J. J.; Biswas, R.; Potsdam, M.; Strawn, R. C.; Biegel, Bryan (Technical Monitor)

2002-01-01

The overset grid methodology has significantly reduced time-to-solution of high-fidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process resolves the geometrical complexity of the problem domain by using separately generated but overlapping structured discretization grids that periodically exchange information through interpolation. However, high performance computations of such large-scale realistic applications must be handled efficiently on state-of-the-art parallel supercomputers. This paper analyzes the effects of various performance enhancement techniques on the parallel efficiency of an overset grid Navier-Stokes CFD application running on an SGI Origin2000 machine. Specifically, the role of asynchronous communication, grid splitting, and grid grouping strategies are presented and discussed. Results indicate that performance depends critically on the level of latency hiding and the quality of load balancing across the processors.
Advanced Technology and Mitigation (ATDM) SPARC Re-Entry Code Fiscal Year 2017 Progress and Accomplishments for ECP.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Crozier, Paul; Howard, Micah; Rider, William J.

The SPARC (Sandia Parallel Aerodynamics and Reentry Code) will provide nuclear weapon qualification evidence for the random vibration and thermal environments created by re-entry of a warhead into the earth’s atmosphere. SPARC incorporates the innovative approaches of ATDM projects on several fronts including: effective harnessing of heterogeneous compute nodes using Kokkos, exascale-ready parallel scalability through asynchronous multi-tasking, uncertainty quantification through Sacado integration, implementation of state-of-the-art reentry physics and multiscale models, use of advanced verification and validation methods, and enabling of improved workflows for users. SPARC is being developed primarily for the Department of Energy nuclear weapon program, with additional developmentmore » and use of the code is being supported by the Department of Defense for conventional weapons programs.« less
Human Exposure to Electromagnetic Fields from Parallel Wireless Power Transfer Systems.

PubMed

Wen, Feng; Huang, Xueliang

2017-02-08

The scenario of multiple wireless power transfer (WPT) systems working closely, synchronously or asynchronously with phase difference often occurs in power supply for household appliances and electric vehicles in parking lots. Magnetic field leakage from the WPT systems is also varied due to unpredictable asynchronous working conditions. In this study, the magnetic field leakage from parallel WPT systems working with phase difference is predicted, and the induced electric field and specific absorption rate (SAR) in a human body standing in the vicinity are also evaluated. Computational results are compared with the restrictions prescribed in the regulations established to limit human exposure to time-varying electromagnetic fields (EMFs). The results show that the middle region between the two WPT coils is safer for the two WPT systems working in-phase, and the peripheral regions are safer around the WPT systems working anti-phase. Thin metallic plates larger than the WPT coils can shield the magnetic field leakage well, while smaller ones may worsen the situation. The orientation of the human body will influence the maximum magnitude of induced electric field and its distribution within the human body. The induced electric field centralizes in the trunk, groin, and genitals with only one exception: when the human body is standing right at the middle of the two WPT coils working in-phase, the induced electric field focuses on lower limbs. The SAR value in the lungs always seems to be greater than in other organs, while the value in the liver is minimal. Human exposure to EMFs meets the guidelines of the International Committee on Non-Ionizing Radiation Protection (ICNIRP), specifically reference levels with respect to magnetic field and basic restrictions on induced electric fields and SAR, as the charging power is lower than 3.1 kW and 55.5 kW, respectively. These results are positive with respect to the safe applications of parallel WPT systems working simultaneously.
Human Exposure to Electromagnetic Fields from Parallel Wireless Power Transfer Systems

PubMed Central

Wen, Feng; Huang, Xueliang

2017-01-01

The scenario of multiple wireless power transfer (WPT) systems working closely, synchronously or asynchronously with phase difference often occurs in power supply for household appliances and electric vehicles in parking lots. Magnetic field leakage from the WPT systems is also varied due to unpredictable asynchronous working conditions. In this study, the magnetic field leakage from parallel WPT systems working with phase difference is predicted, and the induced electric field and specific absorption rate (SAR) in a human body standing in the vicinity are also evaluated. Computational results are compared with the restrictions prescribed in the regulations established to limit human exposure to time-varying electromagnetic fields (EMFs). The results show that the middle region between the two WPT coils is safer for the two WPT systems working in-phase, and the peripheral regions are safer around the WPT systems working anti-phase. Thin metallic plates larger than the WPT coils can shield the magnetic field leakage well, while smaller ones may worsen the situation. The orientation of the human body will influence the maximum magnitude of induced electric field and its distribution within the human body. The induced electric field centralizes in the trunk, groin, and genitals with only one exception: when the human body is standing right at the middle of the two WPT coils working in-phase, the induced electric field focuses on lower limbs. The SAR value in the lungs always seems to be greater than in other organs, while the value in the liver is minimal. Human exposure to EMFs meets the guidelines of the International Committee on Non-Ionizing Radiation Protection (ICNIRP), specifically reference levels with respect to magnetic field and basic restrictions on induced electric fields and SAR, as the charging power is lower than 3.1 kW and 55.5 kW, respectively. These results are positive with respect to the safe applications of parallel WPT systems working simultaneously. PMID:28208709
Aztec user`s guide. Version 1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.

1995-10-01

Aztec is an iterative library that greatly simplifies the parallelization process when solving the linear systems of equations Ax = b where A is a user supplied n x n sparse matrix, b is a user supplied vector of length n and x is a vector of length n to be computed. Aztec is intended as a software tool for users who want to avoid cumbersome parallel programming details but who have large sparse linear systems which require an efficiently utilized parallel processing system. A collection of data transformation tools are provided that allow for easy creation of distributed sparsemore » unstructured matrices for parallel solution. Once the distributed matrix is created, computation can be performed on any of the parallel machines running Aztec: nCUBE 2, IBM SP2 and Intel Paragon, MPI platforms as well as standard serial and vector platforms. Aztec includes a number of Krylov iterative methods such as conjugate gradient (CG), generalized minimum residual (GMRES) and stabilized biconjugate gradient (BICGSTAB) to solve systems of equations. These Krylov methods are used in conjunction with various preconditioners such as polynomial or domain decomposition methods using LU or incomplete LU factorizations within subdomains. Although the matrix A can be general, the package has been designed for matrices arising from the approximation of partial differential equations (PDEs). In particular, the Aztec package is oriented toward systems arising from PDE applications.« less
Iterative nonlinear joint transform correlation for the detection of objects in cluttered scenes

NASA Astrophysics Data System (ADS)

Haist, Tobias; Tiziani, Hans J.

1999-03-01

An iterative correlation technique with digital image processing in the feedback loop for the detection of small objects in cluttered scenes is proposed. A scanning aperture is combined with the method in order to improve the immunity against noise and clutter. Multiple reference objects or different views of one object are processed in parallel. We demonstrate the method by detecting a noisy and distorted face in a crowd with a nonlinear joint transform correlator.
Parallel design of JPEG-LS encoder on graphics processing units

NASA Astrophysics Data System (ADS)

Duan, Hao; Fang, Yong; Huang, Bormin

2012-01-01

With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.

LDRD final report on massively-parallel linear programming : the parPCx system.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar

2005-02-01

This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runsmore » on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Interger and Combinatorial Optimizer). We conclude with directions for long-term future algorithmic research and for near-term development that could improve the performance of parPCx.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Sadowski, Greg

In one form, a logic circuit includes an asynchronous logic circuit, a synchronous logic circuit, and an interface circuit coupled between the asynchronous logic circuit and the synchronous logic circuit. The asynchronous logic circuit has a plurality of asynchronous outputs for providing a corresponding plurality of asynchronous signals. The synchronous logic circuit has a plurality of synchronous inputs corresponding to the plurality of asynchronous outputs, a stretch input for receiving a stretch signal, and a clock output for providing a clock signal. The synchronous logic circuit provides the clock signal as a periodic signal but prolongs a predetermined state ofmore » the clock signal while the stretch signal is active. The asynchronous interface detects whether metastability could occur when latching any of the plurality of the asynchronous outputs of the asynchronous logic circuit using said clock signal, and activates the stretch signal while the metastability could occur.« less
AZTEC. Parallel Iterative method Software for Solving Linear Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, S.; Shadid, J.; Tuminaro, R.

1995-07-01

AZTEC is an interactive library that greatly simplifies the parrallelization process when solving the linear systems of equations Ax=b where A is a user supplied n X n sparse matrix, b is a user supplied vector of length n and x is a vector of length n to be computed. AZTEC is intended as a software tool for users who want to avoid cumbersome parallel programming details but who have large sparse linear systems which require an efficiently utilized parallel processing system. A collection of data transformation tools are provided that allow for easy creation of distributed sparse unstructured matricesmore » for parallel solutions.« less
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

NASA Technical Reports Server (NTRS)

Straeter, T. A.; Markos, A. T.

1975-01-01

A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Communications oriented programming of parallel iterative solutions of sparse linear systems

NASA Technical Reports Server (NTRS)

Patrick, M. L.; Pratt, T. W.

1986-01-01

Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.
Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Heber, Gerd; Biswas, Rupak

2000-01-01

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data.

PubMed

Liang, Faming; Kim, Jinsu; Song, Qifan

2016-01-01

Markov chain Monte Carlo (MCMC) methods have proven to be a very powerful tool for analyzing data of complex structures. However, their computer-intensive nature, which typically require a large number of iterations and a complete scan of the full dataset for each iteration, precludes their use for big data analysis. In this paper, we propose the so-called bootstrap Metropolis-Hastings (BMH) algorithm, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis; that is to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples. The BMH algorithm possesses an embarrassingly parallel structure and avoids repeated scans of the full dataset in iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH can be generally more efficient as it can asymptotically integrate the whole data information into a single simulation run. The BMH algorithm is very flexible. Like the Metropolis-Hastings algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems. This is illustrated in the paper by the tempering BMH algorithm, which can be viewed as a combination of parallel tempering and the BMH algorithm. BMH can also be used for model selection and optimization by combining with reversible jump MCMC and simulated annealing, respectively.
A Bootstrap Metropolis–Hastings Algorithm for Bayesian Analysis of Big Data

PubMed Central

Kim, Jinsu; Song, Qifan

2016-01-01

Markov chain Monte Carlo (MCMC) methods have proven to be a very powerful tool for analyzing data of complex structures. However, their computer-intensive nature, which typically require a large number of iterations and a complete scan of the full dataset for each iteration, precludes their use for big data analysis. In this paper, we propose the so-called bootstrap Metropolis-Hastings (BMH) algorithm, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis; that is to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples. The BMH algorithm possesses an embarrassingly parallel structure and avoids repeated scans of the full dataset in iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH can be generally more efficient as it can asymptotically integrate the whole data information into a single simulation run. The BMH algorithm is very flexible. Like the Metropolis-Hastings algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems. This is illustrated in the paper by the tempering BMH algorithm, which can be viewed as a combination of parallel tempering and the BMH algorithm. BMH can also be used for model selection and optimization by combining with reversible jump MCMC and simulated annealing, respectively. PMID:29033469
Enhancing Scalability and Efficiency of the TOUGH2_MP for LinuxClusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Wu, Yu-Shu

2006-04-17

TOUGH2{_}MP, the parallel version TOUGH2 code, has been enhanced by implementing more efficient communication schemes. This enhancement is achieved through reducing the amount of small-size messages and the volume of large messages. The message exchange speed is further improved by using non-blocking communications for both linear and nonlinear iterations. In addition, we have modified the AZTEC parallel linear-equation solver to nonblocking communication. Through the improvement of code structuring and bug fixing, the new version code is now more stable, while demonstrating similar or even better nonlinear iteration converging speed than the original TOUGH2 code. As a result, the new versionmore » of TOUGH2{_}MP is improved significantly in its efficiency. In this paper, the scalability and efficiency of the parallel code are demonstrated by solving two large-scale problems. The testing results indicate that speedup of the code may depend on both problem size and complexity. In general, the code has excellent scalability in memory requirement as well as computing time.« less
Marine Controlled-Source Electromagnetic 2D Inversion for synthetic models.

NASA Astrophysics Data System (ADS)

Liu, Y.; Li, Y.

2016-12-01

We present a 2D inverse algorithm for frequency domain marine controlled-source electromagnetic (CSEM) data, which is based on the regularized Gauss-Newton approach. As a forward solver, our parallel adaptive finite element forward modeling program is employed. It is a self-adaptive, goal-oriented grid refinement algorithm in which a finite element analysis is performed on a sequence of refined meshes. The mesh refinement process is guided by a dual error estimate weighting to bias refinement towards elements that affect the solution at the EM receiver locations. With the use of the direct solver (MUMPS), we can effectively compute the electromagnetic fields for multi-sources and parametric sensitivities. We also implement the parallel data domain decomposition approach of Key and Ovall (2011), with the goal of being able to compute accurate responses in parallel for complicated models and a full suite of data parameters typical of offshore CSEM surveys. All minimizations are carried out by using the Gauss-Newton algorithm and model perturbations at each iteration step are obtained by using the Inexact Conjugate Gradient iteration method. Synthetic test inversions are presented.
Summer Proceedings 2016: The Center for Computing Research at Sandia National Laboratories

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carleton, James Brian; Parks, Michael L.

Solving sparse linear systems from the discretization of elliptic partial differential equations (PDEs) is an important building block in many engineering applications. Sparse direct solvers can solve general linear systems, but are usually slower and use much more memory than effective iterative solvers. To overcome these two disadvantages, a hierarchical solver (LoRaSp) based on H2-matrices was introduced in [22]. Here, we have developed a parallel version of the algorithm in LoRaSp to solve large sparse matrices on distributed memory machines. On a single processor, the factorization time of our parallel solver scales almost linearly with the problem size for three-dimensionalmore » problems, as opposed to the quadratic scalability of many existing sparse direct solvers. Moreover, our solver leads to almost constant numbers of iterations, when used as a preconditioner for Poisson problems. On more than one processor, our algorithm has significant speedups compared to sequential runs. With this parallel algorithm, we are able to solve large problems much faster than many existing packages as demonstrated by the numerical experiments.« less
Special purpose parallel computer architecture for real-time control and simulation in robotic applications

NASA Technical Reports Server (NTRS)

Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

1993-01-01

This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Iterative image reconstruction for PROPELLER-MRI using the nonuniform fast fourier transform.

PubMed

Tamhane, Ashish A; Anastasio, Mark A; Gui, Minzhi; Arfanakis, Konstantinos

2010-07-01

To investigate an iterative image reconstruction algorithm using the nonuniform fast Fourier transform (NUFFT) for PROPELLER (Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction) MRI. Numerical simulations, as well as experiments on a phantom and a healthy human subject were used to evaluate the performance of the iterative image reconstruction algorithm for PROPELLER, and compare it with that of conventional gridding. The trade-off between spatial resolution, signal to noise ratio, and image artifacts, was investigated for different values of the regularization parameter. The performance of the iterative image reconstruction algorithm in the presence of motion was also evaluated. It was demonstrated that, for a certain range of values of the regularization parameter, iterative reconstruction produced images with significantly increased signal to noise ratio, reduced artifacts, for similar spatial resolution, compared with gridding. Furthermore, the ability to reduce the effects of motion in PROPELLER-MRI was maintained when using the iterative reconstruction approach. An iterative image reconstruction technique based on the NUFFT was investigated for PROPELLER MRI. For a certain range of values of the regularization parameter, the new reconstruction technique may provide PROPELLER images with improved image quality compared with conventional gridding. (c) 2010 Wiley-Liss, Inc.
Iterative Image Reconstruction for PROPELLER-MRI using the NonUniform Fast Fourier Transform

PubMed Central

Tamhane, Ashish A.; Anastasio, Mark A.; Gui, Minzhi; Arfanakis, Konstantinos

2013-01-01

Purpose To investigate an iterative image reconstruction algorithm using the non-uniform fast Fourier transform (NUFFT) for PROPELLER (Periodically Rotated Overlapping parallEL Lines with Enhanced Reconstruction) MRI. Materials and Methods Numerical simulations, as well as experiments on a phantom and a healthy human subject were used to evaluate the performance of the iterative image reconstruction algorithm for PROPELLER, and compare it to that of conventional gridding. The trade-off between spatial resolution, signal to noise ratio, and image artifacts, was investigated for different values of the regularization parameter. The performance of the iterative image reconstruction algorithm in the presence of motion was also evaluated. Results It was demonstrated that, for a certain range of values of the regularization parameter, iterative reconstruction produced images with significantly increased SNR, reduced artifacts, for similar spatial resolution, compared to gridding. Furthermore, the ability to reduce the effects of motion in PROPELLER-MRI was maintained when using the iterative reconstruction approach. Conclusion An iterative image reconstruction technique based on the NUFFT was investigated for PROPELLER MRI. For a certain range of values of the regularization parameter the new reconstruction technique may provide PROPELLER images with improved image quality compared to conventional gridding. PMID:20578028
Exploring the Ability of a Coarse-grained Potential to Describe the Stress-strain Response of Glassy Polystyrene

DTIC Science & Technology

2012-10-01

using the open-source code Large-scale Atomic/Molecular Massively Parallel Simulator ( LAMMPS ) (http://lammps.sandia.gov) (23). The commercial...parameters are proprietary and cannot be ported to the LAMMPS 4 simulation code. In our molecular dynamics simulations at the atomistic resolution, we...IBI iterative Boltzmann inversion LAMMPS Large-scale Atomic/Molecular Massively Parallel Simulator MAPS Materials Processes and Simulations MS
Real-Time, Multiple, Pan/Tilt/Zoom, Computer Vision Tracking, and 3D Position Estimating System for Unmanned Aerial System Metrology

DTIC Science & Technology

2013-10-18

of the enclosed tasks plus the last parallel task for a total of five parallel tasks for one iteration, i). for j = 1…N for i = 1… 8 Figure...drizzling juices culminating in a state of salivating desire to cut a piece and enjoy. On the other hand, the smell could be that of a pungent, unpleasant
Iterative-method performance evaluation for multiple vectors associated with a large-scale sparse matrix

NASA Astrophysics Data System (ADS)

Imamura, Seigo; Ono, Kenji; Yokokawa, Mitsuo

2016-07-01

Ensemble computing, which is an instance of capacity computing, is an effective computing scenario for exascale parallel supercomputers. In ensemble computing, there are multiple linear systems associated with a common coefficient matrix. We improve the performance of iterative solvers for multiple vectors by solving them at the same time, that is, by solving for the product of the matrices. We implemented several iterative methods and compared their performance. The maximum performance on Sparc VIIIfx was 7.6 times higher than that of a naïve implementation. Finally, to deal with the different convergence processes of linear systems, we introduced a control method to eliminate the calculation of already converged vectors.
Methodes iteratives paralleles: Applications en neutronique et en mecanique des fluides

NASA Astrophysics Data System (ADS)

Qaddouri, Abdessamad

Dans cette these, le calcul parallele est applique successivement a la neutronique et a la mecanique des fluides. Dans chacune de ces deux applications, des methodes iteratives sont utilisees pour resoudre le systeme d'equations algebriques resultant de la discretisation des equations du probleme physique. Dans le probleme de neutronique, le calcul des matrices des probabilites de collision (PC) ainsi qu'un schema iteratif multigroupe utilisant une methode inverse de puissance sont parallelises. Dans le probleme de mecanique des fluides, un code d'elements finis utilisant un algorithme iteratif du type GMRES preconditionne est parallelise. Cette these est presentee sous forme de six articles suivis d'une conclusion. Les cinq premiers articles traitent des applications en neutronique, articles qui representent l'evolution de notre travail dans ce domaine. Cette evolution passe par un calcul parallele des matrices des PC et un algorithme multigroupe parallele teste sur un probleme unidimensionnel (article 1), puis par deux algorithmes paralleles l'un mutiregion l'autre multigroupe, testes sur des problemes bidimensionnels (articles 2--3). Ces deux premieres etapes sont suivies par l'application de deux techniques d'acceleration, le rebalancement neutronique et la minimisation du residu aux deux algorithmes paralleles (article 4). Finalement, on a mis en oeuvre l'algorithme multigroupe et le calcul parallele des matrices des PC sur un code de production DRAGON ou les tests sont plus realistes et peuvent etre tridimensionnels (article 5). Le sixieme article (article 6), consacre a l'application a la mecanique des fluides, traite la parallelisation d'un code d'elements finis FES ou le partitionneur de graphe METIS et la librairie PSPARSLIB sont utilises.
Generalizations of the Alternating Direction Method of Multipliers for Large-Scale and Distributed Optimization

DTIC Science & Technology

2014-05-01

exact one is solved later — as- signed as step 5 of Algorithm 2 — because at each iteration , the ADMM updates the variables in the Gauss - Seidel ...Jacobi ADMM (see Algo- rithm 5 below). Unlike the Gauss - Seidel ADMM, the Jacobi ADMM updates all the 70 blocks in parallel at every iteration : xk+1i...that extending ADMM straightforwardly from the classic Gauss - Seidel setting to the Jacobi setting, from two blocks to multiple blocks, will preserve
The monophasic action potential upstroke: a means of characterizing local conduction.

PubMed

Levine, J H; Moore, E N; Kadish, A H; Guarnieri, T; Spear, J F

1986-11-01

The upstrokes of monophasic action potentials (MAPs) recorded with an extracellular pressure electrode were characterized in isolated canine tissue preparations in vitro. The characteristics of the MAP upstroke were compared with those of the local action potential foot as well as with the characteristics of approaching electrical activation during uniform and asynchronous conduction. The upstroke of the MAP was exponential during uniform conduction. The time constant of rise of the MAP upstroke (TMAP) correlated with that of the action potential foot (Tfoot): TMAP + 1.01 Tfoot + 0.50; r2 = .80. Furthermore, changes in Tfoot with alterations in cycle length were associated with similar changes in TMAP: Tfoot = 1.06 TMAP - 0.11; r2 = .78. In addition, TMAP and Tfoot both deviated from exponential during asynchronous activation; the inflections that developed in the MAP upstroke correlated in time with intracellular action potential upstrokes that were asynchronous in onset in these tissues. Finally, the field of view of the MAP was determined and was found to be dependent in part on tissue architecture and the space constant. Specifically, the field of view of the MAP was found to be greater parallel compared with transverse to fiber orientation (6.02 +/- 1.74 vs 3.03 +/- 1.10 mm; p less than .01). These data suggest that the MAP upstroke may be used to define and characterize local electrical activation. The relatively large field of view of the MAP suggests that this technique may be a sensitive means to record focal membrane phenomena in vivo.

Inversion of potential field data using the finite element method on parallel computers

NASA Astrophysics Data System (ADS)

Gross, L.; Altinay, C.; Shaw, S.

2015-11-01

In this paper we present a formulation of the joint inversion of potential field anomaly data as an optimization problem with partial differential equation (PDE) constraints. The problem is solved using the iterative Broyden-Fletcher-Goldfarb-Shanno (BFGS) method with the Hessian operator of the regularization and cross-gradient component of the cost function as preconditioner. We will show that each iterative step requires the solution of several PDEs namely for the potential fields, for the adjoint defects and for the application of the preconditioner. In extension to the traditional discrete formulation the BFGS method is applied to continuous descriptions of the unknown physical properties in combination with an appropriate integral form of the dot product. The PDEs can easily be solved using standard conforming finite element methods (FEMs) with potentially different resolutions. For two examples we demonstrate that the number of PDE solutions required to reach a given tolerance in the BFGS iteration is controlled by weighting regularization and cross-gradient but is independent of the resolution of PDE discretization and that as a consequence the method is weakly scalable with the number of cells on parallel computers. We also show a comparison with the UBC-GIF GRAV3D code.
Scalable splitting algorithms for big-data interferometric imaging in the SKA era

NASA Astrophysics Data System (ADS)

Onose, Alexandru; Carrillo, Rafael E.; Repetti, Audrey; McEwen, Jason D.; Thiran, Jean-Philippe; Pesquet, Jean-Christophe; Wiaux, Yves

2016-11-01

In the context of next-generation radio telescopes, like the Square Kilometre Array (SKA), the efficient processing of large-scale data sets is extremely important. Convex optimization tasks under the compressive sensing framework have recently emerged and provide both enhanced image reconstruction quality and scalability to increasingly larger data sets. We focus herein mainly on scalability and propose two new convex optimization algorithmic structures able to solve the convex optimization tasks arising in radio-interferometric imaging. They rely on proximal splitting and forward-backward iterations and can be seen, by analogy, with the CLEAN major-minor cycle, as running sophisticated CLEAN-like iterations in parallel in multiple data, prior, and image spaces. Both methods support any convex regularization function, in particular, the well-studied ℓ1 priors promoting image sparsity in an adequate domain. Tailored for big-data, they employ parallel and distributed computations to achieve scalability, in terms of memory and computational requirements. One of them also exploits randomization, over data blocks at each iteration, offering further flexibility. We present simulation results showing the feasibility of the proposed methods as well as their advantages compared to state-of-the-art algorithmic solvers. Our MATLAB code is available online on GitHub.
Performance evaluation of canny edge detection on a tiled multicore architecture

NASA Astrophysics Data System (ADS)

Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

2011-01-01

In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP,1 and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that for the same number of threads, programmer implemented, domain decomposition exhibits higher speed-ups than the compiler managed, loop-level parallelism implemented with OpenMP.
UPC++ Programmer’s Guide, v1.0-2018.3.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bachan, J.; Baden, S.; Bonachea, Dan

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operationsmore » that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.« less
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

PubMed Central

Ho, Qirong; Cipar, James; Cui, Henggang; Kim, Jin Kyu; Lee, Seunghak; Gibbons, Phillip B.; Gibson, Garth A.; Ganger, Gregory R.; Xing, Eric P.

2014-01-01

We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes. PMID:25400488
Real time software tools and methodologies

NASA Technical Reports Server (NTRS)

Christofferson, M. J.

1981-01-01

Real time systems are characterized by high speed processing and throughput as well as asynchronous event processing requirements. These requirements give rise to particular implementations of parallel or pipeline multitasking structures, of intertask or interprocess communications mechanisms, and finally of message (buffer) routing or switching mechanisms. These mechanisms or structures, along with the data structue, describe the essential character of the system. These common structural elements and mechanisms are identified, their implementation in the form of routines, tasks or macros - in other words, tools are formalized. The tools developed support or make available the following: reentrant task creation, generalized message routing techniques, generalized task structures/task families, standardized intertask communications mechanisms, and pipeline and parallel processing architectures in a multitasking environment. Tools development raise some interesting prospects in the areas of software instrumentation and software portability. These issues are discussed following the description of the tools themselves.
Performance Enhancement Strategies for Multi-Block Overset Grid CFD Applications

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

The overset grid methodology has significantly reduced time-to-solution of highfidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process resolves the geometrical complexity of the problem domain by using separately generated but overlapping structured discretization grids that periodically exchange information through interpolation. However, high performance computations of such large-scale realistic applications must be handled efficiently on state-of-the-art parallel supercomputers. This paper analyzes the effects of various performance enhancement strategies on the parallel efficiency of an overset grid Navier-Stokes CFD application running on an SGI Origin2000 machinc. Specifically, the role of asynchronous communication, grid splitting, and grid grouping strategies are presented and discussed. Details of a sophisticated graph partitioning technique for grid grouping are also provided. Results indicate that performance depends critically on the level of latency hiding and the quality of load balancing across the processors.
Tempest: Accelerated MS/MS Database Search Software for Heterogeneous Computing Platforms.

PubMed

Adamo, Mark E; Gerber, Scott A

2016-09-07

MS/MS database search algorithms derive a set of candidate peptide sequences from in silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU (central processing unit) generates peptide candidates that are asynchronously sent to a discrete GPU (graphics processing unit) to be scored against experimental spectra in parallel. The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. © 2016 by John Wiley & Sons, Inc. Copyright © 2016 John Wiley & Sons, Inc.
Online measurement for geometrical parameters of wheel set based on structure light and CUDA parallel processing

NASA Astrophysics Data System (ADS)

Wu, Kaihua; Shao, Zhencheng; Chen, Nian; Wang, Wenjie

2018-01-01

The wearing degree of the wheel set tread is one of the main factors that influence the safety and stability of running train. Geometrical parameters mainly include flange thickness and flange height. Line structure laser light was projected on the wheel tread surface. The geometrical parameters can be deduced from the profile image. An online image acquisition system was designed based on asynchronous reset of CCD and CUDA parallel processing unit. The image acquisition was fulfilled by hardware interrupt mode. A high efficiency parallel segmentation algorithm based on CUDA was proposed. The algorithm firstly divides the image into smaller squares, and extracts the squares of the target by fusion of k_means and STING clustering image segmentation algorithm. Segmentation time is less than 0.97ms. A considerable acceleration ratio compared with the CPU serial calculation was obtained, which greatly improved the real-time image processing capacity. When wheel set was running in a limited speed, the system placed alone railway line can measure the geometrical parameters automatically. The maximum measuring speed is 120km/h.
A Stream Tilling Approach to Surface Area Estimation for Large Scale Spatial Data in a Shared Memory System

NASA Astrophysics Data System (ADS)

Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua

2017-12-01

Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tilling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework towards the scheduling of the I/O processes and computing units. Herein, each computing unit encapsulated a same copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the performed experiment demonstrated that our stream tilling estimation can efficiently alleviate the heavy pressures from the I/O-bound work, and the measured speedup after being optimized have greatly outperformed the directly parallel versions in shared memory systems with multi-core processors.
Stability analysis of a deterministic dose calculation for MRI-guided radiotherapy.

PubMed

Zelyak, O; Fallone, B G; St-Aubin, J

2017-12-14

Modern effort in radiotherapy to address the challenges of tumor localization and motion has led to the development of MRI guided radiotherapy technologies. Accurate dose calculations must properly account for the effects of the MRI magnetic fields. Previous work has investigated the accuracy of a deterministic linear Boltzmann transport equation (LBTE) solver that includes magnetic field, but not the stability of the iterative solution method. In this work, we perform a stability analysis of this deterministic algorithm including an investigation of the convergence rate dependencies on the magnetic field, material density, energy, and anisotropy expansion. The iterative convergence rate of the continuous and discretized LBTE including magnetic fields is determined by analyzing the spectral radius using Fourier analysis for the stationary source iteration (SI) scheme. The spectral radius is calculated when the magnetic field is included (1) as a part of the iteration source, and (2) inside the streaming-collision operator. The non-stationary Krylov subspace solver GMRES is also investigated as a potential method to accelerate the iterative convergence, and an angular parallel computing methodology is investigated as a method to enhance the efficiency of the calculation. SI is found to be unstable when the magnetic field is part of the iteration source, but unconditionally stable when the magnetic field is included in the streaming-collision operator. The discretized LBTE with magnetic fields using a space-angle upwind stabilized discontinuous finite element method (DFEM) was also found to be unconditionally stable, but the spectral radius rapidly reaches unity for very low-density media and increasing magnetic field strengths indicating arbitrarily slow convergence rates. However, GMRES is shown to significantly accelerate the DFEM convergence rate showing only a weak dependence on the magnetic field. In addition, the use of an angular parallel computing strategy is shown to potentially increase the efficiency of the dose calculation.
Stability analysis of a deterministic dose calculation for MRI-guided radiotherapy

NASA Astrophysics Data System (ADS)

Zelyak, O.; Fallone, B. G.; St-Aubin, J.

2018-01-01

Modern effort in radiotherapy to address the challenges of tumor localization and motion has led to the development of MRI guided radiotherapy technologies. Accurate dose calculations must properly account for the effects of the MRI magnetic fields. Previous work has investigated the accuracy of a deterministic linear Boltzmann transport equation (LBTE) solver that includes magnetic field, but not the stability of the iterative solution method. In this work, we perform a stability analysis of this deterministic algorithm including an investigation of the convergence rate dependencies on the magnetic field, material density, energy, and anisotropy expansion. The iterative convergence rate of the continuous and discretized LBTE including magnetic fields is determined by analyzing the spectral radius using Fourier analysis for the stationary source iteration (SI) scheme. The spectral radius is calculated when the magnetic field is included (1) as a part of the iteration source, and (2) inside the streaming-collision operator. The non-stationary Krylov subspace solver GMRES is also investigated as a potential method to accelerate the iterative convergence, and an angular parallel computing methodology is investigated as a method to enhance the efficiency of the calculation. SI is found to be unstable when the magnetic field is part of the iteration source, but unconditionally stable when the magnetic field is included in the streaming-collision operator. The discretized LBTE with magnetic fields using a space-angle upwind stabilized discontinuous finite element method (DFEM) was also found to be unconditionally stable, but the spectral radius rapidly reaches unity for very low-density media and increasing magnetic field strengths indicating arbitrarily slow convergence rates. However, GMRES is shown to significantly accelerate the DFEM convergence rate showing only a weak dependence on the magnetic field. In addition, the use of an angular parallel computing strategy is shown to potentially increase the efficiency of the dose calculation.
Corrigendum to "Stability analysis of a deterministic dose calculation for MRI-guided radiotherapy".

PubMed

Zelyak, Oleksandr; Fallone, B Gino; St-Aubin, Joel

2018-03-12

Modern effort in radiotherapy to address the challenges of tumor localization and motion has led to the development of MRI guided radiotherapy technologies. Accurate dose calculations must properly account for the effects of the MRI magnetic fields. Previous work has investigated the accuracy of a deterministic linear Boltzmann transport equation (LBTE) solver that includes magnetic field, but not the stability of the iterative solution method. In this work, we perform a stability analysis of this deterministic algorithm including an investigation of the convergence rate dependencies on the magnetic field, material density, energy, and anisotropy expansion. The iterative convergence rate of the continuous and discretized LBTE including magnetic fields is determined by analyzing the spectral radius using Fourier analysis for the stationary source iteration (SI) scheme. The spectral radius is calculated when the magnetic field is included (1) as a part of the iteration source, and (2) inside the streaming-collision operator. The non-stationary Krylov subspace solver GMRES is also investigated as a potential method to accelerate the iterative convergence, and an angular parallel computing methodology is investigated as a method to enhance the efficiency of the calculation. SI is found to be unstable when the magnetic field is part of the iteration source, but unconditionally stable when the magnetic field is included in the streaming-collision operator. The discretized LBTE with magnetic fields using a space-angle upwind stabilized discontinuous finite element method (DFEM) was also found to be unconditionally stable, but the spectral radius rapidly reaches unity for very low density media and increasing magnetic field strengths indicating arbitrarily slow convergence rates. However, GMRES is shown to significantly accelerate the DFEM convergence rate showing only a weak dependence on the magnetic field. In addition, the use of an angular parallel computing strategy is shown to potentially increase the efficiency of the dose calculation. © 2018 Institute of Physics and Engineering in Medicine.
Multiprocessor smalltalk: Implementation, performance, and analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pallas, J.I.

1990-01-01

Multiprocessor Smalltalk demonstrates the value of object-oriented programming on a multiprocessor. Its implementation and analysis shed light on three areas: concurrent programming in an object oriented language without special extensions, implementation techniques for adapting to multiprocessors, and performance factors in the resulting system. Adding parallelism to Smalltalk code is easy, because programs already use control abstractions like iterators. Smalltalk's basic control and concurrency primitives (lambda expressions, processes and semaphores) can be used to build parallel control abstractions, including parallel iterators, parallel objects, atomic objects, and futures. Language extensions for concurrency are not required. This implementation demonstrates that it is possiblemore » to build an efficient parallel object-oriented programming system and illustrates techniques for doing so. Three modification tools-serialization, replication, and reorganization-adapted the Berkeley Smalltalk interpreter to the Firefly multiprocessor. Multiprocessor Smalltalk's performance shows that the combination of multiprocessing and object-oriented programming can be effective: speedups (relative to the original serial version) exceed 2.0 for five processors on all the benchmarks; the median efficiency is 48%. Analysis shows both where performance is lost and how to improve and generalize the experimental results. Changes in the interpreter to support concurrency add at most 12% overhead; better access to per-process variables could eliminate much of that. Changes in the user code to express concurrency add as much as 70% overhead; this overhead could be reduced to 54% if blocks (lambda expressions) were reentrant. Performance is also lost when the program cannot keep all five processors busy.« less
Self-Consistent Scheme for Spike-Train Power Spectra in Heterogeneous Sparse Networks

PubMed Central

Pena, Rodrigo F. O.; Vellmer, Sebastian; Bernardi, Davide; Roque, Antonio C.; Lindner, Benjamin

2018-01-01

Recurrent networks of spiking neurons can be in an asynchronous state characterized by low or absent cross-correlations and spike statistics which resemble those of cortical neurons. Although spatial correlations are negligible in this state, neurons can show pronounced temporal correlations in their spike trains that can be quantified by the autocorrelation function or the spike-train power spectrum. Depending on cellular and network parameters, correlations display diverse patterns (ranging from simple refractory-period effects and stochastic oscillations to slow fluctuations) and it is generally not well-understood how these dependencies come about. Previous work has explored how the single-cell correlations in a homogeneous network (excitatory and inhibitory integrate-and-fire neurons with nearly balanced mean recurrent input) can be determined numerically from an iterative single-neuron simulation. Such a scheme is based on the fact that every neuron is driven by the network noise (i.e., the input currents from all its presynaptic partners) but also contributes to the network noise, leading to a self-consistency condition for the input and output spectra. Here we first extend this scheme to homogeneous networks with strong recurrent inhibition and a synaptic filter, in which instabilities of the previous scheme are avoided by an averaging procedure. We then extend the scheme to heterogeneous networks in which (i) different neural subpopulations (e.g., excitatory and inhibitory neurons) have different cellular or connectivity parameters; (ii) the number and strength of the input connections are random (Erdős-Rényi topology) and thus different among neurons. In all heterogeneous cases, neurons are lumped in different classes each of which is represented by a single neuron in the iterative scheme; in addition, we make a Gaussian approximation of the input current to the neuron. These approximations seem to be justified over a broad range of parameters as indicated by comparison with simulation results of large recurrent networks. Our method can help to elucidate how network heterogeneity shapes the asynchronous state in recurrent neural networks. PMID:29551968
Exploiting loop level parallelism in nonprocedural dataflow programs

NASA Technical Reports Server (NTRS)

Gokhale, Maya B.

1987-01-01

Discussed are how loop level parallelism is detected in a nonprocedural dataflow program, and how a procedural program with concurrent loops is scheduled. Also discussed is a program restructuring technique which may be applied to recursive equations so that concurrent loops may be generated for a seemingly iterative computation. A compiler which generates C code for the language described below has been implemented. The scheduling component of the compiler and the restructuring transformation are described.
Fault Tolerant Parallel Implementations of Iterative Algorithms for Optimal Control Problems

DTIC Science & Technology

1988-01-21

p/.V)] steps, but did not discuss any specific parallel implementation. Gajski [51 improved upon this result by performing the SIMD computation in...N = p2. our approach reduces to that of [51, except that Gajski presents the coefficient computation and partial solution phases as a single...8217>. the SIMD algo- rithm presented by Gajski [5] can be most efficiently mapped to a unidirec- tional ring network with broadcasting capability. Based
Maintaining High Assurance in Asynchronous Messaging

DTIC Science & Technology

2015-10-24

Assurance in Asynchronous Messaging Kevin E. Foltz and William R. Simpson Abstract—Asynchronous messaging is the delivery of a message without... integrity , and confidentiality guarantees. End-to-end security for asynchronous messaging must be provided by the asynchronous messaging layer itself... continuing its processing. At the completion of message transmission, the sender does not know when or whether the receiver received it. The message
PRIM: An Efficient Preconditioning Iterative Reweighted Least Squares Method for Parallel Brain MRI Reconstruction.

PubMed

Xu, Zheng; Wang, Sheng; Li, Yeqing; Zhu, Feiyun; Huang, Junzhou

2018-02-08

The most recent history of parallel Magnetic Resonance Imaging (pMRI) has in large part been devoted to finding ways to reduce acquisition time. While joint total variation (JTV) regularized model has been demonstrated as a powerful tool in increasing sampling speed for pMRI, however, the major bottleneck is the inefficiency of the optimization method. While all present state-of-the-art optimizations for the JTV model could only reach a sublinear convergence rate, in this paper, we squeeze the performance by proposing a linear-convergent optimization method for the JTV model. The proposed method is based on the Iterative Reweighted Least Squares algorithm. Due to the complexity of the tangled JTV objective, we design a novel preconditioner to further accelerate the proposed method. Extensive experiments demonstrate the superior performance of the proposed algorithm for pMRI regarding both accuracy and efficiency compared with state-of-the-art methods.
SPIRiT: Iterative Self-consistent Parallel Imaging Reconstruction from Arbitrary k-Space

PubMed Central

Lustig, Michael; Pauly, John M.

2010-01-01

A new approach to autocalibrating, coil-by-coil parallel imaging reconstruction is presented. It is a generalized reconstruction framework based on self consistency. The reconstruction problem is formulated as an optimization that yields the most consistent solution with the calibration and acquisition data. The approach is general and can accurately reconstruct images from arbitrary k-space sampling patterns. The formulation can flexibly incorporate additional image priors such as off-resonance correction and regularization terms that appear in compressed sensing. Several iterative strategies to solve the posed reconstruction problem in both image and k-space domain are presented. These are based on a projection over convex sets (POCS) and a conjugate gradient (CG) algorithms. Phantom and in-vivo studies demonstrate efficient reconstructions from undersampled Cartesian and spiral trajectories. Reconstructions that include off-resonance correction and nonlinear ℓ1-wavelet regularization are also demonstrated. PMID:20665790

Efficient relaxed-Jacobi smoothers for multigrid on parallel computers

NASA Astrophysics Data System (ADS)

Yang, Xiang; Mittal, Rajat

2017-03-01

In this Technical Note, we present a family of Jacobi-based multigrid smoothers suitable for the solution of discretized elliptic equations. These smoothers are based on the idea of scheduled-relaxation Jacobi proposed recently by Yang & Mittal (2014) [18] and employ two or three successive relaxed Jacobi iterations with relaxation factors derived so as to maximize the smoothing property of these iterations. The performance of these new smoothers measured in terms of convergence acceleration and computational workload, is assessed for multi-domain implementations typical of parallelized solvers, and compared to the lexicographic point Gauss-Seidel smoother. The tests include the geometric multigrid method on structured grids as well as the algebraic grid method on unstructured grids. The tests demonstrate that unlike Gauss-Seidel, the convergence of these Jacobi-based smoothers is unaffected by domain decomposition, and furthermore, they outperform the lexicographic Gauss-Seidel by factors that increase with domain partition count.
Aerodynamic optimization by simultaneously updating flow variables and design parameters

NASA Technical Reports Server (NTRS)

Rizk, M. H.

1990-01-01

The application of conventional optimization schemes to aerodynamic design problems leads to inner-outer iterative procedures that are very costly. An alternative approach is presented based on the idea of updating the flow variable iterative solutions and the design parameter iterative solutions simultaneously. Two schemes based on this idea are applied to problems of correcting wind tunnel wall interference and optimizing advanced propeller designs. The first of these schemes is applicable to a limited class of two-design-parameter problems with an equality constraint. It requires the computation of a single flow solution. The second scheme is suitable for application to general aerodynamic problems. It requires the computation of several flow solutions in parallel. In both schemes, the design parameters are updated as the iterative flow solutions evolve. Computations are performed to test the schemes' efficiency, accuracy, and sensitivity to variations in the computational parameters.
Parallel medicinal chemistry approaches to selective HDAC1/HDAC2 inhibitor (SHI-1:2) optimization.

PubMed

Kattar, Solomon D; Surdi, Laura M; Zabierek, Anna; Methot, Joey L; Middleton, Richard E; Hughes, Bethany; Szewczak, Alexander A; Dahlberg, William K; Kral, Astrid M; Ozerova, Nicole; Fleming, Judith C; Wang, Hongmei; Secrist, Paul; Harsch, Andreas; Hamill, Julie E; Cruz, Jonathan C; Kenific, Candia M; Chenard, Melissa; Miller, Thomas A; Berk, Scott C; Tempest, Paul

2009-02-15

The successful application of both solid and solution phase library synthesis, combined with tight integration into the medicinal chemistry effort, resulted in the efficient optimization of a novel structural series of selective HDAC1/HDAC2 inhibitors by the MRL-Boston Parallel Medicinal Chemistry group. An initial lead from a small parallel library was found to be potent and selective in biochemical assays. Advanced compounds were the culmination of iterative library design and possess excellent biochemical and cellular potency, as well as acceptable PK and efficacy in animal models.
Relationship between cortical state and spiking activity in the lateral geniculate nucleus of marmosets

PubMed Central

Pietersen, Alexander N.J.; Cheong, Soon Keen; Munn, Brandon; Gong, Pulin; Solomon, Samuel G.

2017-01-01

Key points How parallel are the primate visual pathways? In the present study, we demonstrate that parallel visual pathways in the dorsal lateral geniculate nucleus (LGN) show distinct patterns of interaction with rhythmic activity in the primary visual cortex (V1).In the V1 of anaesthetized marmosets, the EEG frequency spectrum undergoes transient changes that are characterized by fluctuations in delta‐band EEG power.We show that, on multisecond timescales, spiking activity in an evolutionary primitive (koniocellular) LGN pathway is specifically linked to these slow EEG spectrum changes. By contrast, on subsecond (delta frequency) timescales, cortical oscillations can entrain spiking activity throughout the entire LGN.Our results are consistent with the hypothesis that, in waking animals, the koniocellular pathway selectively participates in brain circuits controlling vigilance and attention. Abstract The major afferent cortical pathway in the visual system passes through the dorsal lateral geniculate nucleus (LGN), where nerve signals originating in the eye can first interact with brain circuits regulating visual processing, vigilance and attention. In the present study, we investigated how ongoing and visually driven activity in magnocellular (M), parvocellular (P) and koniocellular (K) layers of the LGN are related to cortical state. We recorded extracellular spiking activity in the LGN simultaneously with local field potentials (LFP) in primary visual cortex, in sufentanil‐anaesthetized marmoset monkeys. We found that asynchronous cortical states (marked by low power in delta‐band LFPs) are linked to high spike rates in K cells (but not P cells or M cells), on multisecond timescales. Cortical asynchrony precedes the increases in K cell spike rates by 1–3 s, implying causality. At subsecond timescales, the spiking activity in many cells of all (M, P and K) classes is phase‐locked to delta waves in the cortical LFP, and more cells are phase‐locked during synchronous cortical states than during asynchronous cortical states. The switch from low‐to‐high spike rates in K cells does not degrade their visual signalling capacity. By contrast, during asynchronous cortical states, the fidelity of visual signals transmitted by K cells is improved, probably because K cell responses become less rectified. Overall, the data show that slow fluctuations in cortical state are selectively linked to K pathway spiking activity, whereas delta‐frequency cortical oscillations entrain spiking activity throughout the entire LGN, in anaesthetized marmosets. PMID:28116750
Exploring asynchronous brainstorming in large groups: a field comparison of serial and parallel subgroups.

PubMed

de Vreede, Gert-Jan; Briggs, Robert O; Reiter-Palmon, Roni

2010-04-01

The aim of this study was to compare the results of two different modes of using multiple groups (instead of one large group) to identify problems and develop solutions. Many of the complex problems facing organizations today require the use of very large groups or collaborations of groups from multiple organizations. There are many logistical problems associated with the use of such large groups, including the ability to bring everyone together at the same time and location. A field study involved two different organizations and compared productivity and satisfaction of group. The approaches included (a) multiple small groups, each completing the entire process from start to end and combining the results at the end (parallel mode); and (b) multiple subgroups, each building on the work provided by previous subgroups (serial mode). Groups using the serial mode produced more elaborations compared with parallel groups, whereas parallel groups produced more unique ideas compared with serial groups. No significant differences were found related to satisfaction with process and outcomes between the two modes. Preferred mode depends on the type of task facing the group. Parallel groups are more suited for tasks for which a variety of new ideas are needed, whereas serial groups are best suited when elaboration and in-depth thinking on the solution are required. Results of this research can guide the development of facilitated sessions of large groups or "teams of teams."
Fast iterative censoring CFAR algorithm for ship detection from SAR images

NASA Astrophysics Data System (ADS)

Gu, Dandan; Yue, Hui; Zhang, Yuan; Gao, Pengcheng

2017-11-01

Ship detection is one of the essential techniques for ship recognition from synthetic aperture radar (SAR) images. This paper presents a fast iterative detection procedure to eliminate the influence of target returns on the estimation of local sea clutter distributions for constant false alarm rate (CFAR) detectors. A fast block detector is first employed to extract potential target sub-images; and then, an iterative censoring CFAR algorithm is used to detect ship candidates from each target blocks adaptively and efficiently, where parallel detection is available, and statistical parameters of G0 distribution fitting local sea clutter well can be quickly estimated based on an integral image operator. Experimental results of TerraSAR-X images demonstrate the effectiveness of the proposed technique.
Heinrich events simulated across the glacial

NASA Astrophysics Data System (ADS)

Ziemen, F. A.; Mikolajewicz, U.

2015-12-01

Heinrich events are among the most prominent climate change events recorded in proxies across the northern hemisphere. They are the archetype of ice sheet — climate interactions on millennial time scales. Nevertheless, the exact mechanisms that cause Heinrich events are still under discussion, and their climatic consequences are far from being fully understood. We contribute to answering the open questions by studying Heinrich events in a coupled ice sheet model (ISM) atmosphere-ocean-vegetation general circulation model (AOVGCM) framework, where this variability occurs as part of the model generated internal variability. The setup consists of a northern hemisphere setup of the modified Parallel Ice Sheet Model (mPISM) coupled to the global AOVGCM ECHAM5/MPIOM/LPJ. The simulations were performed fully coupled and with transient orbital and greenhouse gas forcing. They span from several millennia before the last glacial maximum into the deglaciation. We analyze simulations where the ISM is coupled asynchronously to the AOVGCM and simulations where the ISM and the ocean model are coupled synchronously and the atmosphere model is coupled asynchronously to them. The modeled Heinrich events show a marked influence of the ice discharge on the Atlantic circulation and heat transport.
Unified Lambert Tool for Massively Parallel Applications in Space Situational Awareness

NASA Astrophysics Data System (ADS)

Woollands, Robyn M.; Read, Julie; Hernandez, Kevin; Probe, Austin; Junkins, John L.

2018-03-01

This paper introduces a parallel-compiled tool that combines several of our recently developed methods for solving the perturbed Lambert problem using modified Chebyshev-Picard iteration. This tool (unified Lambert tool) consists of four individual algorithms, each of which is unique and better suited for solving a particular type of orbit transfer. The first is a Keplerian Lambert solver, which is used to provide a good initial guess (warm start) for solving the perturbed problem. It is also used to determine the appropriate algorithm to call for solving the perturbed problem. The arc length or true anomaly angle spanned by the transfer trajectory is the parameter that governs the automated selection of the appropriate perturbed algorithm, and is based on the respective algorithm convergence characteristics. The second algorithm solves the perturbed Lambert problem using the modified Chebyshev-Picard iteration two-point boundary value solver. This algorithm does not require a Newton-like shooting method and is the most efficient of the perturbed solvers presented herein, however the domain of convergence is limited to about a third of an orbit and is dependent on eccentricity. The third algorithm extends the domain of convergence of the modified Chebyshev-Picard iteration two-point boundary value solver to about 90% of an orbit, through regularization with the Kustaanheimo-Stiefel transformation. This is the second most efficient of the perturbed set of algorithms. The fourth algorithm uses the method of particular solutions and the modified Chebyshev-Picard iteration initial value solver for solving multiple revolution perturbed transfers. This method does require "shooting" but differs from Newton-like shooting methods in that it does not require propagation of a state transition matrix. The unified Lambert tool makes use of the General Mission Analysis Tool and we use it to compute thousands of perturbed Lambert trajectories in parallel on the Space Situational Awareness computer cluster at the LASR Lab, Texas A&M University. We demonstrate the power of our tool by solving a highly parallel example problem, that is the generation of extremal field maps for optimal spacecraft rendezvous (and eventual orbit debris removal). In addition we demonstrate the need for including perturbative effects in simulations for satellite tracking or data association. The unified Lambert tool is ideal for but not limited to space situational awareness applications.
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2014-10-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
Exploiting Vector and Multicore Parallelsim for Recursive, Data- and Task-Parallel Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal

Modern hardware contains parallel execution resources that are well-suited for data-parallelism-vector units-and task parallelism-multicores. However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently as multicores. Our framework allows us to define schedulers that can dynamically select between executing task- blocks on vector units or multicores. We show that thesemore » schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel pro- grams into task block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.« less
Parallel AFSA algorithm accelerating based on MIC architecture

NASA Astrophysics Data System (ADS)

Zhou, Junhao; Xiao, Hong; Huang, Yifan; Li, Yongzhao; Xu, Yuanrui

2017-05-01

Analysis AFSA past for solving the traveling salesman problem, the algorithm efficiency is often a big problem, and the algorithm processing method, it does not fully responsive to the characteristics of the traveling salesman problem to deal with, and therefore proposes a parallel join improved AFSA process. The simulation with the current TSP known optimal solutions were analyzed, the results showed that the AFSA iterations improved less, on the MIC cards doubled operating efficiency, efficiency significantly.
Air Force Maui Optical and Supercomputing Site (AMOS) Application Briefs 2004

DTIC Science & Technology

2004-01-01

Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection...17. LIMITATION OF ABSTRACT Same as Report (SAR) 18. NUMBER OF PAGES 60 19a. NAME OF RESPONSIBLE PERSON a. REPORT unclassified b. ABSTRACT...SOR methods, the work in parallel matrix iterative methods is very relevant to the parallelization of a CA. The Moore neighborhood used in this work
Nonlinear study of the parallel velocity/tearing instability using an implicit, nonlinear resistive MHD solver

NASA Astrophysics Data System (ADS)

Chacon, L.; Finn, J. M.; Knoll, D. A.

2000-10-01

Recently, a new parallel velocity instability has been found.(J. M. Finn, Phys. Plasmas), 2, 12 (1995) This mode is a tearing mode driven unstable by curvature effects and sound wave coupling in the presence of parallel velocity shear. Under such conditions, linear theory predicts that tearing instabilities will grow even in situations in which the classical tearing mode is stable. This could then be a viable seed mechanism for the neoclassical tearing mode, and hence a non-linear study is of interest. Here, the linear and non-linear stages of this instability are explored using a fully implicit, fully nonlinear 2D reduced resistive MHD code,(L. Chacon et al), ``Implicit, Jacobian-free Newton-Krylov 2D reduced resistive MHD nonlinear solver,'' submitted to J. Comput. Phys. (2000) including viscosity and particle transport effects. The nonlinear implicit time integration is performed using the Newton-Raphson iterative algorithm. Krylov iterative techniques are employed for the required algebraic matrix inversions, implemented Jacobian-free (i.e., without ever forming and storing the Jacobian matrix), and preconditioned with a ``physics-based'' preconditioner. Nonlinear results indicate that, for large total plasma beta and large parallel velocity shear, the instability results in the generation of large poloidal shear flows and large magnetic islands even in regimes when the classical tearing mode is absolutely stable. For small viscosity, the time asymptotic state can be turbulent.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoemmen, Mark

2010-11-01

Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less
A parallel variable metric optimization algorithm

NASA Technical Reports Server (NTRS)

Straeter, T. A.

1973-01-01

An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
Parallel multigrid smoothing: polynomial versus Gauss-Seidel

NASA Astrophysics Data System (ADS)

Adams, Mark; Brezina, Marian; Hu, Jonathan; Tuminaro, Ray

2003-07-01

Gauss-Seidel is often the smoother of choice within multigrid applications. In the context of unstructured meshes, however, maintaining good parallel efficiency is difficult with multiplicative iterative methods such as Gauss-Seidel. This leads us to consider alternative smoothers. We discuss the computational advantages of polynomial smoothers within parallel multigrid algorithms for positive definite symmetric systems. Two particular polynomials are considered: Chebyshev and a multilevel specific polynomial. The advantages of polynomial smoothing over traditional smoothers such as Gauss-Seidel are illustrated on several applications: Poisson's equation, thin-body elasticity, and eddy current approximations to Maxwell's equations. While parallelizing the Gauss-Seidel method typically involves a compromise between a scalable convergence rate and maintaining high flop rates, polynomial smoothers achieve parallel scalable multigrid convergence rates without sacrificing flop rates. We show that, although parallel computers are the main motivation, polynomial smoothers are often surprisingly competitive with Gauss-Seidel smoothers on serial machines.
ASC-ATDM Performance Portability Requirements for 2015-2019

DOE Office of Scientific and Technical Information (OSTI.GOV)

Edwards, Harold C.; Trott, Christian Robert

This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC ) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a., Kokkos) project for 2015 - 2019 . The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread - parallel programming model a nd standard C++ library - based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel pat terns including parallel - for, parallelmore » - reduce, and parallel - scan. Current R&D is focusing on hierarchical parallel patterns such as a directed acyclic graph (DAG) of asynchronous tasks where each task contain s nested data parallel algorithms. This five y ear plan includes R&D required to f ully and performance portably exploit thread parallelism across current and anticipated next generation platforms (NGP). The Kokkos library is being evaluated by many projects exploring algorithm s and code design for NGP. Some production libraries and applications such as Trilinos and LAMMPS have already committed to Kokkos as their foundation for manycore parallelism an d performance portability. These five year requirements includes support required for current and antic ipated ASC projects to be effective and productive in their use of Kokkos on NGP. The greatest risk to the success of Kokkos and ASC projects relying upon Kokkos is a lack of staffing resources to support Kokkos to the degree needed by these ASC projects. This support includes up - to - date tutorials, documentation, multi - platform (hardware and software stack) testing, minor feature enhancements, thread - scalable algorithm consulting, and managing collaborative R&D.« less
Defining the Role of the Online Therapeutic Facilitator: Principles and Guidelines Developed for Couplelinks, an Online Support Program for Couples Affected by Breast Cancer

PubMed Central

McLeod, Deborah; Stephen, Joanne

2015-01-01

Development of psychological interventions delivered via the Internet is a rapidly growing field with the potential to make vital services more accessible. However, there is a corresponding need for careful examination of factors that contribute to effectiveness of Internet-delivered interventions, especially given the observed high dropout rates relative to traditional in-person (IP) interventions. Research has found that the involvement of an online therapist in a Web-based intervention reduces treatment dropout. However, the role of such online therapists is seldom well articulated and varies considerably across programs making it difficult to discern processes that are important for online therapist involvement.In this paper, we introduce the concept of “therapeutic facilitation” to describe the role of the online therapist that was developed and further refined in the context of a Web-based, asynchronous psychosocial intervention for couples affected by breast cancer called Couplelinks. Couplelinks is structured into 6 dyadic learning modules designed to be completed on a weekly basis in consultation with a facilitator through regular, asynchronous, online text-based communication.Principles of therapeutic facilitation derived from a combination of theory underlying the intervention and pilot-testing of the first iteration of the program are described. Case examples to illustrate these principles as well as commonly encountered challenges to online facilitation are presented. Guidelines and principles for therapeutic facilitation hold relevance for professionally delivered online programs more broadly, beyond interventions for couples and cancer. PMID:28410159
Asynchronous ice lobe retreat and glacial Lake Bascom: Deglaciation of the Hoosic and Vermont valleys, southwestern Vermont

DOE Office of Scientific and Technical Information (OSTI.GOV)

Small, E.; Desimone, D.

Deglaciation of the Hoosic River drainage basin in southwestern Vermont was more complex than previously described. Detailed surficial mapping, stratigraphic relationships, and terrace levels/delta elevations reveal new details in the chronology of glacial Lake Bascom: (1) a pre-Wisconsinan proglacial lake was present in a similar position to Lake Bascom as ice advanced: (2) the northern margin of 275m (900 ft) glacial Lake Bascom extended 10 km up the Vermont Valley; (3) the 215m (705 ft) Bascom level was stable and long lived; (4) intermediate water planes existed between 215m and 190m (625 ft) levels; and (5) a separate ice tonguemore » existed in Shaftsbury Hollow damming a small glacial lake, here named glacial Lake Emmons. This information is used to correlate ice margins to different lake levels. Distance of ice margin retreat during a lake level can be measured. Lake levels are then used as control points on a Lake Bascom relative time line to compare rate of retreat of different ice tongues. Correlation of ice margins to Bascom levels indicates ice retreat was asynchronous between nearby tongues in southwestern Vermont. The Vermont Valley ice tongue retreated between two and four times faster than the Hoosic Valley tongue during the Bascom 275m level. Rate of retreat of the Vermont Valley tongue slowed to one-half of the Hoosic tongue during the 215m--190m lake levels. Factors responsible for varying rates of retreat are subglacial bedrock gradient, proximity to the Hudson-Champlain lobe, and the presence of absence of a calving margins. Asynchronous retreat produced splayed ice margins in southwestern Vermont. Findings from this study do not support the model of parallel, synchronous retreat proposed by many workers for this region.« less
Scalable asynchronous execution of cellular automata

NASA Astrophysics Data System (ADS)

Folino, Gianluigi; Giordano, Andrea; Mastroianni, Carlo

2016-10-01

The performance and scalability of cellular automata, when executed on parallel/distributed machines, are limited by the necessity of synchronizing all the nodes at each time step, i.e., a node can execute only after the execution of the previous step at all the other nodes. However, these synchronization requirements can be relaxed: a node can execute one step after synchronizing only with the adjacent nodes. In this fashion, different nodes can execute different time steps. This can be a notable advantageous in many novel and increasingly popular applications of cellular automata, such as smart city applications, simulation of natural phenomena, etc., in which the execution times can be different and variable, due to the heterogeneity of machines and/or data and/or executed functions. Indeed, a longer execution time at a node does not slow down the execution at all the other nodes but only at the neighboring nodes. This is particularly advantageous when the nodes that act as bottlenecks vary during the application execution. The goal of the paper is to analyze the benefits that can be achieved with the described asynchronous implementation of cellular automata, when compared to the classical all-to-all synchronization pattern. The performance and scalability have been evaluated through a Petri net model, as this model is very useful to represent the synchronization barrier among nodes. We examined the usual case in which the territory is partitioned into a number of regions, and the computation associated with a region is assigned to a computing node. We considered both the cases of mono-dimensional and two-dimensional partitioning. The results show that the advantage obtained through the asynchronous execution, when compared to the all-to-all synchronous approach is notable, and it can be as large as 90% in terms of speedup.

FPGA implementation of low complexity LDPC iterative decoder

NASA Astrophysics Data System (ADS)

Verma, Shivani; Sharma, Sanjay

2016-07-01

Low-density parity-check (LDPC) codes, proposed by Gallager, emerged as a class of codes which can yield very good performance on the additive white Gaussian noise channel as well as on the binary symmetric channel. LDPC codes have gained lots of importance due to their capacity achieving property and excellent performance in the noisy channel. Belief propagation (BP) algorithm and its approximations, most notably min-sum, are popular iterative decoding algorithms used for LDPC and turbo codes. The trade-off between the hardware complexity and the decoding throughput is a critical factor in the implementation of the practical decoder. This article presents introduction to LDPC codes and its various decoding algorithms followed by realisation of LDPC decoder by using simplified message passing algorithm and partially parallel decoder architecture. Simplified message passing algorithm has been proposed for trade-off between low decoding complexity and decoder performance. It greatly reduces the routing and check node complexity of the decoder. Partially parallel decoder architecture possesses high speed and reduced complexity. The improved design of the decoder possesses a maximum symbol throughput of 92.95 Mbps and a maximum of 18 decoding iterations. The article presents implementation of 9216 bits, rate-1/2, (3, 6) LDPC decoder on Xilinx XC3D3400A device from Spartan-3A DSP family.
Drifts, currents, and power scrape-off width in SOLPS-ITER modeling of DIII-D

DOE PAGES

Meier, E. T.; Goldston, R. J.; Kaveeva, E. G.; ...

2016-12-27

The effects of drifts and associated flows and currents on the width of the parallel heat flux channel (λ q) in the tokamak scrape-off layer (SOL) are analyzed using the SOLPS-ITER 2D fluid transport code. Motivation is supplied by Goldston’s heuristic drift (HD) model for λ q, which yields the same approximately inverse poloidal magnetic field dependence seen in multi-machine regression. The analysis, focusing on a DIII-D H-mode discharge, reveals HD-like features, including comparable density and temperature fall-off lengths in the SOL, and up-down ion pressure asymmetry that allows net cross-separatrix ion magnetic drift flux to exceed net anomalous ionmore » flux. In experimentally relevant high-recycling cases, scans of both toroidal and poloidal magnetic field (B tor and B pol) are conducted, showing minimal λ q dependence on either component of the field. Insensitivity to B tor is expected, and suggests that SOLPS-ITER is effectively capturing some aspects of HD physics. Absence of λ q dependence on B pol, however, is inconsistent with both the HD model and experimental results. As a result, the inconsistency is attributed to strong variation in the parallel Mach number, which violates one of the premises of the HD model.« less
Detection of mouse liver cancer via a parallel iterative shrinkage method in hybrid optical/microcomputed tomography imaging

NASA Astrophysics Data System (ADS)

Wu, Ping; Liu, Kai; Zhang, Qian; Xue, Zhenwen; Li, Yongbao; Ning, Nannan; Yang, Xin; Li, Xingde; Tian, Jie

2012-12-01

Liver cancer is one of the most common malignant tumors worldwide. In order to enable the noninvasive detection of small liver tumors in mice, we present a parallel iterative shrinkage (PIS) algorithm for dual-modality tomography. It takes advantage of microcomputed tomography and multiview bioluminescence imaging, providing anatomical structure and bioluminescence intensity information to reconstruct the size and location of tumors. By incorporating prior knowledge of signal sparsity, we associate some mathematical strategies including specific smooth convex approximation, an iterative shrinkage operator, and affine subspace with the PIS method, which guarantees the accuracy, efficiency, and reliability for three-dimensional reconstruction. Then an in vivo experiment on the bead-implanted mouse has been performed to validate the feasibility of this method. The findings indicate that a tiny lesion less than 3 mm in diameter can be localized with a position bias no more than 1 mm the computational efficiency is one to three orders of magnitude faster than the existing algorithms; this approach is robust to the different regularization parameters and the lp norms. Finally, we have applied this algorithm to another in vivo experiment on an HCCLM3 orthotopic xenograft mouse model, which suggests the PIS method holds the promise for practical applications of whole-body cancer detection.
Acceptability of an Asynchronous Learning Forum on Mobile Devices

ERIC Educational Resources Information Center

Chang, Chih-Kai

2010-01-01

Mobile learning has recently become noteworthy because mobile devices have become popular. To construct an asynchronous learning forum on mobile devices is important because an asynchronous learning forum is always an essential part of networked asynchronous distance learning. However, the input interface in handheld learning devices, which is…
The Influence of Asynchronous Video Communication on Learner Social Presence: A Narrative Analysis of Four Cases

ERIC Educational Resources Information Center

Borup, Jered; West, Richard E.; Graham, Charles R.

2013-01-01

Online courses are increasingly using asynchronous video communication. However, little is known about how asynchronous video communication influences students' communication patterns. This study presents four narratives of students with varying characteristics who engaged in asynchronous video communication. The extrovert valued the efficiency of…
Algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations with the use of parallel computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moryakov, A. V., E-mail: sailor@orc.ru

2016-12-15

An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
A Framework for Load Balancing of Tensor Contraction Expressions via Dynamic Task Partitioning

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lai, Pai-Wei; Stock, Kevin; Rajbhandari, Samyam

In this paper, we introduce the Dynamic Load-balanced Tensor Contractions (DLTC), a domain-specific library for efficient task parallel execution of tensor contraction expressions, a class of computation encountered in quantum chemistry and physics. Our framework decomposes each contraction into smaller unit of tasks, represented by an abstraction referred to as iterators. We exploit an extra level of parallelism by having tasks across independent contractions executed concurrently through a dynamic load balancing run- time. We demonstrate the improved performance, scalability, and flexibility for the computation of tensor contraction expressions on parallel computers using examples from coupled cluster methods.
Fast quantum Monte Carlo on a GPU

NASA Astrophysics Data System (ADS)

Lutsyshyn, Y.

2015-02-01

We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
Rectangular Array Of Digital Processors For Planning Paths

NASA Technical Reports Server (NTRS)

Kemeny, Sabrina E.; Fossum, Eric R.; Nixon, Robert H.

1993-01-01

Prototype 24 x 25 rectangular array of asynchronous parallel digital processors rapidly finds best path across two-dimensional field, which could be patch of terrain traversed by robotic or military vehicle. Implemented as single-chip very-large-scale integrated circuit. Excepting processors on edges, each processor communicates with four nearest neighbors along paths representing travel to north, south, east, and west. Each processor contains delay generator in form of 8-bit ripple counter, preset to 1 of 256 possible values. Operation begins with choice of processor representing starting point. Transmits signals to nearest neighbor processors, which retransmits to other neighboring processors, and process repeats until signals propagated across entire field.
Chebyshev polynomial filtered subspace iteration in the discontinuous Galerkin method for large-scale electronic structure calculations

DOE PAGES

Banerjee, Amartya S.; Lin, Lin; Hu, Wei; ...

2016-10-21

The Discontinuous Galerkin (DG) electronic structure method employs an adaptive local basis (ALB) set to solve the Kohn-Sham equations of density functional theory in a discontinuous Galerkin framework. The adaptive local basis is generated on-the-fly to capture the local material physics and can systematically attain chemical accuracy with only a few tens of degrees of freedom per atom. A central issue for large-scale calculations, however, is the computation of the electron density (and subsequently, ground state properties) from the discretized Hamiltonian in an efficient and scalable manner. We show in this work how Chebyshev polynomial filtered subspace iteration (CheFSI) canmore » be used to address this issue and push the envelope in large-scale materials simulations in a discontinuous Galerkin framework. We describe how the subspace filtering steps can be performed in an efficient and scalable manner using a two-dimensional parallelization scheme, thanks to the orthogonality of the DG basis set and block-sparse structure of the DG Hamiltonian matrix. The on-the-fly nature of the ALB functions requires additional care in carrying out the subspace iterations. We demonstrate the parallel scalability of the DG-CheFSI approach in calculations of large-scale twodimensional graphene sheets and bulk three-dimensional lithium-ion electrolyte systems. In conclusion, employing 55 296 computational cores, the time per self-consistent field iteration for a sample of the bulk 3D electrolyte containing 8586 atoms is 90 s, and the time for a graphene sheet containing 11 520 atoms is 75 s.« less
Acceleration of linear stationary iterative processes in multiprocessor computers. II

DOE Office of Scientific and Technical Information (OSTI.GOV)

Romm, Ya.E.

1982-05-01

For pt.I, see Kibernetika, vol.18, no.1, p.47 (1982). For pt.I, see Cybernetics, vol.18, no.1, p.54 (1982). Considers a reduced system of linear algebraic equations x=ax+b, where a=(a/sub ij/) is a real n*n matrix; b is a real vector with common euclidean norm >>>. It is supposed that the existence and uniqueness of solution det (0-a) not equal to e is given, where e is a unit matrix. The linear iterative process converging to x x/sup (k+1)/=fx/sup (k)/, k=0, 1, 2, ..., where the operator f translates r/sup n/ into r/sup n/. In considering implementation of the iterative process (ip) inmore » a multiprocessor system, it is assumed that the number of processors is constant, and are various values of the latter investigated; it is assumed in addition, that the processors perform elementary binary arithmetic operations of addition and multiestimates only include the time of execution of arithmetic operations. With any paralleling of individual iteration, the execution time of the ip is proportional to the number of sequential steps k+1. The author sets the task of reducing the number of sequential steps in the ip so as to execute it in a time proportional to a value smaller than k+1. He also sets the goal of formulating a method of accelerated bit serial-parallel execution of each successive step of the ip, with, in the modification sought, a reduced number of steps in a time comparable to the operation time of logical elements. 6 references.« less
Reconstruction of coded aperture images

NASA Technical Reports Server (NTRS)

Bielefeld, Michael J.; Yin, Lo I.

1987-01-01

Balanced correlation method and the Maximum Entropy Method (MEM) were implemented to reconstruct a laboratory X-ray source as imaged by a Uniformly Redundant Array (URA) system. Although the MEM method has advantages over the balanced correlation method, it is computationally time consuming because of the iterative nature of its solution. Massively Parallel Processing, with its parallel array structure is ideally suited for such computations. These preliminary results indicate that it is possible to use the MEM method in future coded-aperture experiments with the help of the MPP.
Research in Computational Aeroscience Applications Implemented on Advanced Parallel Computing Systems

NASA Technical Reports Server (NTRS)

Wigton, Larry

1996-01-01

Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
Pharmacists' perception of synchronous versus asynchronous distance learning for continuing education programs.

PubMed

Buxton, Eric C

2014-02-12

To evaluate and compare pharmacists' satisfaction with the content and learning environment of a continuing education program series offered as either synchronous or asynchronous webinars. An 8-lecture series of online presentations on the topic of new drug therapies was offered to pharmacists in synchronous and asynchronous webinar formats. Participants completed a 50-question online survey at the end of the program series to evaluate their perceptions of the distance learning experience. Eighty-two participants completed the survey instrument (41 participants from the live webinar series and 41 participants from the asynchronous webinar series.) Responses indicated that while both groups were satisfied with the program content, the asynchronous group showed greater satisfaction with many aspects of the learning environment. The synchronous and asynchronous webinar participants responded positively regarding the quality of the programming and the method of delivery, but asynchronous participants rated their experience more positively overall.
Pharmacists’ Perception of Synchronous Versus Asynchronous Distance Learning for Continuing Education Programs

PubMed Central

2014-01-01

Objective. To evaluate and compare pharmacists’ satisfaction with the content and learning environment of a continuing education program series offered as either synchronous or asynchronous webinars. Methods. An 8-lecture series of online presentations on the topic of new drug therapies was offered to pharmacists in synchronous and asynchronous webinar formats. Participants completed a 50-question online survey at the end of the program series to evaluate their perceptions of the distance learning experience. Results. Eighty-two participants completed the survey instrument (41 participants from the live webinar series and 41 participants from the asynchronous webinar series.) Responses indicated that while both groups were satisfied with the program content, the asynchronous group showed greater satisfaction with many aspects of the learning environment. Conclusion. The synchronous and asynchronous webinar participants responded positively regarding the quality of the programming and the method of delivery, but asynchronous participants rated their experience more positively overall. PMID:24558276
Comparing the force ripple during asynchronous and conventional stimulation.

PubMed

Downey, Ryan J; Tate, Mark; Kawai, Hiroyuki; Dixon, Warren E

2014-10-01

Asynchronous stimulation has been shown to reduce fatigue during electrical stimulation; however, it may also exhibit a force ripple. We quantified the ripple during asynchronous and conventional single-channel transcutaneous stimulation across a range of stimulation frequencies. The ripple was measured during 5 asynchronous stimulation protocols, 2 conventional stimulation protocols, and 3 volitional contractions in 12 healthy individuals. Conventional 40 Hz and asynchronous 16 Hz stimulation were found to induce contractions that were as smooth as volitional contractions. Asynchronous 8, 10, and 12 Hz stimulation induced contractions with significant ripple. Lower stimulation frequencies can reduce fatigue; however, they may also lead to increased ripple. Future efforts should study the relationship between force ripple and the smoothness of the evoked movements in addition to the relationship between stimulation frequency and NMES-induced fatigue to elucidate an optimal stimulation frequency for asynchronous stimulation. © 2014 Wiley Periodicals, Inc.
Synchronous Office Hours in an Asynchronous Course: Making the Connection

ERIC Educational Resources Information Center

Gibbons-Kunka, Beatrice

2017-01-01

The notion of synchronous office hours in an asynchronous course seems counterintuitive. After all, one of the tenets of asynchronous education is to not require students to be online and participating at any time during the course. Having taught higher education online asynchronous courses for twenty years, the researcher experimented with online…
Parallel-vector computation for linear structural analysis and non-linear unconstrained optimization problems

NASA Technical Reports Server (NTRS)

Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.

1991-01-01

Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.
A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.

PubMed

Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie

2014-01-01

It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
Parallel Ellipsoidal Perfectly Matched Layers for Acoustic Helmholtz Problems on Exterior Domains

DOE PAGES

Bunting, Gregory; Prakash, Arun; Walsh, Timothy; ...

2018-01-26

Exterior acoustic problems occur in a wide range of applications, making the finite element analysis of such problems a common practice in the engineering community. Various methods for truncating infinite exterior domains have been developed, including absorbing boundary conditions, infinite elements, and more recently, perfectly matched layers (PML). PML are gaining popularity due to their generality, ease of implementation, and effectiveness as an absorbing boundary condition. PML formulations have been developed in Cartesian, cylindrical, and spherical geometries, but not ellipsoidal. In addition, the parallel solution of PML formulations with iterative solvers for the solution of the Helmholtz equation, and howmore » this compares with more traditional strategies such as infinite elements, has not been adequately investigated. In this study, we present a parallel, ellipsoidal PML formulation for acoustic Helmholtz problems. To faciliate the meshing process, the ellipsoidal PML layer is generated with an on-the-fly mesh extrusion. Though the complex stretching is defined along ellipsoidal contours, we modify the Jacobian to include an additional mapping back to Cartesian coordinates in the weak formulation of the finite element equations. This allows the equations to be solved in Cartesian coordinates, which is more compatible with existing finite element software, but without the necessity of dealing with corners in the PML formulation. Herein we also compare the conditioning and performance of the PML Helmholtz problem with infinite element approach that is based on high order basis functions. On a set of representative exterior acoustic examples, we show that high order infinite element basis functions lead to an increasing number of Helmholtz solver iterations, whereas for PML the number of iterations remains constant for the same level of accuracy. Finally, this provides an additional advantage of PML over the infinite element approach.« less

Parallel Ellipsoidal Perfectly Matched Layers for Acoustic Helmholtz Problems on Exterior Domains

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bunting, Gregory; Prakash, Arun; Walsh, Timothy

Exterior acoustic problems occur in a wide range of applications, making the finite element analysis of such problems a common practice in the engineering community. Various methods for truncating infinite exterior domains have been developed, including absorbing boundary conditions, infinite elements, and more recently, perfectly matched layers (PML). PML are gaining popularity due to their generality, ease of implementation, and effectiveness as an absorbing boundary condition. PML formulations have been developed in Cartesian, cylindrical, and spherical geometries, but not ellipsoidal. In addition, the parallel solution of PML formulations with iterative solvers for the solution of the Helmholtz equation, and howmore » this compares with more traditional strategies such as infinite elements, has not been adequately investigated. In this study, we present a parallel, ellipsoidal PML formulation for acoustic Helmholtz problems. To faciliate the meshing process, the ellipsoidal PML layer is generated with an on-the-fly mesh extrusion. Though the complex stretching is defined along ellipsoidal contours, we modify the Jacobian to include an additional mapping back to Cartesian coordinates in the weak formulation of the finite element equations. This allows the equations to be solved in Cartesian coordinates, which is more compatible with existing finite element software, but without the necessity of dealing with corners in the PML formulation. Herein we also compare the conditioning and performance of the PML Helmholtz problem with infinite element approach that is based on high order basis functions. On a set of representative exterior acoustic examples, we show that high order infinite element basis functions lead to an increasing number of Helmholtz solver iterations, whereas for PML the number of iterations remains constant for the same level of accuracy. Finally, this provides an additional advantage of PML over the infinite element approach.« less
Agile parallel bioinformatics workflow management using Pwrake.

PubMed

Mishima, Hiroyuki; Sasaki, Kensaku; Tanaka, Masahiro; Tatebe, Osamu; Yoshiura, Koh-Ichiro

2011-09-08

In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to fulfil the demand for workflow management. However, such systems tend to be over-weighted for actual bioinformatics practices. We realize that quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment are often prioritized in scientific workflow management. These features have a greater affinity with the agile software development method through iterative development phases after trial and error.Here, we show the application of a scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, the flexibility of which has been demonstrated in the astronomy domain. Therefore, we hypothesize that Pwrake also has advantages in actual bioinformatics workflows. We implemented the Pwrake workflows to process next generation sequencing data using the Genomic Analysis Toolkit (GATK) and Dindel. GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that in practice, actual scientific workflow development iterates over two phases, the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two developmental phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate modularity of the GATK and Dindel workflows. Pwrake enables agile management of scientific workflows in the bioinformatics domain. The internal domain specific language design built on Ruby gives the flexibility of rakefiles for writing scientific workflows. Furthermore, readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows.
Agile parallel bioinformatics workflow management using Pwrake

PubMed Central

2011-01-01

Background In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to fulfil the demand for workflow management. However, such systems tend to be over-weighted for actual bioinformatics practices. We realize that quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment are often prioritized in scientific workflow management. These features have a greater affinity with the agile software development method through iterative development phases after trial and error. Here, we show the application of a scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, the flexibility of which has been demonstrated in the astronomy domain. Therefore, we hypothesize that Pwrake also has advantages in actual bioinformatics workflows. Findings We implemented the Pwrake workflows to process next generation sequencing data using the Genomic Analysis Toolkit (GATK) and Dindel. GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that in practice, actual scientific workflow development iterates over two phases, the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two developmental phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate modularity of the GATK and Dindel workflows. Conclusions Pwrake enables agile management of scientific workflows in the bioinformatics domain. The internal domain specific language design built on Ruby gives the flexibility of rakefiles for writing scientific workflows. Furthermore, readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows. PMID:21899774
Multilevel acceleration of scattering-source iterations with application to electron transport

DOE PAGES

Drumm, Clif; Fan, Wesley

2017-08-18

Acceleration/preconditioning strategies available in the SCEPTRE radiation transport code are described. A flexible transport synthetic acceleration (TSA) algorithm that uses a low-order discrete-ordinates (S N) or spherical-harmonics (P N) solve to accelerate convergence of a high-order S N source-iteration (SI) solve is described. Convergence of the low-order solves can be further accelerated by applying off-the-shelf incomplete-factorization or algebraic-multigrid methods. Also available is an algorithm that uses a generalized minimum residual (GMRES) iterative method rather than SI for convergence, using a parallel sweep-based solver to build up a Krylov subspace. TSA has been applied as a preconditioner to accelerate the convergencemore » of the GMRES iterations. The methods are applied to several problems involving electron transport and problems with artificial cross sections with large scattering ratios. These methods were compared and evaluated by considering material discontinuities and scattering anisotropy. Observed accelerations obtained are highly problem dependent, but speedup factors around 10 have been observed in typical applications.« less
Iterative Importance Sampling Algorithms for Parameter Estimation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Grout, Ray W; Morzfeld, Matthias; Day, Marcus S.

In parameter estimation problems one computes a posterior distribution over uncertain parameters defined jointly by a prior distribution, a model, and noisy data. Markov chain Monte Carlo (MCMC) is often used for the numerical solution of such problems. An alternative to MCMC is importance sampling, which can exhibit near perfect scaling with the number of cores on high performance computing systems because samples are drawn independently. However, finding a suitable proposal distribution is a challenging task. Several sampling algorithms have been proposed over the past years that take an iterative approach to constructing a proposal distribution. We investigate the applicabilitymore » of such algorithms by applying them to two realistic and challenging test problems, one in subsurface flow, and one in combustion modeling. More specifically, we implement importance sampling algorithms that iterate over the mean and covariance matrix of Gaussian or multivariate t-proposal distributions. Our implementation leverages massively parallel computers, and we present strategies to initialize the iterations using 'coarse' MCMC runs or Gaussian mixture models.« less
An epigenetic state associated with areas of gene duplication

PubMed Central

Gimelbrant, Alexander A.; Chess, Andrew

2006-01-01

Asynchronous DNA replication is an epigenetically determined feature found in all cases of monoallelic expression, including genomic imprinting, X-inactivation, and random monoallelic expression of autosomal genes such as immunoglobulins and olfactory receptor genes. Most genes of the latter class were identified in experiments focused on genes functioning in the chemosensory and immune systems. We performed an unbiased survey of asynchronous replication in the mouse genome, excluding known asynchronously replicated genes. Fully 10% (eight of 80) of the genes tested exhibited asynchronous replication. A common feature of the newly identified asynchronously replicated areas is their proximity to areas of tandem gene duplication. Testing of other clustered areas supported the idea that such regions are enriched with asynchronously replicated genes. PMID:16687731
Leveraging human oversight and intervention in large-scale parallel processing of open-source data

NASA Astrophysics Data System (ADS)

Casini, Enrico; Suri, Niranjan; Bradshaw, Jeffrey M.

2015-05-01

The popularity of cloud computing along with the increased availability of cheap storage have led to the necessity of elaboration and transformation of large volumes of open-source data, all in parallel. One way to handle such extensive volumes of information properly is to take advantage of distributed computing frameworks like Map-Reduce. Unfortunately, an entirely automated approach that excludes human intervention is often unpredictable and error prone. Highly accurate data processing and decision-making can be achieved by supporting an automatic process through human collaboration, in a variety of environments such as warfare, cyber security and threat monitoring. Although this mutual participation seems easily exploitable, human-machine collaboration in the field of data analysis presents several challenges. First, due to the asynchronous nature of human intervention, it is necessary to verify that once a correction is made, all the necessary reprocessing is done in chain. Second, it is often needed to minimize the amount of reprocessing in order to optimize the usage of resources due to limited availability. In order to improve on these strict requirements, this paper introduces improvements to an innovative approach for human-machine collaboration in the processing of large amounts of open-source data in parallel.
Capabilities of Fully Parallelized MHD Stability Code MARS

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2016-10-01

Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. Parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
Fully Parallel MHD Stability Analysis Tool

NASA Astrophysics Data System (ADS)

Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

2015-11-01

Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fix boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.
HPC-NMF: A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kannan, Ramakrishnan; Sukumar, Sreenivas R.; Ballard, Grey M.

NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems formore » $$\\WW$$ and $$\\HH$$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementation, our algorithm is also flexible: It performs well for both dense and sparse matrices, and allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors $$\\WW$$ and $$\\HH$$ within the alternating iterations.« less
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods

PubMed Central

Smith, David S.; Gore, John C.; Yankeelov, Thomas E.; Welch, E. Brian

2012-01-01

Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 40962 or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 10242 and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images. PMID:22481908
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods.

PubMed

Smith, David S; Gore, John C; Yankeelov, Thomas E; Welch, E Brian

2012-01-01

Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 4096(2) or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 1024(2) and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images.
New Parallel Algorithms for Structural Analysis and Design of Aerospace Structures

NASA Technical Reports Server (NTRS)

Nguyen, Duc T.

1998-01-01

Subspace and Lanczos iterations have been developed, well documented, and widely accepted as efficient methods for obtaining p-lowest eigen-pair solutions of large-scale, practical engineering problems. The focus of this paper is to incorporate recent developments in vectorized sparse technologies in conjunction with Subspace and Lanczos iterative algorithms for computational enhancements. Numerical performance, in terms of accuracy and efficiency of the proposed sparse strategies for Subspace and Lanczos algorithm, is demonstrated by solving for the lowest frequencies and mode shapes of structural problems on the IBM-R6000/590 and SunSparc 20 workstations.
A Phenomenological Synapse Model for Asynchronous Neurotransmitter Release

PubMed Central

Wang, Tao; Yin, Luping; Zou, Xiaolong; Shu, Yousheng; Rasch, Malte J.; Wu, Si

2016-01-01

Neurons communicate with each other via synapses. Action potentials cause release of neurotransmitters at the axon terminal. Typically, this neurotransmitter release is tightly time-locked to the arrival of an action potential and is thus called synchronous release. However, neurotransmitter release is stochastic and the rate of release of small quanta of neurotransmitters can be considerably elevated even long after the ceasing of spiking activity, leading to asynchronous release of neurotransmitters. Such asynchronous release varies for tissue and neuron types and has been shown recently to be pronounced in fast-spiking neurons. Notably, it was found that asynchronous release is enhanced in human epileptic tissue implicating a possibly important role in generating abnormal neural activity. Current neural network models for simulating and studying neural activity virtually only consider synchronous release and ignore asynchronous transmitter release. Here, we develop a phenomenological model for asynchronous neurotransmitter release, which, on one hand, captures the fundamental features of the asynchronous release process, and, on the other hand, is simple enough to be incorporated in large-size network simulations. Our proposed model is based on the well-known equations for short-term dynamical synaptic interactions and includes an additional stochastic term for modeling asynchronous release. We use experimental data obtained from inhibitory fast-spiking synapses of human epileptic tissue to fit the model parameters, and demonstrate that our model reproduces the characteristics of realistic asynchronous transmitter release. PMID:26834617
High-throughput GPU-based LDPC decoding

NASA Astrophysics Data System (ADS)

Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin

2010-08-01

Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, WI-FI and 10GBASE-T. LDPC for the needs of reliable and flexible communication links for a wide variety of communication standards and configurations have inspired the demand for high-performance and flexibility computing. Accordingly, finding a fast and reconfigurable developing platform for designing the high-throughput LDPC decoder has become important especially for rapidly changing communication standards and configurations. In this paper, a new graphic-processing-unit (GPU) LDPC decoding platform with the asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
Segmentation of remotely sensed data using parallel region growing

NASA Technical Reports Server (NTRS)

Tilton, J. C.; Cox, S. C.

1983-01-01

The improved spatial resolution of the new earth resources satellites will increase the need for effective utilization of spatial information in machine processing of remotely sensed data. One promising technique is scene segmentation by region growing. Region growing can use spatial information in two ways: only spatially adjacent regions merge together, and merging criteria can be based on region-wide spatial features. A simple region growing approach is described in which the similarity criterion is based on region mean and variance (a simple spatial feature). An effective way to implement region growing for remote sensing is as an iterative parallel process on a large parallel processor. A straightforward parallel pixel-based implementation of the algorithm is explored and its efficiency is compared with sequential pixel-based, sequential region-based, and parallel region-based implementations. Experimental results from on aircraft scanner data set are presented, as is a discussioon of proposed improvements to the segmentation algorithm.
Parallel Domain Decomposition Formulation and Software for Large-Scale Sparse Symmetrical/Unsymmetrical Aeroacoustic Applications

NASA Technical Reports Server (NTRS)

Nguyen, D. T.; Watson, Willie R. (Technical Monitor)

2005-01-01

The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
User's Guide for ENSAERO_FE Parallel Finite Element Solver

NASA Technical Reports Server (NTRS)

Eldred, Lloyd B.; Guruswamy, Guru P.

1999-01-01

A high fidelity parallel static structural analysis capability is created and interfaced to the multidisciplinary analysis package ENSAERO-MPI of Ames Research Center. This new module replaces ENSAERO's lower fidelity simple finite element and modal modules. Full aircraft structures may be more accurately modeled using the new finite element capability. Parallel computation is performed by breaking the full structure into multiple substructures. This approach is conceptually similar to ENSAERO's multizonal fluid analysis capability. The new substructure code is used to solve the structural finite element equations for each substructure in parallel. NASTRANKOSMIC is utilized as a front end for this code. Its full library of elements can be used to create an accurate and realistic aircraft model. It is used to create the stiffness matrices for each substructure. The new parallel code then uses an iterative preconditioned conjugate gradient method to solve the global structural equations for the substructure boundary nodes.
Stochastic bifurcations in the nonlinear parallel Ising model.

PubMed

Bagnoli, Franco; Rechtman, Raúl

2016-11-01

We investigate the phase transitions of a nonlinear, parallel version of the Ising model, characterized by an antiferromagnetic linear coupling and ferromagnetic nonlinear one. This model arises in problems of opinion formation. The mean-field approximation shows chaotic oscillations, by changing the couplings or the connectivity. The spatial model shows bifurcations in the average magnetization, similar to that seen in the mean-field approximation, induced by the change of the topology, after rewiring short-range to long-range connection, as predicted by the small-world effect. These coherent periodic and chaotic oscillations of the magnetization reflect a certain degree of synchronization of the spins, induced by long-range couplings. Similar bifurcations may be induced in the randomly connected model by changing the couplings or the connectivity and also the dilution (degree of asynchronism) of the updating. We also examined the effects of inhomogeneity, mixing ferromagnetic and antiferromagnetic coupling, which induces an unexpected bifurcation diagram with a "bubbling" behavior, as also happens for dilution.
Concurrent simulation of a parallel jaw end effector

NASA Technical Reports Server (NTRS)

Bynum, Bill

1985-01-01

A system of programs developed to aid in the design and development of the command/response protocol between a parallel jaw end effector and the strategic planner program controlling it are presented. The system executes concurrently with the LISP controlling program to generate a graphical image of the end effector that moves in approximately real time in response to commands sent from the controlling program. Concurrent execution of the simulation program is useful for revealing flaws in the communication command structure arising from the asynchronous nature of the message traffic between the end effector and the strategic planner. Software simulation helps to minimize the number of hardware changes necessary to the microprocessor driving the end effector because of changes in the communication protocol. The simulation of other actuator devices can be easily incorporated into the system of programs by using the underlying support that was developed for the concurrent execution of the simulation process and the communication between it and the controlling program.

Non-Cartesian Parallel Imaging Reconstruction

PubMed Central

Wright, Katherine L.; Hamilton, Jesse I.; Griswold, Mark A.; Gulani, Vikas; Seiberlich, Nicole

2014-01-01

Non-Cartesian parallel imaging has played an important role in reducing data acquisition time in MRI. The use of non-Cartesian trajectories can enable more efficient coverage of k-space, which can be leveraged to reduce scan times. These trajectories can be undersampled to achieve even faster scan times, but the resulting images may contain aliasing artifacts. Just as Cartesian parallel imaging can be employed to reconstruct images from undersampled Cartesian data, non-Cartesian parallel imaging methods can mitigate aliasing artifacts by using additional spatial encoding information in the form of the non-homogeneous sensitivities of multi-coil phased arrays. This review will begin with an overview of non-Cartesian k-space trajectories and their sampling properties, followed by an in-depth discussion of several selected non-Cartesian parallel imaging algorithms. Three representative non-Cartesian parallel imaging methods will be described, including Conjugate Gradient SENSE (CG SENSE), non-Cartesian GRAPPA, and Iterative Self-Consistent Parallel Imaging Reconstruction (SPIRiT). After a discussion of these three techniques, several potential promising clinical applications of non-Cartesian parallel imaging will be covered. PMID:24408499
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets

DOE PAGES

Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...

2017-01-28

Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De

Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Virtual fringe projection system with nonparallel illumination based on iteration

NASA Astrophysics Data System (ADS)

Zhou, Duo; Wang, Zhangying; Gao, Nan; Zhang, Zonghua; Jiang, Xiangqian

2017-06-01

Fringe projection profilometry has been widely applied in many fields. To set up an ideal measuring system, a virtual fringe projection technique has been studied to assist in the design of hardware configurations. However, existing virtual fringe projection systems use parallel illumination and have a fixed optical framework. This paper presents a virtual fringe projection system with nonparallel illumination. Using an iterative method to calculate intersection points between rays and reference planes or object surfaces, the proposed system can simulate projected fringe patterns and captured images. A new explicit calibration method has been presented to validate the precision of the system. Simulated results indicate that the proposed iterative method outperforms previous systems. Our virtual system can be applied to error analysis, algorithm optimization, and help operators to find ideal system parameter settings for actual measurements.
Variable aperture-based ptychographical iterative engine method

NASA Astrophysics Data System (ADS)

Sun, Aihui; Kong, Yan; Meng, Xin; He, Xiaoliang; Du, Ruijun; Jiang, Zhilong; Liu, Fei; Xue, Liang; Wang, Shouyu; Liu, Cheng

2018-02-01

A variable aperture-based ptychographical iterative engine (vaPIE) is demonstrated both numerically and experimentally to reconstruct the sample phase and amplitude rapidly. By adjusting the size of a tiny aperture under the illumination of a parallel light beam to change the illumination on the sample step by step and recording the corresponding diffraction patterns sequentially, both the sample phase and amplitude can be faithfully reconstructed with a modified ptychographical iterative engine (PIE) algorithm. Since many fewer diffraction patterns are required than in common PIE and the shape, the size, and the position of the aperture need not to be known exactly, this proposed vaPIE method remarkably reduces the data acquisition time and makes PIE less dependent on the mechanical accuracy of the translation stage; therefore, the proposed technique can be potentially applied for various scientific researches.
Efficient ICCG on a shared memory multiprocessor

NASA Technical Reports Server (NTRS)

Hammond, Steven W.; Schreiber, Robert

1989-01-01

Different approaches are discussed for exploiting parallelism in the ICCG (Incomplete Cholesky Conjugate Gradient) method for solving large sparse symmetric positive definite systems of equations on a shared memory parallel computer. Techniques for efficiently solving triangular systems and computing sparse matrix-vector products are explored. Three methods for scheduling the tasks in solving triangular systems are implemented on the Sequent Balance 21000. Sample problems that are representative of a large class of problems solved using iterative methods are used. We show that a static analysis to determine data dependences in the triangular solve can greatly improve its parallel efficiency. We also show that ignoring symmetry and storing the whole matrix can reduce solution time substantially.
MPF: A portable message passing facility for shared memory multiprocessors

NASA Technical Reports Server (NTRS)

Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.

1987-01-01

The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations.
Distributed spatial information integration based on web service

NASA Astrophysics Data System (ADS)

Tong, Hengjian; Zhang, Yun; Shao, Zhenfeng

2008-10-01

Spatial information systems and spatial information in different geographic locations usually belong to different organizations. They are distributed and often heterogeneous and independent from each other. This leads to the fact that many isolated spatial information islands are formed, reducing the efficiency of information utilization. In order to address this issue, we present a method for effective spatial information integration based on web service. The method applies asynchronous invocation of web service and dynamic invocation of web service to implement distributed, parallel execution of web map services. All isolated information islands are connected by the dispatcher of web service and its registration database to form a uniform collaborative system. According to the web service registration database, the dispatcher of web services can dynamically invoke each web map service through an asynchronous delegating mechanism. All of the web map services can be executed at the same time. When each web map service is done, an image will be returned to the dispatcher. After all of the web services are done, all images are transparently overlaid together in the dispatcher. Thus, users can browse and analyze the integrated spatial information. Experiments demonstrate that the utilization rate of spatial information resources is significantly raised thought the proposed method of distributed spatial information integration.
Distributed spatial information integration based on web service

NASA Astrophysics Data System (ADS)

Tong, Hengjian; Zhang, Yun; Shao, Zhenfeng

2009-10-01

Spatial information systems and spatial information in different geographic locations usually belong to different organizations. They are distributed and often heterogeneous and independent from each other. This leads to the fact that many isolated spatial information islands are formed, reducing the efficiency of information utilization. In order to address this issue, we present a method for effective spatial information integration based on web service. The method applies asynchronous invocation of web service and dynamic invocation of web service to implement distributed, parallel execution of web map services. All isolated information islands are connected by the dispatcher of web service and its registration database to form a uniform collaborative system. According to the web service registration database, the dispatcher of web services can dynamically invoke each web map service through an asynchronous delegating mechanism. All of the web map services can be executed at the same time. When each web map service is done, an image will be returned to the dispatcher. After all of the web services are done, all images are transparently overlaid together in the dispatcher. Thus, users can browse and analyze the integrated spatial information. Experiments demonstrate that the utilization rate of spatial information resources is significantly raised thought the proposed method of distributed spatial information integration.
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation

NASA Astrophysics Data System (ADS)

Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.

2011-03-01

A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine modelizations are difficult to treat for our sequential solver, based on the simplified transport equations, in terms of memory consumption and computational time. A first implementation of a Lagrangian based domain decomposition method brings to a poor parallel efficiency because of an increase in the power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
State Transition Matrix for Perturbed Orbital Motion Using Modified Chebyshev Picard Iteration

NASA Astrophysics Data System (ADS)

Read, Julie L.; Younes, Ahmad Bani; Macomber, Brent; Turner, James; Junkins, John L.

2015-06-01

The Modified Chebyshev Picard Iteration (MCPI) method has recently proven to be highly efficient for a given accuracy compared to several commonly adopted numerical integration methods, as a means to solve for perturbed orbital motion. This method utilizes Picard iteration, which generates a sequence of path approximations, and Chebyshev Polynomials, which are orthogonal and also enable both efficient and accurate function approximation. The nodes consistent with discrete Chebyshev orthogonality are generated using cosine sampling; this strategy also reduces the Runge effect and as a consequence of orthogonality, there is no matrix inversion required to find the basis function coefficients. The MCPI algorithms considered herein are parallel-structured so that they are immediately well-suited for massively parallel implementation with additional speedup. MCPI has a wide range of applications beyond ephemeris propagation, including the propagation of the State Transition Matrix (STM) for perturbed two-body motion. A solution is achieved for a spherical harmonic series representation of earth gravity (EGM2008), although the methodology is suitable for application to any gravity model. Included in this representation the normalized, Associated Legendre Functions are given and verified numerically. Modifications of the classical algorithm techniques, such as rewriting the STM equations in a second-order cascade formulation, gives rise to additional speedup. Timing results for the baseline formulation and this second-order formulation are given.
Designing Asynchronous Communication Tools for Optimization of Patient-Clinician Coordination

PubMed Central

Eschler, Jordan; Liu, Leslie S.; Vizer, Lisa M.; McClure, Jennifer B.; Lozano, Paula; Pratt, Wanda; Ralston, James D.

2015-01-01

Asynchronous communication outside the clinical setting has both enriched and complicated patient-clinician interactions. Many patients can now interact with a patient portal 24 hours a day, asking questions of their clinicians via secure message, checking lab results, ordering medication refills, or making appointments. However, the mode of communication (asynchronous) and the nature of the interaction (lacking tone or body language) strip valuable information from each side of patient-clinician asynchronous communication. Using interviews with 34 individuals who actively manage a chronic illness of their own, or for a child or partner, we elicited narratives about patients’ experiences and expectations for using asynchronous communication to address medical issues with their clinicians. Based on these perspectives, we present opportunities for designing asynchronous communication tools to better facilitate understanding of and coordination around care activities between patients and clinicians. PMID:26958188
Asynchronous Video Streaming vs. Synchronous Videoconferencing for Teaching a Pharmacogenetic Pharmacotherapy Course

PubMed Central

2007-01-01

Objectives To compare students' performance and course evaluations for a pharmacogenetic pharmacotherapy course taught by synchronous videoconferencing method via the Internet and for the same course taught via asynchronous video streaming via the Internet. Methods In spring 2005, a pharmacogenetic therapy course was taught to 73 students located on Amarillo, Lubbock, and Dallas campuses using synchronous videoconferencing, and in spring 2006, to 78 students located on the same 3 campuses using asynchronous video streaming. A course evaluation was administered to each group at the end of the courses. Results Students in the asynchronous setting had final course grades of 89% ± 7% compared to the mean final course grade of 87% ± 7% in the synchronous group (p = 0.05). Regardless of which technology was used, average course grades did not differ significantly among the 3 campus sites. Significantly more of the students in the asynchronous setting agreed (57%) with the statement that they could read the lecture notes and absorb the content on their own without attending the class than students in the synchronous class (23%; chi-square test; p < 0.001). Conclusions Students in both asynchronous and synchronous settings performed well. However, students taught using asynchronous videotaped lectures had lower satisfaction with the method of content delivery, and preferred live interactive sessions or a mix of interactive sessions and asynchronous videos over delivery of content using the synchronous or asynchronous method alone. PMID:17429516
Optimization and validation of accelerated golden-angle radial sparse MRI reconstruction with self-calibrating GRAPPA operator gridding.

PubMed

Benkert, Thomas; Tian, Ye; Huang, Chenchan; DiBella, Edward V R; Chandarana, Hersh; Feng, Li

2018-07-01

Golden-angle radial sparse parallel (GRASP) MRI reconstruction requires gridding and regridding to transform data between radial and Cartesian k-space. These operations are repeatedly performed in each iteration, which makes the reconstruction computationally demanding. This work aimed to accelerate GRASP reconstruction using self-calibrating GRAPPA operator gridding (GROG) and to validate its performance in clinical imaging. GROG is an alternative gridding approach based on parallel imaging, in which k-space data acquired on a non-Cartesian grid are shifted onto a Cartesian k-space grid using information from multicoil arrays. For iterative non-Cartesian image reconstruction, GROG is performed only once as a preprocessing step. Therefore, the subsequent iterative reconstruction can be performed directly in Cartesian space, which significantly reduces computational burden. Here, a framework combining GROG with GRASP (GROG-GRASP) is first optimized and then compared with standard GRASP reconstruction in 22 prostate patients. GROG-GRASP achieved approximately 4.2-fold reduction in reconstruction time compared with GRASP (∼333 min versus ∼78 min) while maintaining image quality (structural similarity index ≈ 0.97 and root mean square error ≈ 0.007). Visual image quality assessment by two experienced radiologists did not show significant differences between the two reconstruction schemes. With a graphics processing unit implementation, image reconstruction time can be further reduced to approximately 14 min. The GRASP reconstruction can be substantially accelerated using GROG. This framework is promising toward broader clinical application of GRASP and other iterative non-Cartesian reconstruction methods. Magn Reson Med 80:286-293, 2018. © 2017 International Society for Magnetic Resonance in Medicine. © 2017 International Society for Magnetic Resonance in Medicine.
Analysis of surface wave propagation in a grounded dielectric slab covered by a resistive sheet

NASA Technical Reports Server (NTRS)

Shively, David G.

1992-01-01

Both parallel and perpendicular polarized surface waves are known to propagate on lossless and lossy grounded dielectric slabs. Surface wave propagation on a grounded dielectric slab covered with a resistive sheet is considered. Both parallel and perpendicular polarizations are examined. Transcendental equations are derived for each polarization and are solved using iterative techniques. Attenuation and phase velocity are shown for representative geometries. The results are applicable to both a grounded slab with a resistive sheet and an ungrounded slab covered on each side with a resistive sheet.
Robust parallel iterative solvers for linear and least-squares problems, Final Technical Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Saad, Yousef

2014-01-16

The primary goal of this project is to study and develop robust iterative methods for solving linear systems of equations and least squares systems. The focus of the Minnesota team is on algorithms development, robustness issues, and on tests and validation of the methods on realistic problems. 1. The project begun with an investigation on how to practically update a preconditioner obtained from an ILU-type factorization, when the coefficient matrix changes. 2. We investigated strategies to improve robustness in parallel preconditioners in a specific case of a PDE with discontinuous coefficients. 3. We explored ways to adapt standard preconditioners formore » solving linear systems arising from the Helmholtz equation. These are often difficult linear systems to solve by iterative methods. 4. We have also worked on purely theoretical issues related to the analysis of Krylov subspace methods for linear systems. 5. We developed an effective strategy for performing ILU factorizations for the case when the matrix is highly indefinite. The strategy uses shifting in some optimal way. The method was extended to the solution of Helmholtz equations by using complex shifts, yielding very good results in many cases. 6. We addressed the difficult problem of preconditioning sparse systems of equations on GPUs. 7. A by-product of the above work is a software package consisting of an iterative solver library for GPUs based on CUDA. This was made publicly available. It was the first such library that offers complete iterative solvers for GPUs. 8. We considered another form of ILU which blends coarsening techniques from Multigrid with algebraic multilevel methods. 9. We have released a new version on our parallel solver - called pARMS [new version is version 3]. As part of this we have tested the code in complex settings - including the solution of Maxwell and Helmholtz equations and for a problem of crystal growth.10. As an application of polynomial preconditioning we considered the problem of evaluating f(A)v which arises in statistical sampling. 11. As an application to the methods we developed, we tackled the problem of computing the diagonal of the inverse of a matrix. This arises in statistical applications as well as in many applications in physics. We explored probing methods as well as domain-decomposition type methods. 12. A collaboration with researchers from Toulouse, France, considered the important problem of computing the Schur complement in a domain-decomposition approach. 13. We explored new ways of preconditioning linear systems, based on low-rank approximations.« less
Building asynchronous geospatial processing workflows with web services

NASA Astrophysics Data System (ADS)

Zhao, Peisheng; Di, Liping; Yu, Genong

2012-02-01

Geoscience research and applications often involve a geospatial processing workflow. This workflow includes a sequence of operations that use a variety of tools to collect, translate, and analyze distributed heterogeneous geospatial data. Asynchronous mechanisms, by which clients initiate a request and then resume their processing without waiting for a response, are very useful for complicated workflows that take a long time to run. Geospatial contents and capabilities are increasingly becoming available online as interoperable Web services. This online availability significantly enhances the ability to use Web service chains to build distributed geospatial processing workflows. This paper focuses on how to orchestrate Web services for implementing asynchronous geospatial processing workflows. The theoretical bases for asynchronous Web services and workflows, including asynchrony patterns and message transmission, are examined to explore different asynchronous approaches to and architecture of workflow code for the support of asynchronous behavior. A sample geospatial processing workflow, issued by the Open Geospatial Consortium (OGC) Web Service, Phase 6 (OWS-6), is provided to illustrate the implementation of asynchronous geospatial processing workflows and the challenges in using Web Services Business Process Execution Language (WS-BPEL) to develop them.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics

NASA Astrophysics Data System (ADS)

Wilson, B. D.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Loikith, P.; Lee, H.; McGibbney, L. J.; Whitehall, K. D.

2014-12-01

Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. The goals of SciSpark are to: (1) Decrease the time to compute comparison statistics and plots from minutes to seconds; (2) Allow for interactive exploration of time-series properties over seasons and years; (3) Decrease the time for satellite data ingestion into RCMES to hours; (4) Allow for Level-2 comparisons with higher-order statistics or PDF's in minutes to hours; and (5) Move RCMES into a near real time decision-making platform. We will report on: the architecture and design of SciSpark, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
Spike train generation and current-to-frequency conversion in silicon diodes

NASA Technical Reports Server (NTRS)

Coon, D. D.; Perera, A. G. U.

1989-01-01

A device physics model is developed to analyze spontaneous neuron-like spike train generation in current driven silicon p(+)-n-n(+) devices in cryogenic environments. The model is shown to explain the very high dynamic range (0 to the 7th) current-to-frequency conversion and experimental features of the spike train frequency as a function of input current. The devices are interesting components for implementation of parallel asynchronous processing adjacent to cryogenically cooled focal planes because of their extremely low current and power requirements, their electronic simplicity, and their pulse coding capability, and could be used to form the hardware basis for neural networks which employ biologically plausible means of information coding.
Real-Time Cognitive Computing Architecture for Data Fusion in a Dynamic Environment

NASA Technical Reports Server (NTRS)

Duong, Tuan A.; Duong, Vu A.

2012-01-01

A novel cognitive computing architecture is conceptualized for processing multiple channels of multi-modal sensory data streams simultaneously, and fusing the information in real time to generate intelligent reaction sequences. This unique architecture is capable of assimilating parallel data streams that could be analog, digital, synchronous/asynchronous, and could be programmed to act as a knowledge synthesizer and/or an "intelligent perception" processor. In this architecture, the bio-inspired models of visual pathway and olfactory receptor processing are combined as processing components, to achieve the composite function of "searching for a source of food while avoiding the predator." The architecture is particularly suited for scene analysis from visual data and odorant.

Pteros: fast and easy to use open-source C++ library for molecular analysis.

PubMed

Yesylevskyy, Semen O

2012-07-15

An open-source Pteros library for molecular modeling and analysis of molecular dynamics trajectories for C++ programming language is introduced. Pteros provides a number of routine analysis operations ranging from reading and writing trajectory files and geometry transformations to structural alignment and computation of nonbonded interaction energies. The library features asynchronous trajectory reading and parallel execution of several analysis routines, which greatly simplifies development of computationally intensive trajectory analysis algorithms. Pteros programming interface is very simple and intuitive while the source code is well documented and easily extendible. Pteros is available for free under open-source Artistic License from http://sourceforge.net/projects/pteros/. Copyright © 2012 Wiley Periodicals, Inc.
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation

NASA Technical Reports Server (NTRS)

Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.

1996-01-01

We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and, economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.
Variable aperture-based ptychographical iterative engine method.

PubMed

Sun, Aihui; Kong, Yan; Meng, Xin; He, Xiaoliang; Du, Ruijun; Jiang, Zhilong; Liu, Fei; Xue, Liang; Wang, Shouyu; Liu, Cheng

2018-02-01

A variable aperture-based ptychographical iterative engine (vaPIE) is demonstrated both numerically and experimentally to reconstruct the sample phase and amplitude rapidly. By adjusting the size of a tiny aperture under the illumination of a parallel light beam to change the illumination on the sample step by step and recording the corresponding diffraction patterns sequentially, both the sample phase and amplitude can be faithfully reconstructed with a modified ptychographical iterative engine (PIE) algorithm. Since many fewer diffraction patterns are required than in common PIE and the shape, the size, and the position of the aperture need not to be known exactly, this proposed vaPIE method remarkably reduces the data acquisition time and makes PIE less dependent on the mechanical accuracy of the translation stage; therefore, the proposed technique can be potentially applied for various scientific researches. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
On iterative processes in the Krylov-Sonneveld subspaces

NASA Astrophysics Data System (ADS)

Ilin, Valery P.

2016-10-01

The iterative Induced Dimension Reduction (IDR) methods are considered for solving large systems of linear algebraic equations (SLAEs) with nonsingular nonsymmetric matrices. These approaches are investigated by many authors and are charachterized sometimes as the alternative to the classical processes of Krylov type. The key moments of the IDR algorithms consist in the construction of the embedded Sonneveld subspaces, which have the decreasing dimensions and use the orthogonalization to some fixed subspace. Other independent approaches for research and optimization of the iterations are based on the augmented and modified Krylov subspaces by using the aggregation and deflation procedures with present various low rank approximations of the original matrices. The goal of this paper is to show, that IDR method in Sonneveld subspaces present an original interpretation of the modified algorithms in the Krylov subspaces. In particular, such description is given for the multi-preconditioned semi-conjugate direction methods which are actual for the parallel algebraic domain decomposition approaches.
Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems

DOE PAGES

Pratapa, Phanisri P.; Suryanarayana, Phanish; Pask, John E.

2015-12-01

We employ Anderson extrapolation to accelerate the classical Jacobi iterative method for large, sparse linear systems. Specifically, we utilize extrapolation at periodic intervals within the Jacobi iteration to develop the Alternating Anderson–Jacobi (AAJ) method. We verify the accuracy and efficacy of AAJ in a range of test cases, including nonsymmetric systems of equations. We demonstrate that AAJ possesses a favorable scaling with system size that is accompanied by a small prefactor, even in the absence of a preconditioner. In particular, we show that AAJ is able to accelerate the classical Jacobi iteration by over four orders of magnitude, with speed-upsmore » that increase as the system gets larger. Moreover, we find that AAJ significantly outperforms the Generalized Minimal Residual (GMRES) method in the range of problems considered here, with the relative performance again improving with size of the system. As a result, the proposed method represents a simple yet efficient technique that is particularly attractive for large-scale parallel solutions of linear systems of equations.« less
Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pratapa, Phanisri P.; Suryanarayana, Phanish; Pask, John E.

We employ Anderson extrapolation to accelerate the classical Jacobi iterative method for large, sparse linear systems. Specifically, we utilize extrapolation at periodic intervals within the Jacobi iteration to develop the Alternating Anderson–Jacobi (AAJ) method. We verify the accuracy and efficacy of AAJ in a range of test cases, including nonsymmetric systems of equations. We demonstrate that AAJ possesses a favorable scaling with system size that is accompanied by a small prefactor, even in the absence of a preconditioner. In particular, we show that AAJ is able to accelerate the classical Jacobi iteration by over four orders of magnitude, with speed-upsmore » that increase as the system gets larger. Moreover, we find that AAJ significantly outperforms the Generalized Minimal Residual (GMRES) method in the range of problems considered here, with the relative performance again improving with size of the system. As a result, the proposed method represents a simple yet efficient technique that is particularly attractive for large-scale parallel solutions of linear systems of equations.« less
Gauss-Seidel Iterative Method as a Real-Time Pile-Up Solver of Scintillation Pulses

NASA Astrophysics Data System (ADS)

Novak, Roman; Vencelj, Matja¿

2009-12-01

The pile-up rejection in nuclear spectroscopy has been confronted recently by several pile-up correction schemes that compensate for distortions of the signal and subsequent energy spectra artifacts as the counting rate increases. We study here a real-time capability of the event-by-event correction method, which at the core translates to solving many sets of linear equations. Tight time limits and constrained front-end electronics resources make well-known direct solvers inappropriate. We propose a novel approach based on the Gauss-Seidel iterative method, which turns out to be a stable and cost-efficient solution to improve spectroscopic resolution in the front-end electronics. We show the method convergence properties for a class of matrices that emerge in calorimetric processing of scintillation detector signals and demonstrate the ability of the method to support the relevant resolutions. The sole iteration-based error component can be brought below the sliding window induced errors in a reasonable number of iteration steps, thus allowing real-time operation. An area-efficient hardware implementation is proposed that fully utilizes the method's inherent parallelism.
Iterated reaction graphs: simulating complex Maillard reaction pathways.

PubMed

Patel, S; Rabone, J; Russell, S; Tissen, J; Klaffke, W

2001-01-01

This study investigates a new method of simulating a complex chemical system including feedback loops and parallel reactions. The practical purpose of this approach is to model the actual reactions that take place in the Maillard process, a set of food browning reactions, in sufficient detail to be able to predict the volatile composition of the Maillard products. The developed framework, called iterated reaction graphs, consists of two main elements: a soup of molecules and a reaction base of Maillard reactions. An iterative process loops through the reaction base, taking reactants from and feeding products back to the soup. This produces a reaction graph, with molecules as nodes and reactions as arcs. The iterated reaction graph is updated and validated by comparing output with the main products found by classical gas-chromatographic/mass spectrometric analysis. To ensure a realistic output and convergence to desired volatiles only, the approach contains a number of novel elements: rate kinetics are treated as reaction probabilities; only a subset of the true chemistry is modeled; and the reactions are blocked into groups.
Efficient parallel implicit methods for rotary-wing aerodynamics calculations

NASA Astrophysics Data System (ADS)

Wissink, Andrew M.

Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good parallel performance on the IBM SP2, with OS-GCR giving slightly better performance than GMRES on large numbers of processors. For steady and quasi-steady calculations, the convergence rate is accelerated but the overall solution time remains about the same as the standard hybrid LU-SGS scheme. For unsteady calculations, however, the Newton method maintains a higher degree of time-accuracy which allows tbe use of larger timesteps and results in CPU savings of 20-35%.
Use of general purpose graphics processing units with MODFLOW

USGS Publications Warehouse

Hughes, Joseph D.; White, Jeremy T.

2013-01-01

To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the central processing unit CPU and GPCPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing

NASA Astrophysics Data System (ADS)

Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng

2018-02-01

De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
Using the Statecharts paradigm for simulation of patient flow in surgical care.

PubMed

Sobolev, Boris; Harel, David; Vasilakis, Christos; Levy, Adrian

2008-03-01

Computer simulation of patient flow has been used extensively to assess the impacts of changes in the management of surgical care. However, little research is available on the utility of existing modeling techniques. The purpose of this paper is to examine the capacity of Statecharts, a system of graphical specification, for constructing a discrete-event simulation model of the perioperative process. The Statecharts specification paradigm was originally developed for representing reactive systems by extending the formalism of finite-state machines through notions of hierarchy, parallelism, and event broadcasting. Hierarchy permits subordination between states so that one state may contain other states. Parallelism permits more than one state to be active at any given time. Broadcasting of events allows one state to detect changes in another state. In the context of the peri-operative process, hierarchy provides the means to describe steps within activities and to cluster related activities, parallelism provides the means to specify concurrent activities, and event broadcasting provides the means to trigger a series of actions in one activity according to transitions that occur in another activity. Combined with hierarchy and parallelism, event broadcasting offers a convenient way to describe the interaction of concurrent activities. We applied the Statecharts formalism to describe the progress of individual patients through surgical care as a series of asynchronous updates in patient records generated in reaction to events produced by parallel finite-state machines representing concurrent clinical and managerial activities. We conclude that Statecharts capture successfully the behavioral aspects of surgical care delivery by specifying permissible chronology of events, conditions, and actions.
Parallelization of a blind deconvolution algorithm

NASA Astrophysics Data System (ADS)

Matson, Charles L.; Borelli, Kathy J.

2006-09-01

Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.
Synchronization of Hierarchical Time-Varying Neural Networks Based on Asynchronous and Intermittent Sampled-Data Control.

PubMed

Xiong, Wenjun; Patel, Ragini; Cao, Jinde; Zheng, Wei Xing

In this brief, our purpose is to apply asynchronous and intermittent sampled-data control methods to achieve the synchronization of hierarchical time-varying neural networks. The asynchronous and intermittent sampled-data controllers are proposed for two reasons: 1) the controllers may not transmit the control information simultaneously and 2) the controllers cannot always exist at any time . The synchronization is then discussed for a kind of hierarchical time-varying neural networks based on the asynchronous and intermittent sampled-data controllers. Finally, the simulation results are given to illustrate the usefulness of the developed criteria.In this brief, our purpose is to apply asynchronous and intermittent sampled-data control methods to achieve the synchronization of hierarchical time-varying neural networks. The asynchronous and intermittent sampled-data controllers are proposed for two reasons: 1) the controllers may not transmit the control information simultaneously and 2) the controllers cannot always exist at any time . The synchronization is then discussed for a kind of hierarchical time-varying neural networks based on the asynchronous and intermittent sampled-data controllers. Finally, the simulation results are given to illustrate the usefulness of the developed criteria.
Digital Synchronizer without Metastability

NASA Technical Reports Server (NTRS)

Simle, Robert M.; Cavazos, Jose A.

2009-01-01

A proposed design for a digital synchronizing circuit would eliminate metastability that plagues flip-flop circuits in digital input/output interfaces. This metastability is associated with sampling, by use of flip-flops, of an external signal that is asynchronous with a clock signal that drives the flip-flops: it is a temporary flip-flop failure that can occur when a rising or falling edge of an asynchronous signal occurs during the setup and/or hold time of a flip-flop. The proposed design calls for (1) use of a clock frequency greater than the frequency of the asynchronous signal, (2) use of flip-flop asynchronous preset or clear signals for the asynchronous input, (3) use of a clock asynchronous recovery delay with pulse width discriminator, and (4) tying the data inputs to constant logic levels to obtain (5) two half-rate synchronous partial signals - one for the falling and one for the rising edge. Inasmuch as the flip-flop data inputs would be permanently tied to constant logic levels, setup and hold times would not be violated. The half-rate partial signals would be recombined to construct a signal that would replicate the original asynchronous signal at its original rate but would be synchronous with the clock signal.
The Effects of Asynchronous Visual Delays on Simulator Flight Performance and the Development of Simulator Sickness Symptomatology

DTIC Science & Technology

1986-12-26

NAVAL TRAINING SYSTEMS CENTER ORLANDO. FLORIDA IT FILE COPY THE EFFECTS OF ASYNCHRONOUS VISUAL DELAYS ON SIMULATOR FLIGHT PERFORMANCE AND THE...ASYNCHRONOUS VISUAL. DELAYS ON SI.WLATOR FLIGHT PERF OMANCE AND THE DEVELOPMENT OF SIMLATOR SICKNESS SYMPTOMATOLOGY K. C. Uliano, E. Y. Lambert, R. S. Kennedy...ACCESSION NO. N63733N SP-01 0785-7P6 I. 4780 11. TITLE (Include Security Classification) The Effects of Asynchronous Visual Delays on Simulator Flight
Optimal Iterative Task Scheduling for Parallel Simulations.

DTIC Science & Technology

1991-03-01

State University, Pullman, Washington. November 1976. 19. Grimaldi , Ralph P . Discrete and Combinatorial Mathematics. Addison-Wesley. June 1989. 20...2 4.8.1 Problem Description .. .. .. .. ... .. ... .... 4-25 4.8.2 Reasons for Level-Strate- p Failure. .. .. .. .. ... 4-26...f- I CA A* overview................................ C-1 C .2 Sample A* r......................... .... C-I C-3 Evaluation P
Non-iterative Voltage Stability

DOE Office of Scientific and Technical Information (OSTI.GOV)

Makarov, Yuri V.; Vyakaranam, Bharat; Hou, Zhangshuan

2014-09-30

This report demonstrates promising capabilities and performance characteristics of the proposed method using several power systems models. The new method will help to develop a new generation of highly efficient tools suitable for real-time parallel implementation. The ultimate benefit obtained will be early detection of system instability and prevention of system blackouts in real time.
Modeling and Analysis of Mixed Synchronous/Asynchronous Systems

NASA Technical Reports Server (NTRS)

Driscoll, Kevin R.; Madl. Gabor; Hall, Brendan

2012-01-01

Practical safety-critical distributed systems must integrate safety critical and non-critical data in a common platform. Safety critical systems almost always consist of isochronous components that have synchronous or asynchronous interface with other components. Many of these systems also support a mix of synchronous and asynchronous interfaces. This report presents a study on the modeling and analysis of asynchronous, synchronous, and mixed synchronous/asynchronous systems. We build on the SAE Architecture Analysis and Design Language (AADL) to capture architectures for analysis. We present preliminary work targeted to capture mixed low- and high-criticality data, as well as real-time properties in a common Model of Computation (MoC). An abstract, but representative, test specimen system was created as the system to be modeled.
A novel comparator featured with input data characteristic

NASA Astrophysics Data System (ADS)

Jiang, Xiaobo; Ye, Desheng; Xu, Xiangmin; Zheng, Shuai

2016-03-01

Two types of low-power asynchronous comparators featured with input data statistical characteristic are proposed in this article. The asynchronous ripple comparator stops comparing at the first unequal bit but delivers the result to the least significant bit. The pre-stop asynchronous comparator can completely stop comparing and obtain results immediately. The proposed and contrastive comparators were implemented in SMIC 0.18 μm process with different bit widths. Simulation shows that the proposed pre-stop asynchronous comparator features the lowest power consumption, shortest average propagation delay and highest area efficiency among the comparators. Data path of low-density parity check decoder using the proposed pre-stop asynchronous comparators are most power efficient compared with other data paths with synthesised, clock gating and bitwise competition logic comparators.

Scalable Nonlinear Solvers for Fully Implicit Coupled Nuclear Fuel Modeling. Final Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cai, Xiao-Chuan; Keyes, David; Yang, Chao

2014-09-29

The focus of the project is on the development and customization of some highly scalable domain decomposition based preconditioning techniques for the numerical solution of nonlinear, coupled systems of partial differential equations (PDEs) arising from nuclear fuel simulations. These high-order PDEs represent multiple interacting physical fields (for example, heat conduction, oxygen transport, solid deformation), each is modeled by a certain type of Cahn-Hilliard and/or Allen-Cahn equations. Most existing approaches involve a careful splitting of the fields and the use of field-by-field iterations to obtain a solution of the coupled problem. Such approaches have many advantages such as ease of implementationmore » since only single field solvers are needed, but also exhibit disadvantages. For example, certain nonlinear interactions between the fields may not be fully captured, and for unsteady problems, stable time integration schemes are difficult to design. In addition, when implemented on large scale parallel computers, the sequential nature of the field-by-field iterations substantially reduces the parallel efficiency. To overcome the disadvantages, fully coupled approaches have been investigated in order to obtain full physics simulations.« less
Turbo Trellis Coded Modulation With Iterative Decoding for Mobile Satellite Communications

NASA Technical Reports Server (NTRS)

Divsalar, D.; Pollara, F.

1997-01-01

In this paper, analytical bounds on the performance of parallel concatenation of two codes, known as turbo codes, and serial concatenation of two codes over fading channels are obtained. Based on this analysis, design criteria for the selection of component trellis codes for MPSK modulation, and a suitable bit-by-bit iterative decoding structure are proposed. Examples are given for throughput of 2 bits/sec/Hz with 8PSK modulation. The parallel concatenation example uses two rate 4/5 8-state convolutional codes with two interleavers. The convolutional codes' outputs are then mapped to two 8PSK modulations. The serial concatenated code example uses an 8-state outer code with rate 4/5 and a 4-state inner trellis code with 5 inputs and 2 x 8PSK outputs per trellis branch. Based on the above mentioned design criteria for fading channels, a method to obtain he structure of the trellis code with maximum diversity is proposed. Simulation results are given for AWGN and an independent Rayleigh fading channel with perfect Channel State Information (CSI).
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

PubMed

Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

2002-08-07

Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for four dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes) reducing the time for one core iteration from over 7 h to about 35 min.
The novel implicit LU-SGS parallel iterative method based on the diffusion equation of a nuclear reactor on a GPU cluster

NASA Astrophysics Data System (ADS)

Zhang, Jilin; Sha, Chaoqun; Wu, Yusen; Wan, Jian; Zhou, Li; Ren, Yongjian; Si, Huayou; Yin, Yuyu; Jing, Ya

2017-02-01

GPU not only is used in the field of graphic technology but also has been widely used in areas needing a large number of numerical calculations. In the energy industry, because of low carbon, high energy density, high duration and other characteristics, the development of nuclear energy cannot easily be replaced by other energy sources. Management of core fuel is one of the major areas of concern in a nuclear power plant, and it is directly related to the economic benefits and cost of nuclear power. The large-scale reactor core expansion equation is large and complicated, so the calculation of the diffusion equation is crucial in the core fuel management process. In this paper, we use CUDA programming technology on a GPU cluster to run the LU-SGS parallel iterative calculation against the background of the diffusion equation of the reactor. We divide one-dimensional and two-dimensional mesh into a plurality of domains, with each domain evenly distributed on the GPU blocks. A parallel collision scheme is put forward that defines the virtual boundary of the grid exchange information and data transmission by non-stop collision. Compared with the serial program, the experiment shows that GPU greatly improves the efficiency of program execution and verifies that GPU is playing a much more important role in the field of numerical calculations.
Interactional Coherence in Asynchronous Learning Networks: A Rhetorical Approach

ERIC Educational Resources Information Center

Potter, Andrew

2008-01-01

Numerous studies have affirmed the value of asynchronous online communication as a learning resource. Several investigations, however, have indicated that discussions in asynchronous environments are often neither interactive nor coherent. The research reported sought to develop an enhanced understanding of interactional coherence, argumentation,…
A system to build distributed multivariate models and manage disparate data sharing policies: implementation in the scalable national network for effectiveness research.

PubMed

Meeker, Daniella; Jiang, Xiaoqian; Matheny, Michael E; Farcas, Claudiu; D'Arcy, Michel; Pearlman, Laura; Nookala, Lavanya; Day, Michele E; Kim, Katherine K; Kim, Hyeoneui; Boxwala, Aziz; El-Kareh, Robert; Kuo, Grace M; Resnic, Frederic S; Kesselman, Carl; Ohno-Machado, Lucila

2015-11-01

Centralized and federated models for sharing data in research networks currently exist. To build multivariate data analysis for centralized networks, transfer of patient-level data to a central computation resource is necessary. The authors implemented distributed multivariate models for federated networks in which patient-level data is kept at each site and data exchange policies are managed in a study-centric manner. The objective was to implement infrastructure that supports the functionality of some existing research networks (e.g., cohort discovery, workflow management, and estimation of multivariate analytic models on centralized data) while adding additional important new features, such as algorithms for distributed iterative multivariate models, a graphical interface for multivariate model specification, synchronous and asynchronous response to network queries, investigator-initiated studies, and study-based control of staff, protocols, and data sharing policies. Based on the requirements gathered from statisticians, administrators, and investigators from multiple institutions, the authors developed infrastructure and tools to support multisite comparative effectiveness studies using web services for multivariate statistical estimation in the SCANNER federated network. The authors implemented massively parallel (map-reduce) computation methods and a new policy management system to enable each study initiated by network participants to define the ways in which data may be processed, managed, queried, and shared. The authors illustrated the use of these systems among institutions with highly different policies and operating under different state laws. Federated research networks need not limit distributed query functionality to count queries, cohort discovery, or independently estimated analytic models. Multivariate analyses can be efficiently and securely conducted without patient-level data transport, allowing institutions with strict local data storage requirements to participate in sophisticated analyses based on federated research networks. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Development Of A Parallel Performance Model For The THOR Neutral Particle Transport Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yessayan, Raffi; Azmy, Yousry; Schunert, Sebastian

The THOR neutral particle transport code enables simulation of complex geometries for various problems from reactor simulations to nuclear non-proliferation. It is undergoing a thorough V&V requiring computational efficiency. This has motivated various improvements including angular parallelization, outer iteration acceleration, and development of peripheral tools. For guiding future improvements to the code’s efficiency, better characterization of its parallel performance is useful. A parallel performance model (PPM) can be used to evaluate the benefits of modifications and to identify performance bottlenecks. Using INL’s Falcon HPC, the PPM development incorporates an evaluation of network communication behavior over heterogeneous links and a functionalmore » characterization of the per-cell/angle/group runtime of each major code component. After evaluating several possible sources of variability, this resulted in a communication model and a parallel portion model. The former’s accuracy is bounded by the variability of communication on Falcon while the latter has an error on the order of 1%.« less
Parallelization of the preconditioned IDR solver for modern multicore computer systems

NASA Astrophysics Data System (ADS)

Bessonov, O. A.; Fedoseyev, A. I.

2012-10-01

This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
Labeled Postings for Asynchronous Interaction

ERIC Educational Resources Information Center

ChanLin, Lih-Juan; Chen, Yong-Ting; Chan, Kung-Chi

2009-01-01

The Internet promotes computer-mediated communications, and so asynchronous learning network systems permit more flexibility in time, space, and interaction than synchronous mode of learning. The key point of asynchronous learning is the materials for web-aided teaching and the flow of knowledge. This research focuses on improving online…
An Asynchronous Augmentation to Traditional Course Delivery.

ERIC Educational Resources Information Center

Wolverton, Marvin L.; Wolverton, Mimi

Asynchronous augmentation facilitates distributed learning, which relies heavily on technology and self-learning. This paper reports the results of delivering a real estate principles course using an asynchronous course delivery format. It highlights one of many ways to enhance learning using technology, and it provides information concerning how…
A Taxonomy of Learning through Asynchronous Discussion

ERIC Educational Resources Information Center

Knowlton, Dave S.

2005-01-01

This article presents a five-tiered taxonomy that describes the nature of participation in, and learning through, asynchronous discussion. The taxonomy is framed by a constructivist view of asynchronous discussion. The five tiers of the taxonomy include the following: (a) passive participation, (b) developmental participation, (c) generative…
Designing Asynchronous Online Discussion Environments: Recent Progress and Possible Future Directions

ERIC Educational Resources Information Center

Gao, Fei; Zhang, Tianyi; Franklin, Teresa

2013-01-01

Asynchronous online discussion environments are important platforms to support learning. Research suggests, however, threaded forums, one of the most popular asynchronous discussion environments, do not often foster productive online discussions naturally. This paper explores how certain properties of threaded forums have affected or constrained…
Integrating Asynchronous Digital Design Into the Computer Engineering Curriculum

ERIC Educational Resources Information Center

Smith, S. C.; Al-Assadi, W. K.; Di, J.

2010-01-01

As demand increases for circuits with higher performance, higher complexity, and decreased feature size, asynchronous (clockless) paradigms will become more widely used in the semiconductor industry, as evidenced by the International Technology Roadmap for Semiconductors' (ITRS) prediction of a likely shift from synchronous to asynchronous design…
Evaluation of the effects of patient arm attenuation in SPECT cardiac perfusion imaging

NASA Astrophysics Data System (ADS)

Luo, Dershan; King, M. A.; Pan, Tin-Su; Xia, Weishi

1996-12-01

It was hypothesized that the use of attenuation correction could compensate for degradation in the uniformity of apparent localization of imaging agents seen in cardiac walls when patients are imaged with arms at their sides. Noise-free simulations of the digital MCAT phantom were employed to investigate this hypothesis. Four variations in camera size and collimation scheme were investigated. We observed that: 1) without attenuation correction, the arms had little additional influences on the uniformity of the heart for 180/spl deg/ reconstructions and caused a small increase in nonuniformity for 360/spl deg/ reconstructions, where the impact of both arms was included; 2) change in patient size had more of an impact on count uniformity than the presence of the arms, either with or without attenuation correction; 3) for a low number of iterations and large patient size, slightly better uniformity was obtained from parallel emission data than from fan-beam emission data, independent of whether parallel or fan-beam transmission data was used to reconstruct the attenuation maps; and 4) for all camera configurations, uniformity was improved with attenuation correction and, given sufficient number of iterations, it was compatible among different imaging geometry combinations. Thus, iterative algorithms can compensate for the additional attenuation imposed by larger patients or having the arms on the sides. When the arms are at the sides of the patient, however, a larger radius of rotation may be required, resulting in decreased spatial resolution.
Evaluation of the effects of patient arm attenuation in SPECT cardiac perfusion imaging

DOE Office of Scientific and Technical Information (OSTI.GOV)

Luo, D.; King, M.A.; Pan, T.S.

1996-12-01

It was hypothesized that the use of attenuation correction could compensate for degradation in the uniformity of apparent localization of imaging agents seen in cardiac walls when patients are imaged with arms at their sides. Noise-free simulations of the digital MCAT phantom were employed to investigate this hypothesis. Four variations in camera size and collimation scheme were investigated. The authors observed that: (1) without attenuation correction, the arms had little additional influences on the uniformity of the heart for 180{degree} reconstructions and caused a small increase in nonuniformity for 360{degree} reconstructions, where the impact of both arms was included; (2)more » change in patient size had more of an impact on count uniformity than the presence of the arms, either with or without attenuation correction; (3) for a low number of iterations and large patient size, slightly better uniformity was obtained from parallel emission data than from fan-beam emission data, independent of whether parallel or fan-beam transmission data was used to reconstruct the attenuation maps; and (4) for all camera configurations, uniformity was improved with attenuation correction and, given sufficient number of iterations, it was compatible among different imaging geometry combinations. Thus, iterative algorithms can compensate for the additional attenuation imposed by larger patients or having the arms on the sides. When the arms are at the sides of the patient, however, a larger radius of rotation may be required, resulting in decreased spatial resolution.« less
Improving the iterative Linear Interaction Energy approach using automated recognition of configurational transitions.

PubMed

Vosmeer, C Ruben; Kooi, Derk P; Capoferri, Luigi; Terpstra, Margreet M; Vermeulen, Nico P E; Geerke, Daan P

2016-01-01

Recently an iterative method was proposed to enhance the accuracy and efficiency of ligand-protein binding affinity prediction through linear interaction energy (LIE) theory. For ligand binding to flexible Cytochrome P450s (CYPs), this method was shown to decrease the root-mean-square error and standard deviation of error prediction by combining interaction energies of simulations starting from different conformations. Thereby, different parts of protein-ligand conformational space are sampled in parallel simulations. The iterative LIE framework relies on the assumption that separate simulations explore different local parts of phase space, and do not show transitions to other parts of configurational space that are already covered in parallel simulations. In this work, a method is proposed to (automatically) detect such transitions during the simulations that are performed to construct LIE models and to predict binding affinities. Using noise-canceling techniques and splines to fit time series of the raw data for the interaction energies, transitions during simulation between different parts of phase space are identified. Boolean selection criteria are then applied to determine which parts of the interaction energy trajectories are to be used as input for the LIE calculations. Here we show that this filtering approach benefits the predictive quality of our previous CYP 2D6-aryloxypropanolamine LIE model. In addition, an analysis is performed of the gain in computational efficiency that can be obtained from monitoring simulations using the proposed filtering method and by prematurely terminating simulations accordingly.
A new Green's function Monte Carlo algorithm for the solution of the two-dimensional nonlinear Poisson–Boltzmann equation: Application to the modeling of the communication breakdown problem in space vehicles during re-entry

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chatterjee, Kausik, E-mail: kausik.chatterjee@aggiemail.usu.edu; Center for Atmospheric and Space Sciences, Utah State University, Logan, UT 84322; Roadcap, John R., E-mail: john.roadcap@us.af.mil

The objective of this paper is the exposition of a recently-developed, novel Green's function Monte Carlo (GFMC) algorithm for the solution of nonlinear partial differential equations and its application to the modeling of the plasma sheath region around a cylindrical conducting object, carrying a potential and moving at low speeds through an otherwise neutral medium. The plasma sheath is modeled in equilibrium through the GFMC solution of the nonlinear Poisson–Boltzmann (NPB) equation. The traditional Monte Carlo based approaches for the solution of nonlinear equations are iterative in nature, involving branching stochastic processes which are used to calculate linear functionals ofmore » the solution of nonlinear integral equations. Over the last several years, one of the authors of this paper, K. Chatterjee has been developing a philosophically-different approach, where the linearization of the equation of interest is not required and hence there is no need for iteration and the simulation of branching processes. Instead, an approximate expression for the Green's function is obtained using perturbation theory, which is used to formulate the random walk equations within the problem sub-domains where the random walker makes its walks. However, as a trade-off, the dimensions of these sub-domains have to be restricted by the limitations imposed by perturbation theory. The greatest advantage of this approach is the ease and simplicity of parallelization stemming from the lack of the need for iteration, as a result of which the parallelization procedure is identical to the parallelization procedure for the GFMC solution of a linear problem. The application area of interest is in the modeling of the communication breakdown problem during a space vehicle's re-entry into the atmosphere. However, additional application areas are being explored in the modeling of electromagnetic propagation through the atmosphere/ionosphere in UHF/GPS applications.« less
A new Green's function Monte Carlo algorithm for the solution of the two-dimensional nonlinear Poisson-Boltzmann equation: Application to the modeling of the communication breakdown problem in space vehicles during re-entry

NASA Astrophysics Data System (ADS)

Chatterjee, Kausik; Roadcap, John R.; Singh, Surendra

2014-11-01

The objective of this paper is the exposition of a recently-developed, novel Green's function Monte Carlo (GFMC) algorithm for the solution of nonlinear partial differential equations and its application to the modeling of the plasma sheath region around a cylindrical conducting object, carrying a potential and moving at low speeds through an otherwise neutral medium. The plasma sheath is modeled in equilibrium through the GFMC solution of the nonlinear Poisson-Boltzmann (NPB) equation. The traditional Monte Carlo based approaches for the solution of nonlinear equations are iterative in nature, involving branching stochastic processes which are used to calculate linear functionals of the solution of nonlinear integral equations. Over the last several years, one of the authors of this paper, K. Chatterjee has been developing a philosophically-different approach, where the linearization of the equation of interest is not required and hence there is no need for iteration and the simulation of branching processes. Instead, an approximate expression for the Green's function is obtained using perturbation theory, which is used to formulate the random walk equations within the problem sub-domains where the random walker makes its walks. However, as a trade-off, the dimensions of these sub-domains have to be restricted by the limitations imposed by perturbation theory. The greatest advantage of this approach is the ease and simplicity of parallelization stemming from the lack of the need for iteration, as a result of which the parallelization procedure is identical to the parallelization procedure for the GFMC solution of a linear problem. The application area of interest is in the modeling of the communication breakdown problem during a space vehicle's re-entry into the atmosphere. However, additional application areas are being explored in the modeling of electromagnetic propagation through the atmosphere/ionosphere in UHF/GPS applications.
Parallel architectures for iterative methods on adaptive, block structured grids

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1983-01-01

A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one to one mapping of grids to systolic style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be a regular global structure to the grids constructed, there will be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.
Actively Engaging Students in Asynchronous Online Classes. IDEA Paper #64

ERIC Educational Resources Information Center

Riggs, Shannon A.; Linder, Kathryn E.

2016-01-01

Active learning activities and pedagogical strategies can look different in online learning environments, particularly in asynchronous courses when students are not interacting with the instructor, or with each other, in real time. This paper suggests a three-pronged approach for conceptualizing active learning in the online asynchronous class:…

Exploring Asynchronous and Synchronous Tool Use in Online Courses

ERIC Educational Resources Information Center

Oztok, Murat; Zingaro, Daniel; Brett, Clare; Hewitt, Jim

2013-01-01

While the independent contributions of synchronous and asynchronous interaction in online learning are clear, comparatively less is known about the pedagogical consequences of using both modes in the same environment. In this study, we examine relationships between students' use of asynchronous discussion forums and synchronous private messages…
Teaching Presence and Communication Timeliness in Asynchronous Online Courses

ERIC Educational Resources Information Center

Skramstad, Erik; Schlosser, Charles; Orellana, Anymir

2012-01-01

This study examined student perceptions of teaching presence and communication timeliness in asynchronous online courses. Garrison, Anderson, and Archer's (2000) community of inquiry model provided the framework for the survey research methodology used. Participants were 59 student volunteers taking 1 or more asynchronous online graduate courses.…
Two Studies Examining Argumentation in Asynchronous Computer Mediated Communication

ERIC Educational Resources Information Center

Joiner, Richard; Jones, Sarah; Doherty, John

2008-01-01

Asynchronous computer mediated communication (CMC) would seem to be an ideal medium for supporting development in student argumentation. This paper investigates this assumption through two studies. The first study compared asynchronous CMC with face-to-face discussions. The transactional and strategic level of the argumentation (i.e. measures of…
Using Television Sitcoms to Facilitate Asynchronous Discussions in the Online Communication Course

ERIC Educational Resources Information Center

Tolman, Elizabeth; Asbury, Bryan

2012-01-01

Asynchronous discussions are a useful instructional resource in the online communication course. In discussion groups students have the opportunity to actively participate and interact with students and the instructor. Asynchronous communication allows for flexibility because "participants can interact with significant amounts of time between…
Asynchronous Discussion Board Facilitation and Rubric Use in a Blended Learning Environment

ERIC Educational Resources Information Center

Giacumo, Lisa

2012-01-01

The purpose of this study was to investigate the effects of instructor response prompts and rubrics on students' performance in an asynchronous discussion-board assignment, their learning achievement on an objective-type posttest, and their reported satisfaction levels. Researchers who have studied asynchronous computer-mediated student…
Designing Asynchronous, Text-Based Computer Conferences: Ten Research-Based Suggestions

ERIC Educational Resources Information Center

Choitz, Paul; Lee, Doris

2006-01-01

Asynchronous computer conferencing refers to the use of computer software and a network enabling participants to post messages that allow discourse to continue even though interactions may be extended over days and weeks. Asynchronous conferences are time-independent, adapting to multiple time zones and learner schedules. Such activities as…
Asynchronous Learning Forums for Business Acculturation

ERIC Educational Resources Information Center

Pence, Christine Cope; Wulf, Catharina

2009-01-01

The use of IT as a facilitator for student collaboration in higher business education has grown rapidly since 2000. Asynchronous discussion forums are used abundantly for collaborative training purposes and for teaching students business-relevant tools for their future careers. This article presents an analysis of the asynchronous discussion forum…
Comparing face-to-face, synchronous, and asynchronous learning: postgraduate dental resident preferences.

PubMed

Kunin, Marc; Julliard, Kell N; Rodriguez, Tobias E

2014-06-01

The Department of Dental Medicine of Lutheran Medical Center has developed an asynchronous online curriculum consisting of prerecorded PowerPoint presentations with audio explanations. The focus of this study was to evaluate if the new asynchronous format satisfied the educational needs of the residents compared to traditional lecture (face-to-face) and synchronous (distance learning) formats. Lectures were delivered to 219 dental residents employing face-to-face and synchronous formats, as well as the new asynchronous format; 169 (77 percent) participated in the study. Outcomes were assessed with pretests, posttests, and individual lecture surveys. Results found the residents preferred face-to-face and asynchronous formats to the synchronous format in terms of effectiveness and clarity of presentations. This preference was directly related to the residents' perception of how well the technology worked in each format. The residents also rated the quality of student-instructor and student-student interactions in the synchronous and asynchronous formats significantly higher after taking the lecture series than they did before taking it. However, they rated the face-to-face format as significantly more conducive to student-instructor and student-student interaction. While the study found technology had a major impact on the efficacy of this curricular model, the results suggest that the asynchronous format can be an effective way to teach a postgraduate course.
Asynchronous glimpsing of speech: Spread of masking and task set-size

PubMed Central

Ozmeral, Erol J.; Buss, Emily; Hall, Joseph W.

2012-01-01

Howard-Jones and Rosen [(1993). J. Acoust. Soc. Am. 93, 2915–2922] investigated the ability to integrate glimpses of speech that are separated in time and frequency using a “checkerboard” masker, with asynchronous amplitude modulation (AM) across frequency. Asynchronous glimpsing was demonstrated only for spectrally wide frequency bands. It is possible that the reduced evidence of spectro-temporal integration with narrower bands was due to spread of masking at the periphery. The present study tested this hypothesis with a dichotic condition, in which the even- and odd-numbered bands of the target speech and asynchronous AM masker were presented to opposite ears, minimizing the deleterious effects of masking spread. For closed-set consonant recognition, thresholds were 5.1–8.5 dB better for dichotic than for monotic asynchronous AM conditions. Results were similar for closed-set word recognition, but for open-set word recognition the benefit of dichotic presentation was more modest and level dependent, consistent with the effects of spread of masking being level dependent. There was greater evidence of asynchronous glimpsing in the open-set than closed-set tasks. Presenting stimuli dichotically supported asynchronous glimpsing with narrower frequency bands than previously shown, though the magnitude of glimpsing was reduced for narrower bandwidths even in some dichotic conditions. PMID:22894234
Asynchronous vs didactic education: it's too early to throw in the towel on tradition.

PubMed

Jordan, Jaime; Jalali, Azadeh; Clarke, Samuel; Dyne, Pamela; Spector, Tahlia; Coates, Wendy

2013-08-08

Asynchronous, computer based instruction is cost effective, allows self-directed pacing and review, and addresses preferences of millennial learners. Current research suggests there is no significant difference in learning compared to traditional classroom instruction. Data are limited for novice learners in emergency medicine. The objective of this study was to compare asynchronous, computer-based instruction with traditional didactics for senior medical students during a week-long intensive course in acute care. We hypothesized both modalities would be equivalent. This was a prospective observational quasi-experimental study of 4th year medical students who were novice learners with minimal prior exposure to curricular elements. We assessed baseline knowledge with an objective pre-test. The curriculum was delivered in either traditional lecture format (shock, acute abdomen, dyspnea, field trauma) or via asynchronous, computer-based modules (chest pain, EKG interpretation, pain management, trauma). An interactive review covering all topics was followed by a post-test. Knowledge retention was measured after 10 weeks. Pre and post-test items were written by a panel of medical educators and validated with a reference group of learners. Mean scores were analyzed using dependent t-test and attitudes were assessed by a 5-point Likert scale. 44 of 48 students completed the protocol. Students initially acquired more knowledge from didactic education as demonstrated by mean gain scores (didactic: 28.39% ± 18.06; asynchronous 9.93% ± 23.22). Mean difference between didactic and asynchronous = 18.45% with 95% CI [10.40 to 26.50]; p = 0.0001. Retention testing demonstrated similar knowledge attrition: mean gain scores -14.94% (didactic); -17.61% (asynchronous), which was not significantly different: 2.68% ± 20.85, 95% CI [-3.66 to 9.02], p = 0.399. The attitudinal survey revealed that 60.4% of students believed the asynchronous modules were educational and 95.8% enjoyed the flexibility of the method. 39.6% of students preferred asynchronous education for required didactics; 37.5% were neutral; 23% preferred traditional lectures. Asynchronous, computer-based instruction was not equivalent to traditional didactics for novice learners of acute care topics. Interactive, standard didactic education was valuable. Retention rates were similar between instructional methods. Students had mixed attitudes toward asynchronous learning but enjoyed the flexibility. We urge caution in trading in traditional didactic lectures in favor of asynchronous education for novice learners in acute care.
Observations of the larval stages of Diceroprocta apache Davis (Homoptera: Tibicinidae)

USGS Publications Warehouse

Ellingson, A.R.; Andersen, D.C.; Kondratieff, B.C.

2002-01-01

Diceroprocta apache Davis is a locally abundant cicada in the riparian woodlands of the southwestern United States. While its ecological importance has often been hypothesized, very little is known of its specific life history. This paper presents preliminary information on life history of D. apache from larvae collected in the field at seasonal intervals as well as a smaller number of reared specimens. Morphological development of the fore-femoral comb closely parallels growth through distinct size classes. The data indicate the presence of five larval instars in D. apache. Development times from greenhouse-reared specimens suggest a 3-4 year life span and overlapping broods were present in the field. Sex ratios among pre-emergent larvae suggest the asynchronous emergence of sexes.
Method and apparatus for offloading compute resources to a flash co-processing appliance

DOEpatents

Tzelnic, Percy; Faibish, Sorin; Gupta, Uday K.; Bent, John; Grider, Gary Alan; Chen, Hsing -bung

2015-10-13

Solid-State Drive (SSD) burst buffer nodes are interposed into a parallel supercomputing cluster to enable fast burst checkpoint of cluster memory to or from nearby interconnected solid-state storage with asynchronous migration between the burst buffer nodes and slower more distant disk storage. The SSD nodes also perform tasks offloaded from the compute nodes or associated with the checkpoint data. For example, the data for the next job is preloaded in the SSD node and very fast uploaded to the respective compute node just before the next job starts. During a job, the SSD nodes perform fast visualization and statistical analysis upon the checkpoint data. The SSD nodes can also perform data reduction and encryption of the checkpoint data.
Bonsai: an event-based framework for processing and controlling data streams

PubMed Central

Lopes, Gonçalo; Bonacchi, Niccolò; Frazão, João; Neto, Joana P.; Atallah, Bassam V.; Soares, Sofia; Moreira, Luís; Matias, Sara; Itskov, Pavel M.; Correia, Patrícia A.; Medina, Roberto E.; Calcaterra, Lorenza; Dreosti, Elena; Paton, Joseph J.; Kampff, Adam R.

2015-01-01

The design of modern scientific experiments requires the control and monitoring of many different data streams. However, the serial execution of programming instructions in a computer makes it a challenge to develop software that can deal with the asynchronous, parallel nature of scientific data. Here we present Bonsai, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams. We describe Bonsai's core principles and architecture and demonstrate how it allows for the rapid and flexible prototyping of integrated experimental designs in neuroscience. We specifically highlight some applications that require the combination of many different hardware and software components, including video tracking of behavior, electrophysiology and closed-loop control of stimulation. PMID:25904861
An iterative method for the localization of a neutron source in a large box (container)

NASA Astrophysics Data System (ADS)

Dubinski, S.; Presler, O.; Alfassi, Z. B.

2007-12-01

The localization of an unknown neutron source in a bulky box was studied. This can be used for the inspection of cargo, to prevent the smuggling of neutron and α emitters. It is important to localize the source from the outside for safety reasons. Source localization is necessary in order to determine its activity. A previous study showed that, by using six detectors, three on each parallel face of the box (460×420×200 mm 3), the location of the source can be found with an average distance of 4.73 cm between the real source position and the calculated one and a maximal distance of about 9 cm. Accuracy was improved in this work by applying an iteration method based on four fixed detectors and the successive iteration of positioning of an external calibrating source. The initial positioning of the calibrating source is the plane of detectors 1 and 2. This method finds the unknown source location with an average distance of 0.78 cm between the real source position and the calculated one and a maximum distance of 3.66 cm for the same box. For larger boxes, localization without iterations requires an increase in the number of detectors, while localization with iterations requires only an increase in the number of iteration steps. In addition to source localization, two methods for determining the activity of the unknown source were also studied.
Preconditioned conjugate gradient methods for the compressible Navier-Stokes equations

NASA Technical Reports Server (NTRS)

Venkatakrishnan, V.

1990-01-01

The compressible Navier-Stokes equations are solved for a variety of two-dimensional inviscid and viscous problems by preconditioned conjugate gradient-like algorithms. Roe's flux difference splitting technique is used to discretize the inviscid fluxes. The viscous terms are discretized by using central differences. An algebraic turbulence model is also incorporated. The system of linear equations which arises out of the linearization of a fully implicit scheme is solved iteratively by the well known methods of GMRES (Generalized Minimum Residual technique) and Chebyschev iteration. Incomplete LU factorization and block diagonal factorization are used as preconditioners. The resulting algorithm is competitive with the best current schemes, but has wide applications in parallel computing and unstructured mesh computations.
Domain decomposition methods in computational fluid dynamics

NASA Technical Reports Server (NTRS)

Gropp, William D.; Keyes, David E.

1991-01-01

The divide-and-conquer paradigm of iterative domain decomposition, or substructuring, has become a practical tool in computational fluid dynamic applications because of its flexibility in accommodating adaptive refinement through locally uniform (or quasi-uniform) grids, its ability to exploit multiple discretizations of the operator equations, and the modular pathway it provides towards parallelism. These features are illustrated on the classic model problem of flow over a backstep using Newton's method as the nonlinear iteration. Multiple discretizations (second-order in the operator and first-order in the preconditioner) and locally uniform mesh refinement pay dividends separately, and they can be combined synergistically. Sample performance results are included from an Intel iPSC/860 hypercube implementation.
Preliminary design for a standard 10 sup 7 bit Solid State Memory (SSM)

NASA Technical Reports Server (NTRS)

Hayes, P. J.; Howle, W. M., Jr.; Stermer, R. L., Jr.

1978-01-01

A modular concept with three separate modules roughly separating bubble domain technology, control logic technology, and power supply technology was employed. These modules were respectively the standard memory module (SMM), the data control unit (DCU), and power supply module (PSM). The storage medium was provided by bubble domain chips organized into memory cells. These cells and the circuitry for parallel data access to the cells make up the SMM. The DCU provides a flexible serial data interface to the SMM. The PSM provides adequate power to enable one DCU and one SMM to operate simultaneously at the maximum data rate. The SSM was designed to handle asynchronous data rates from dc to 1.024 Mbs with a bit error rate less than 1 error in 10 to the eight power bits. Two versions of the SSM, a serial data memory and a dual parallel data memory were specified using the standard modules. The SSM specification includes requirements for radiation hardness, temperature and mechanical environments, dc magnetic field emission and susceptibility, electromagnetic compatibility, and reliability.
Spatial and temporal accuracy of asynchrony-tolerant finite difference schemes for partial differential equations at extreme scales

NASA Astrophysics Data System (ADS)

Kumari, Komal; Donzis, Diego

2017-11-01

Highly resolved computational simulations on massively parallel machines are critical in understanding the physics of a vast number of complex phenomena in nature governed by partial differential equations. Simulations at extreme levels of parallelism present many challenges with communication between processing elements (PEs) being a major bottleneck. In order to fully exploit the computational power of exascale machines one needs to devise numerical schemes that relax global synchronizations across PEs. This asynchronous computations, however, have a degrading effect on the accuracy of standard numerical schemes.We have developed asynchrony-tolerant (AT) schemes that maintain order of accuracy despite relaxed communications. We show, analytically and numerically, that these schemes retain their numerical properties with multi-step higher order temporal Runge-Kutta schemes. We also show that for a range of optimized parameters,the computation time and error for AT schemes is less than their synchronous counterpart. Stability of the AT schemes which depends upon history and random nature of delays, are also discussed. Support from NSF is gratefully acknowledged.
Material Implementation of Hyperincursive Field on Slime Mold Computer

NASA Astrophysics Data System (ADS)

Aono, Masashi; Gunji, Yukio-Pegio

2004-08-01

"Elementary Conflictable Cellular Automaton (ECCA)" was introduced by Aono and Gunji as a problematic computational syntax embracing the non-deterministic/non-algorithmic property due to its hyperincursivity and nonlocality. Although ECCA's hyperincursive evolution equation indicates the occurrence of the deadlock/infinite-loop, we do not consider that this problem declares the fundamental impossibility of implementing ECCA materially. Dubois proposed to call a computing system where uncertainty/contradiction occurs "the hyperincursive field". In this paper we introduce a material implementation of the hyperincursive field by using plasmodia of the true slime mold Physarum polycephalum. The amoeboid organism is adopted as a computing media of ECCA slime mold computer (ECCA-SMC) mainly because; it is a parallel non-distributed system whose locally branched tips (components) can act in parallel with asynchronism and nonlocal correlation. A notable characteristic of ECCA-SMC is that a cell representing a spatio-temporal segment of computation is occupied (overlapped) redundantly by multiple spatially adjacent computing operations and by temporally successive computing events. The overlapped time representation may contribute to the progression of discussions on unconventional notions of the time.
Parallel approach for bioinspired algorithms

NASA Astrophysics Data System (ADS)

Zaporozhets, Dmitry; Zaruba, Daria; Kulieva, Nina

2018-05-01

In the paper, a probabilistic parallel approach based on the population heuristic, such as a genetic algorithm, is suggested. The authors proposed using a multithreading approach at the micro level at which new alternative solutions are generated. On each iteration, several threads that independently used the same population to generate new solutions can be started. After the work of all threads, a selection operator combines obtained results in the new population. To confirm the effectiveness of the suggested approach, the authors have developed software on the basis of which experimental computations can be carried out. The authors have considered a classic optimization problem – finding a Hamiltonian cycle in a graph. Experiments show that due to the parallel approach at the micro level, increment of running speed can be obtained on graphs with 250 and more vertices.

ITER Central Solenoid Module Fabrication

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, John

The fabrication of the modules for the ITER Central Solenoid (CS) has started in a dedicated production facility located in Poway, California, USA. The necessary tools have been designed, built, installed, and tested in the facility to enable the start of production. The current schedule has first module fabrication completed in 2017, followed by testing and subsequent shipment to ITER. The Central Solenoid is a key component of the ITER tokamak providing the inductive voltage to initiate and sustain the plasma current and to position and shape the plasma. The design of the CS has been a collaborative effort betweenmore » the US ITER Project Office (US ITER), the international ITER Organization (IO) and General Atomics (GA). GA’s responsibility includes: completing the fabrication design, developing and qualifying the fabrication processes and tools, and then completing the fabrication of the seven 110 tonne CS modules. The modules will be shipped separately to the ITER site, and then stacked and aligned in the Assembly Hall prior to insertion in the core of the ITER tokamak. A dedicated facility in Poway, California, USA has been established by GA to complete the fabrication of the seven modules. Infrastructure improvements included thick reinforced concrete floors, a diesel generator for backup power, along with, cranes for moving the tooling within the facility. The fabrication process for a single module requires approximately 22 months followed by five months of testing, which includes preliminary electrical testing followed by high current (48.5 kA) tests at 4.7K. The production of the seven modules is completed in a parallel fashion through ten process stations. The process stations have been designed and built with most stations having completed testing and qualification for carrying out the required fabrication processes. The final qualification step for each process station is achieved by the successful production of a prototype coil. Fabrication of the first ITER module is in progress. The seven modules will be individually shipped to Cadarache, France upon their completion. This paper describes the processes and status of the fabrication of the CS Modules for ITER.« less
Fostering Critical Reflection in a Computer-Based, Asynchronously Delivered Diversity Training Course

ERIC Educational Resources Information Center

Givhan, Shawn T.

2013-01-01

This dissertation study chronicles the creation of a computer-based, asynchronously delivered diversity training course for a state agency. The course format enabled efficient delivery of a mandatory curriculum to the Massachusetts Department of State Police workforce. However, the asynchronous format posed a challenge to achieving the learning…
Anonymity and Motivation in Asynchronous Discussions and L2 Vocabulary Learning

ERIC Educational Resources Information Center

Polat, Nihat; Mancilla, Rae; Mahalingappa, Laura

2013-01-01

This study investigates L2 attainment in asynchronous online environments, specifically possible relationships among anonymity, L2 motivation, participation in discussions, quality of L2 production, and success in L2 vocabulary learning. It examines, in asynchronous discussions, (a) if participation and (b) motivation contribute to L2 vocabulary…
Exploring the Effect of Scripted Roles on Cognitive Presence in Asynchronous Online Discussions

ERIC Educational Resources Information Center

Olesova, Larisa; Slavin, Margaret; Lim, Jieun

2016-01-01

The purpose of this study was to identify the effect of scripted roles on students' level of cognitive presence in asynchronous online threaded discussions. A quantitative content analysis was used to investigate: (1) what level of cognitive presence is achieved by students' assigned roles in asynchronous online discussions; (2) differences…
A Multi-Perspective Investigation into Learners' Interaction in Asynchronous Computer-Mediated Communication (CMC)

ERIC Educational Resources Information Center

Çardak, Çigdem Suzan

2016-01-01

This article focusses on graduate level students' interactions during asynchronous CMC activities of an online course about the teaching profession in Turkey. The instructor of the course designed and facilitated a semester-long asynchronous CMC on forum discussions, and investigated the interaction of learners in multiple perspectives: learners'…
The Negotiation Model in Asynchronous Computer-Mediated Communication (CMC): Negotiation in Task-Based Email Exchanges

ERIC Educational Resources Information Center

Kitade, Keiko

2006-01-01

Based on recent studies, computer-mediated communication (CMC) has been considered a tool to aid in language learning on account of its distinctive interactional features. However, most studies have referred to "synchronous" CMC and neglected to investigate how "asynchronous" CMC contributes to language learning. Asynchronous CMC possesses…
Integrating the Intangibles into Asynchronous Online Instruction: Strategies for Improving Interaction and Social Presence

ERIC Educational Resources Information Center

McGuire, Beverley Foulks

2016-01-01

This paper considers how instructors of asynchronous online courses in the Humanities might integrate intangibles associated with face-to-face instruction into their online environments. It presents a case study of asynchronous online instruction in a philosophy and religion department at a midsize public university in the southeastern United…
Postgraduate Students' Knowledge Construction during Asynchronous Computer Conferences in a Blended Learning Environment: A Malaysian Experience

ERIC Educational Resources Information Center

Kian-Sam, Hong; Lee, Julia Ai Cheng

2008-01-01

Blended learning, using e-learning tools to supplement existing on campus learning, often incorporates asynchronous computer conferencing as a means of augmenting knowledge construction among students. This case study reports findings about levels of knowledge construction amongst adult postgraduate students in six asynchronous computer…
Asynchronous Learning Sources in a High-Tech Organization

ERIC Educational Resources Information Center

Bouhnik, Dan; Giat, Yahel; Sanderovitch, Yafit

2009-01-01

Purpose: The purpose of this study is to characterize learning from asynchronous sources among research and development (R&D) personnel. It aims to examine four aspects of asynchronous source learning: employee preferences regarding self-learning; extent of source usage; employee satisfaction with these sources and the effect of the sources on the…
Asynchronous reference frame agreement in a quantum network

NASA Astrophysics Data System (ADS)

Islam, Tanvirul; Wehner, Stephanie

2016-03-01

An efficient implementation of many multiparty protocols for quantum networks requires that all the nodes in the network share a common reference frame. Establishing such a reference frame from scratch is especially challenging in an asynchronous network where network links might have arbitrary delays and the nodes do not share synchronised clocks. In this work, we study the problem of establishing a common reference frame in an asynchronous network of n nodes of which at most t are affected by arbitrary unknown error, and the identities of the faulty nodes are not known. We present a protocol that allows all the correctly functioning nodes to agree on a common reference frame as long as the network graph is complete and not more than t\\lt n/4 nodes are faulty. As the protocol is asynchronous, it can be used with some assumptions to synchronise clocks over a network. Also, the protocol has the appealing property that it allows any existing two-node asynchronous protocol for reference frame agreement to be lifted to a robust protocol for an asynchronous quantum network.
Parallel Symmetric Eigenvalue Problem Solvers

DTIC Science & Technology

2015-05-01

get research, tutoring, and mentoring experience as an undergraduate. Last but not least, I thank my family for their love and support. v TABLE OF...32 4.6.2 Choice of the Ritz shifts . . . . . . . . . . . . . . . . . . . . 37 4.7 Relationship between...pencil. I will conclude with a discussion of the relationship between Trace- Min and simultaneous iteration. If both methods solve the linear systems
Tokamak und Stellarator - zwei Wege zur Fusionsenergie: Fusionsforschung

NASA Astrophysics Data System (ADS)

Milch, Isabella

2006-07-01

Im Laufe der Fusionsforschung haben sich zwei Bautypen für ein zukünftiges Kraftwerk als besonders aussichtsreich erwiesen: Tokamak und Stellarator. Mit dem geplanten Tokamak-Experimentalreaktor ITER steht die internationale Fusionsforschung vor der Demonstration eines Energie liefernden Plasmas. Parallel soll die in Greifswald entstehende Forschungsanlage Wendelstein 7-X die Kraftwerkstauglichkeit des alternativen Bauprinzips der Stellaratoren zeigen.
A fast iterative scheme for the linearized Boltzmann equation

NASA Astrophysics Data System (ADS)

Wu, Lei; Zhang, Jun; Liu, Haihu; Zhang, Yonghao; Reese, Jason M.

2017-06-01

Iterative schemes to find steady-state solutions to the Boltzmann equation are efficient for highly rarefied gas flows, but can be very slow to converge in the near-continuum flow regime. In this paper, a synthetic iterative scheme is developed to speed up the solution of the linearized Boltzmann equation by penalizing the collision operator L into the form L = (L + Nδh) - Nδh, where δ is the gas rarefaction parameter, h is the velocity distribution function, and N is a tuning parameter controlling the convergence rate. The velocity distribution function is first solved by the conventional iterative scheme, then it is corrected such that the macroscopic flow velocity is governed by a diffusion-type equation that is asymptotic-preserving into the Navier-Stokes limit. The efficiency of this new scheme is assessed by calculating the eigenvalue of the iteration, as well as solving for Poiseuille and thermal transpiration flows. We find that the fastest convergence of our synthetic scheme for the linearized Boltzmann equation is achieved when Nδ is close to the average collision frequency. The synthetic iterative scheme is significantly faster than the conventional iterative scheme in both the transition and the near-continuum gas flow regimes. Moreover, due to its asymptotic-preserving properties, the synthetic iterative scheme does not need high spatial resolution in the near-continuum flow regime, which makes it even faster than the conventional iterative scheme. Using this synthetic scheme, with the fast spectral approximation of the linearized Boltzmann collision operator, Poiseuille and thermal transpiration flows between two parallel plates, through channels of circular/rectangular cross sections and various porous media are calculated over the whole range of gas rarefaction. Finally, the flow of a Ne-Ar gas mixture is solved based on the linearized Boltzmann equation with the Lennard-Jones intermolecular potential for the first time, and the difference between these results and those using the hard-sphere potential is discussed.
Multi-GPU Acceleration of Branchless Distance Driven Projection and Backprojection for Clinical Helical CT.

PubMed

Mitra, Ayan; Politte, David G; Whiting, Bruce R; Williamson, Jeffrey F; O'Sullivan, Joseph A

2017-01-01

Model-based image reconstruction (MBIR) techniques have the potential to generate high quality images from noisy measurements and a small number of projections which can reduce the x-ray dose in patients. These MBIR techniques rely on projection and backprojection to refine an image estimate. One of the widely used projectors for these modern MBIR based technique is called branchless distance driven (DD) projection and backprojection. While this method produces superior quality images, the computational cost of iterative updates keeps it from being ubiquitous in clinical applications. In this paper, we provide several new parallelization ideas for concurrent execution of the DD projectors in multi-GPU systems using CUDA programming tools. We have introduced some novel schemes for dividing the projection data and image voxels over multiple GPUs to avoid runtime overhead and inter-device synchronization issues. We have also reduced the complexity of overlap calculation of the algorithm by eliminating the common projection plane and directly projecting the detector boundaries onto image voxel boundaries. To reduce the time required for calculating the overlap between the detector edges and image voxel boundaries, we have proposed a pre-accumulation technique to accumulate image intensities in perpendicular 2D image slabs (from a 3D image) before projection and after backprojection to ensure our DD kernels run faster in parallel GPU threads. For the implementation of our iterative MBIR technique we use a parallel multi-GPU version of the alternating minimization (AM) algorithm with penalized likelihood update. The time performance using our proposed reconstruction method with Siemens Sensation 16 patient scan data shows an average of 24 times speedup using a single TITAN X GPU and 74 times speedup using 3 TITAN X GPUs in parallel for combined projection and backprojection.
Linear scaling computation of the Fock matrix. VI. Data parallel computation of the exchange-correlation matrix

NASA Astrophysics Data System (ADS)

Gan, Chee Kwan; Challacombe, Matt

2003-05-01

Recently, early onset linear scaling computation of the exchange-correlation matrix has been achieved using hierarchical cubature [J. Chem. Phys. 113, 10037 (2000)]. Hierarchical cubature differs from other methods in that the integration grid is adaptive and purely Cartesian, which allows for a straightforward domain decomposition in parallel computations; the volume enclosing the entire grid may be simply divided into a number of nonoverlapping boxes. In our data parallel approach, each box requires only a fraction of the total density to perform the necessary numerical integrations due to the finite extent of Gaussian-orbital basis sets. This inherent data locality may be exploited to reduce communications between processors as well as to avoid memory and copy overheads associated with data replication. Although the hierarchical cubature grid is Cartesian, naive boxing leads to irregular work loads due to strong spatial variations of the grid and the electron density. In this paper we describe equal time partitioning, which employs time measurement of the smallest sub-volumes (corresponding to the primitive cubature rule) to load balance grid-work for the next self-consistent-field iteration. After start-up from a heuristic center of mass partitioning, equal time partitioning exploits smooth variation of the density and grid between iterations to achieve load balance. With the 3-21G basis set and a medium quality grid, equal time partitioning applied to taxol (62 heavy atoms) attained a speedup of 61 out of 64 processors, while for a 110 molecule water cluster at standard density it achieved a speedup of 113 out of 128. The efficiency of equal time partitioning applied to hierarchical cubature improves as the grid work per processor increases. With a fine grid and the 6-311G(df,p) basis set, calculations on the 26 atom molecule α-pinene achieved a parallel efficiency better than 99% with 64 processors. For more coarse grained calculations, superlinear speedups are found to result from reduced computational complexity associated with data parallelism.
Pediatric emergency medicine asynchronous e-learning: a multicenter randomized controlled Solomon four-group study.

PubMed

Chang, Todd P; Pham, Phung K; Sobolewski, Brad; Doughty, Cara B; Jamal, Nazreen; Kwan, Karen Y; Little, Kim; Brenkert, Timothy E; Mathison, David J

2014-08-01

Asynchronous e-learning allows for targeted teaching, particularly advantageous when bedside and didactic education is insufficient. An asynchronous e-learning curriculum has not been studied across multiple centers in the context of a clinical rotation. We hypothesize that an asynchronous e-learning curriculum during the pediatric emergency medicine (EM) rotation improves medical knowledge among residents and students across multiple participating centers. Trainees on pediatric EM rotations at four large pediatric centers from 2012 to 2013 were randomized in a Solomon four-group design. The experimental arms received an asynchronous e-learning curriculum consisting of nine Web-based, interactive, peer-reviewed Flash/HTML5 modules. Postrotation testing and in-training examination (ITE) scores quantified improvements in knowledge. A 2 × 2 analysis of covariance (ANCOVA) tested interaction and main effects, and Pearson's correlation tested associations between module usage, scores, and ITE scores. A total of 256 of 458 participants completed all study elements; 104 had access to asynchronous e-learning modules, and 152 were controls who used the current education standards. No pretest sensitization was found (p = 0.75). Use of asynchronous e-learning modules was associated with an improvement in posttest scores (p < 0.001), from a mean score of 18.45 (95% confidence interval [CI] = 17.92 to 18.98) to 21.30 (95% CI = 20.69 to 21.91), a large effect (partial η(2) = 0.19). Posttest scores correlated with ITE scores (r(2) = 0.14, p < 0.001) among pediatric residents. Asynchronous e-learning is an effective educational tool to improve knowledge in a clinical rotation. Web-based asynchronous e-learning is a promising modality to standardize education among multiple institutions with common curricula, particularly in clinical rotations where scheduling difficulties, seasonality, and variable experiences limit in-hospital learning. © 2014 by the Society for Academic Emergency Medicine.
Modeling and Analysis of Asynchronous Systems Using SAL and Hybrid SAL

NASA Technical Reports Server (NTRS)

Tiwari, Ashish; Dutertre, Bruno

2013-01-01

We present formal models and results of formal analysis of two different asynchronous systems. We first examine a mid-value select module that merges the signals coming from three different sensors that are each asynchronously sampling the same input signal. We then consider the phase locking protocol proposed by Daly, Hopkins, and McKenna. This protocol is designed to keep a set of non-faulty (asynchronous) clocks phase locked even in the presence of Byzantine-faulty clocks on the network. All models and verifications have been developed using the SAL model checking tools and the Hybrid SAL abstractor.
Detection of Failure in Asynchronous Motor Using Soft Computing Method

NASA Astrophysics Data System (ADS)

Vinoth Kumar, K.; Sony, Kevin; Achenkunju John, Alan; Kuriakose, Anto; John, Ano P.

2018-04-01

This paper investigates the stator short winding failure of asynchronous motor also their effects on motor current spectrums. A fuzzy logic approach i.e., model based technique possibly will help to detect the asynchronous motor failure. Actually, fuzzy logic similar to humanoid intelligent methods besides expected linguistic empowering inferences through vague statistics. The dynamic model is technologically advanced for asynchronous motor by means of fuzzy logic classifier towards investigate the stator inter turn failure in addition open phase failure. A hardware implementation was carried out with LabVIEW for the online-monitoring of faults.
Asynchronous discrete control of continuous processes

NASA Astrophysics Data System (ADS)

Kaliski, M. E.; Johnson, T. L.

1984-07-01

The research during this second contract year continued to deal with the development of sound theoretical models for asynchronous systems. Two criteria served to shape the research pursued: the first, that the developed models extend and generalize previously developed research for synchronous discrete control; the second, that the models explicitly address the question of how to incorporate system transition times into themselves. The following sections of this report concisely delineate this year's work. Our original proposal for this research identified four general tasks of investigation: (1.1) Analysis of Qualitative Properties of Asynchronous Hybrid Systems; (1.2) Acceptance and Control for Asynchronous Hybrid Systems.
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

PubMed

Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

2015-12-01

A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.

Iterative methods for 3D implicit finite-difference migration using the complex Padé approximation

NASA Astrophysics Data System (ADS)

Costa, Carlos A. N.; Campos, Itamara S.; Costa, Jessé C.; Neto, Francisco A.; Schleicher, Jörg; Novais, Amélia

2013-08-01

Conventional implementations of 3D finite-difference (FD) migration use splitting techniques to accelerate performance and save computational cost. However, such techniques are plagued with numerical anisotropy that jeopardises the correct positioning of dipping reflectors in the directions not used for the operator splitting. We implement 3D downward continuation FD migration without splitting using a complex Padé approximation. In this way, the numerical anisotropy is eliminated at the expense of a computationally more intensive solution of a large-band linear system. We compare the performance of the iterative stabilized biconjugate gradient (BICGSTAB) and that of the multifrontal massively parallel direct solver (MUMPS). It turns out that the use of the complex Padé approximation not only stabilizes the solution, but also acts as an effective preconditioner for the BICGSTAB algorithm, reducing the number of iterations as compared to the implementation using the real Padé expansion. As a consequence, the iterative BICGSTAB method is more efficient than the direct MUMPS method when solving a single term in the Padé expansion. The results of both algorithms, here evaluated by computing the migration impulse response in the SEG/EAGE salt model, are of comparable quality.
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arampatzis, Giorgos, E-mail: garab@math.uoc.gr; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Plechac, Petr, E-mail: plechac@math.udel.edu

2012-10-01

We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decompositionmore » corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel Fractional step kinetic Monte Carlo algorithms by employing the Trotter Theorem and its randomized variants; these schemes, (a) are partially asynchronous on each fractional step time-window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example, in Ising-type systems and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss work load balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.« less
Combining Live Video and Audio Broadcasting, Synchronous Chat, and Asynchronous Open Forum Discussions in Distance Education

ERIC Educational Resources Information Center

Teng, Tian-Lih; Taveras, Marypat

2004-01-01

This article outlines the evolution of a unique distance education program that began as a hybrid--combining face-to-face instruction with asynchronous online teaching--and evolved to become an innovative combination of synchronous education using live streaming video, audio, and chat over the Internet, blended with asynchronous online discussions…
Utilizing Spectrum Efficiently (USE)

DTIC Science & Technology

2011-02-28

18 4.8 Space-Time Coded Asynchronous DS - CDMA with Decentralized MAI Suppression: Performance and...numerical results. 4.8 Space-Time Coded Asynchronous DS - CDMA with Decentralized MAI Suppression: Performance and Spectral Efficiency In [60] multiple...supported at a given signal-to-interference ratio in asynchronous direct-sequence code-division multiple-access ( DS - CDMA ) sys- tems was examined. It was
The Development, Validity, and Reliability of Communication Satisfaction in an Online Asynchronous Discussion Scale

ERIC Educational Resources Information Center

Hung, Min-Ling; Chou, Chien

2014-01-01

The purpose of this study was to identify dimensions of students' communication satisfaction in an asynchronous discussion forum. An asynchronous discussion may be defined as text-based human-to-human communication via computer networks that provides a platform for the participants to interact with one another to exchange ideas, insights, and…
Web-based Cases in Teaching and Learning - the Quality of Discussions and a Stage of Perspective Taking in Asynchronous Communication.

ERIC Educational Resources Information Center

Jarvela, Sanna; Hakkinen, Paivi

2002-01-01

Examines the quality of asynchronous interaction in Web-based conferencing among preservice teachers. The study combines asynchronous conferencing with peer and mentor collaboration to electronically apprentice student learning. Results point out different levels of Web-based discussion: higher-level, progressive, and lower-level discussion. A…
Asynchronous vs didactic education: it’s too early to throw in the towel on tradition

PubMed Central

2013-01-01

Background Asynchronous, computer based instruction is cost effective, allows self-directed pacing and review, and addresses preferences of millennial learners. Current research suggests there is no significant difference in learning compared to traditional classroom instruction. Data are limited for novice learners in emergency medicine. The objective of this study was to compare asynchronous, computer-based instruction with traditional didactics for senior medical students during a week-long intensive course in acute care. We hypothesized both modalities would be equivalent. Methods This was a prospective observational quasi-experimental study of 4th year medical students who were novice learners with minimal prior exposure to curricular elements. We assessed baseline knowledge with an objective pre-test. The curriculum was delivered in either traditional lecture format (shock, acute abdomen, dyspnea, field trauma) or via asynchronous, computer-based modules (chest pain, EKG interpretation, pain management, trauma). An interactive review covering all topics was followed by a post-test. Knowledge retention was measured after 10 weeks. Pre and post-test items were written by a panel of medical educators and validated with a reference group of learners. Mean scores were analyzed using dependent t-test and attitudes were assessed by a 5-point Likert scale. Results 44 of 48 students completed the protocol. Students initially acquired more knowledge from didactic education as demonstrated by mean gain scores (didactic: 28.39% ± 18.06; asynchronous 9.93% ± 23.22). Mean difference between didactic and asynchronous = 18.45% with 95% CI [10.40 to 26.50]; p = 0.0001. Retention testing demonstrated similar knowledge attrition: mean gain scores −14.94% (didactic); -17.61% (asynchronous), which was not significantly different: 2.68% ± 20.85, 95% CI [−3.66 to 9.02], p = 0.399. The attitudinal survey revealed that 60.4% of students believed the asynchronous modules were educational and 95.8% enjoyed the flexibility of the method. 39.6% of students preferred asynchronous education for required didactics; 37.5% were neutral; 23% preferred traditional lectures. Conclusions Asynchronous, computer-based instruction was not equivalent to traditional didactics for novice learners of acute care topics. Interactive, standard didactic education was valuable. Retention rates were similar between instructional methods. Students had mixed attitudes toward asynchronous learning but enjoyed the flexibility. We urge caution in trading in traditional didactic lectures in favor of asynchronous education for novice learners in acute care. PMID:23927420
[A novel proposal explaining sleep disturbance of children in Japan--asynchronization].

PubMed

Kohyama, Jun

2008-07-01

It has been reported that more than half of the children in Japan suffer from daytime sleepiness. In contrast, about one quarter of junior high-school students in Japan complain of insomnia. According to the International Classification of Sleep Disorders (Second edition), these children could be diagnosed as having behaviorally-induced insufficient sleep syndrome due to inadequate sleeping habits. Getting on adequate amount of sleep should solve such problems;however, such a therapeutic approach often fails. Although social factors are involved in these sleep disturbances, I feel that a novel notion - asynchronization - leads to an understanding of the pathophysiology of disturbances in these children. Further, it could contribute to resolve their problems. The essence of asynchronization is a disturbance of various aspects (e.g., cycle, amplitude, phase, and interrelationship) of the biological rhythms that normally exhibits circadian oscillation. The main cause of asynchronization is hypothesized to be the combination of light exposure during night and the lack of light exposure in the morning. Asynchronization results in the disturbance of variable systems. Thus, symptoms of asynchronization include disturbances of the autonomic nervous system (sleepiness, insomnia, disturbance of hormonal excretion, gastrointestinal problems, etc.) and higher brain function (disorientation, loss of sociality, loss of will or motivation, impaired alertness and performance, etc.). Neurological (attention deficit, aggression, impulsiveness, hyperactivity, etc.), psychiatric (depressive disorders, personality disorders, anxiety disorders, etc.) and somatic (tiredness, fatigue, etc.) disturbances could also be symptoms of asynchronization. At the initial phase of asynchronization, disturbances are functional and can be resolved relatively easily, such as by the establishment of a regular sleep-wakefulness cycle;however, without adequate intervention the disturbances could gradually worsen and become hard to resolve.
Region of interest processing for iterative reconstruction in x-ray computed tomography

NASA Astrophysics Data System (ADS)

Kopp, Felix K.; Nasirudin, Radin A.; Mei, Kai; Fehringer, Andreas; Pfeiffer, Franz; Rummeny, Ernst J.; Noël, Peter B.

2015-03-01

The recent advancements in the graphics card technology raised the performance of parallel computing and contributed to the introduction of iterative reconstruction methods for x-ray computed tomography in clinical CT scanners. Iterative maximum likelihood (ML) based reconstruction methods are known to reduce image noise and to improve the diagnostic quality of low-dose CT. However, iterative reconstruction of a region of interest (ROI), especially ML based, is challenging. But for some clinical procedures, like cardiac CT, only a ROI is needed for diagnostics. A high-resolution reconstruction of the full field of view (FOV) consumes unnecessary computation effort that results in a slower reconstruction than clinically acceptable. In this work, we present an extension and evaluation of an existing ROI processing algorithm. Especially improvements for the equalization between regions inside and outside of a ROI are proposed. The evaluation was done on data collected from a clinical CT scanner. The performance of the different algorithms is qualitatively and quantitatively assessed. Our solution to the ROI problem provides an increase in signal-to-noise ratio and leads to visually less noise in the final reconstruction. The reconstruction speed of our technique was observed to be comparable with other previous proposed techniques. The development of ROI processing algorithms in combination with iterative reconstruction will provide higher diagnostic quality in the near future.
Stand-Alone and Hybrid Positioning Using Asynchronous Pseudolites

PubMed Central

Gioia, Ciro; Borio, Daniele

2015-01-01

global navigation satellite system (GNSS) receivers are usually unable to achieve satisfactory performance in difficult environments, such as open-pit mines, urban canyons and indoors. Pseudolites have the potential to extend GNSS usage and significantly improve receiver performance in such environments by providing additional navigation signals. This also applies to asynchronous pseudolite systems, where different pseudolites operate in an independent way. Asynchronous pseudolite systems require, however, dedicated strategies in order to properly integrate GNSS and pseudolite measurements. In this paper, several asynchronous pseudolite/GNSS integration strategies are considered: loosely- and tightly-coupled approaches are developed and combined with pseudolite proximity and receiver signal strength (RSS)-based positioning. The performance of the approaches proposed has been tested in different scenarios, including static and kinematic conditions. The tests performed demonstrate that the methods developed are effective techniques for integrating heterogeneous measurements from different sources, such as asynchronous pseudolites and GNSS. PMID:25609041
Distributed Consensus of Stochastic Delayed Multi-agent Systems Under Asynchronous Switching.

PubMed

Wu, Xiaotai; Tang, Yang; Cao, Jinde; Zhang, Wenbing

2016-08-01

In this paper, the distributed exponential consensus of stochastic delayed multi-agent systems with nonlinear dynamics is investigated under asynchronous switching. The asynchronous switching considered here is to account for the time of identifying the active modes of multi-agent systems. After receipt of confirmation of mode's switching, the matched controller can be applied, which means that the switching time of the matched controller in each node usually lags behind that of system switching. In order to handle the coexistence of switched signals and stochastic disturbances, a comparison principle of stochastic switched delayed systems is first proved. By means of this extended comparison principle, several easy to verified conditions for the existence of an asynchronously switched distributed controller are derived such that stochastic delayed multi-agent systems with asynchronous switching and nonlinear dynamics can achieve global exponential consensus. Two examples are given to illustrate the effectiveness of the proposed method.
Hydrodynamic advantages of swimming by salp chains.

PubMed

Sutherland, Kelly R; Weihs, Daniel

2017-08-01

Salps are marine invertebrates comprising multiple jet-propelled swimming units during a colonial life-cycle stage. Using theory, we show that asynchronous swimming with multiple pulsed jets yields substantial hydrodynamic benefit due to the production of steady swimming velocities, which limit drag. Laboratory comparisons of swimming kinematics of aggregate salps ( Salpa fusiformis and Weelia cylindrica ) using high-speed video supported that asynchronous swimming by aggregates results in a smoother velocity profile and showed that this smoother velocity profile is the result of uncoordinated, asynchronous swimming by individual zooids. In situ flow visualizations of W. cylindrica swimming wakes revealed that another consequence of asynchronous swimming is that fluid interactions between jet wakes are minimized. Although the advantages of multi-jet propulsion have been mentioned elsewhere, this is the first time that the theory has been quantified and the role of asynchronous swimming verified using experimental data from the laboratory and the field. © 2017 The Author(s).
Reducing Design Cycle Time and Cost Through Process Resequencing

NASA Technical Reports Server (NTRS)

Rogers, James L.

2004-01-01

In today's competitive environment, companies are under enormous pressure to reduce the time and cost of their design cycle. One method for reducing both time and cost is to develop an understanding of the flow of the design processes and the effects of the iterative subcycles that are found in complex design projects. Once these aspects are understood, the design manager can make decisions that take advantage of decomposition, concurrent engineering, and parallel processing techniques to reduce the total time and the total cost of the design cycle. One software tool that can aid in this decision-making process is the Design Manager's Aid for Intelligent Decomposition (DeMAID). The DeMAID software minimizes the feedback couplings that create iterative subcycles, groups processes into iterative subcycles, and decomposes the subcycles into a hierarchical structure. The real benefits of producing the best design in the least time and at a minimum cost are obtained from sequencing the processes in the subcycles.
On the inversion of geodetic integrals defined over the sphere using 1-D FFT

NASA Astrophysics Data System (ADS)

García, R. V.; Alejo, C. A.

2005-08-01

An iterative method is presented which performs inversion of integrals defined over the sphere. The method is based on one-dimensional fast Fourier transform (1-D FFT) inversion and is implemented with the projected Landweber technique, which is used to solve constrained least-squares problems reducing the associated 1-D cyclic-convolution error. The results obtained are as precise as the direct matrix inversion approach, but with better computational efficiency. A case study uses the inversion of Hotine’s integral to obtain gravity disturbances from geoid undulations. Numerical convergence is also analyzed and comparisons with respect to the direct matrix inversion method using conjugate gradient (CG) iteration are presented. Like the CG method, the number of iterations needed to get the optimum (i.e., small) error decreases as the measurement noise increases. Nevertheless, for discrete data given over a whole parallel band, the method can be applied directly without implementing the projected Landweber method, since no cyclic convolution error exists.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.

PubMed

Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T

2017-01-01

Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
Solutions of large-scale electromagnetics problems involving dielectric objects with the parallel multilevel fast multipole algorithm.

PubMed

Ergül, Özgür

2011-11-01

Fast and accurate solutions of large-scale electromagnetics problems involving homogeneous dielectric objects are considered. Problems are formulated with the electric and magnetic current combined-field integral equation and discretized with the Rao-Wilton-Glisson functions. Solutions are performed iteratively by using the multilevel fast multipole algorithm (MLFMA). For the solution of large-scale problems discretized with millions of unknowns, MLFMA is parallelized on distributed-memory architectures using a rigorous technique, namely, the hierarchical partitioning strategy. Efficiency and accuracy of the developed implementation are demonstrated on very large problems involving as many as 100 million unknowns.
Solution of partial differential equations on vector and parallel computers

NASA Technical Reports Server (NTRS)

Ortega, J. M.; Voigt, R. G.

1985-01-01

The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.
Self-consistent determination of the spike-train power spectrum in a neural network with sparse connectivity.

PubMed

Dummer, Benjamin; Wieland, Stefan; Lindner, Benjamin

2014-01-01

A major source of random variability in cortical networks is the quasi-random arrival of presynaptic action potentials from many other cells. In network studies as well as in the study of the response properties of single cells embedded in a network, synaptic background input is often approximated by Poissonian spike trains. However, the output statistics of the cells is in most cases far from being Poisson. This is inconsistent with the assumption of similar spike-train statistics for pre- and postsynaptic cells in a recurrent network. Here we tackle this problem for the popular class of integrate-and-fire neurons and study a self-consistent statistics of input and output spectra of neural spike trains. Instead of actually using a large network, we use an iterative scheme, in which we simulate a single neuron over several generations. In each of these generations, the neuron is stimulated with surrogate stochastic input that has a similar statistics as the output of the previous generation. For the surrogate input, we employ two distinct approximations: (i) a superposition of renewal spike trains with the same interspike interval density as observed in the previous generation and (ii) a Gaussian current with a power spectrum proportional to that observed in the previous generation. For input parameters that correspond to balanced input in the network, both the renewal and the Gaussian iteration procedure converge quickly and yield comparable results for the self-consistent spike-train power spectrum. We compare our results to large-scale simulations of a random sparsely connected network of leaky integrate-and-fire neurons (Brunel, 2000) and show that in the asynchronous regime close to a state of balanced synaptic input from the network, our iterative schemes provide an excellent approximations to the autocorrelation of spike trains in the recurrent network.
Merging the Forces of Asynchronous Tutoring and Synchronous Conferencing: A Qualitative Study of Arab ESL Academic Writers Using E-Tutoring

ERIC Educational Resources Information Center

Alqadoumi, Omar Mohamed

2012-01-01

Previous studies in the field of e-tutoring dealt either with asynchronous tutoring or synchronous conferencing as modes for providing e-tutoring services to English learners. This qualitative research study reports the experiences of Arab ESL tutees with both asynchronous tutoring and synchronous conferencing. It also reports the experiences of…
Differences in Electronic Exchanges in Synchronous and Asynchronous Computer-Mediated Communication: The Effect of Culture as a Mediating Variable

ERIC Educational Resources Information Center

Angeli, Charoula; Schwartz, Neil H.

2016-01-01

Two hundred and eighty undergraduates from universities in two countries were asked to read didactic material, and then think and write about potential solutions to an ill-defined problem. The writing was conducted within a synchronous or asynchronous computer-mediated communication (CMC) environment. Asynchronous CMC took the form of email…

Localized radio frequency communication using asynchronous transfer mode protocol

DOEpatents

Witzke, Edward L [Edgewood, NM; Robertson, Perry J [Albuquerque, NM; Pierson, Lyndon G [Albuquerque, NM

2007-08-14

A localized wireless communication system for communication between a plurality of circuit boards, and between electronic components on the circuit boards. Transceivers are located on each circuit board and electronic component. The transceivers communicate with one another over spread spectrum radio frequencies. An asynchronous transfer mode protocol controls communication flow with asynchronous transfer mode switches located on the circuit boards.
Features of the Asynchronous Correlation between the China Coal Price Index and Coal Mining Accidental Deaths.

PubMed

Huang, Yuecheng; Cheng, Wuyi; Luo, Sida; Luo, Yun; Ma, Chengchen; He, Tailin

2016-01-01

The features of the asynchronous correlation between accident indices and the factors that influence accidents can provide an effective reference for warnings of coal mining accidents. However, what are the features of this correlation? To answer this question, data from the China coal price index and the number of deaths from coal mining accidents were selected as the sample data. The fluctuation modes of the asynchronous correlation between the two data sets were defined according to the asynchronous correlation coefficients, symbolization, and sliding windows. We then built several directed and weighted network models, within which the fluctuation modes and the transformations between modes were represented by nodes and edges. Then, the features of the asynchronous correlation between these two variables could be studied from a perspective of network topology. We found that the correlation between the price index and the accidental deaths was asynchronous and fluctuating. Certain aspects, such as the key fluctuation modes, the subgroups characteristics, the transmission medium, the periodicity and transmission path length in the network, were analyzed by using complex network theory, analytical methods and spectral analysis method. These results provide a scientific reference for generating warnings for coal mining accidents based on economic indices.
Scientific and technical challenges on the road towards fusion electricity

NASA Astrophysics Data System (ADS)

Donné, A. J. H.; Federici, G.; Litaudon, X.; McDonald, D. C.

2017-10-01

The goal of the European Fusion Roadmap is to deliver fusion electricity to the grid early in the second half of this century. It breaks the quest for fusion energy into eight missions, and for each of them it describes a research and development programme to address all the open technical gaps in physics and technology and estimates the required resources. It points out the needs to intensify industrial involvement and to seek all opportunities for collaboration outside Europe. The roadmap covers three periods: the short term, which runs parallel to the European Research Framework Programme Horizon 2020, the medium term and the long term. ITER is the key facility of the roadmap as it is expected to achieve most of the important milestones on the path to fusion power. Thus, the vast majority of present resources are dedicated to ITER and its accompanying experiments. The medium term is focussed on taking ITER into operation and bringing it to full power, as well as on preparing the construction of a demonstration power plant DEMO, which will for the first time demonstrate fusion electricity to the grid around the middle of this century. Building and operating DEMO is the subject of the last roadmap phase: the long term. Clearly, the Fusion Roadmap is tightly connected to the ITER schedule. Three key milestones are the first operation of ITER, the start of the DT operation in ITER and reaching the full performance at which the thermal fusion power is 10 times the power put in to the plasma. The Engineering Design Activity of DEMO needs to start a few years after the first ITER plasma, while the start of the construction phase will be a few years after ITER reaches full performance. In this way ITER can give viable input to the design and development of DEMO. Because the neutron fluence in DEMO will be much higher than in ITER, it is important to develop and validate materials that can handle these very high neutron loads. For the testing of the materials, a dedicated 14 MeV neutron source is needed. This DEMO Oriented Neutron Source (DONES) is therefore an important facility to support the fusion roadmap.
Nonperturbative measurement of the local magnetic field using pulsed polarimetry for fusion reactor conditions (invited).

PubMed

Smith, Roger J

2008-10-01

A novel diagnostic technique for the remote and nonperturbative sensing of the local magnetic field in reactor relevant plasmas is presented. Pulsed polarimetry [Patent No. 12/150,169 (pending)] combines optical scattering with the Faraday effect. The polarimetric light detection and ranging (LIDAR)-like diagnostic has the potential to be a local B(pol) diagnostic on ITER and can achieve spatial resolutions of millimeters on high energy density (HED) plasmas using existing lasers. The pulsed polarimetry method is based on nonlocal measurements and subtle effects are introduced that are not present in either cw polarimetry or Thomson scattering LIDAR. Important features include the capability of simultaneously measuring local T(e), n(e), and B(parallel) along the line of sight, a resiliency to refractive effects, a short measurement duration providing near instantaneous data in time, and location for real-time feedback and control of magnetohydrodynamic (MHD) instabilities and the realization of a widely applicable internal magnetic field diagnostic for the magnetic fusion energy program. The technique improves for higher n(e)B(parallel) product and higher n(e) and is well suited for diagnosing the transient plasmas in the HED program. Larger devices such as ITER and DEMO are also better suited to the technique, allowing longer pulse lengths and thereby relaxing key technology constraints making pulsed polarimetry a valuable asset for next step devices. The pulsed polarimetry technique is clarified by way of illustration on the ITER tokamak and plasmas within the magnetized target fusion program within present technological means.
Optimal parallel solution of sparse triangular systems

NASA Technical Reports Server (NTRS)

Alvarado, Fernando L.; Schreiber, Robert

1990-01-01

A method for the parallel solution of triangular sets of equations is described that is appropriate when there are many right-handed sides. By preprocessing, the method can reduce the number of parallel steps required to solve Lx = b compared to parallel forward or backsolve. Applications are to iterative solvers with triangular preconditioners, to structural analysis, or to power systems applications, where there may be many right-handed sides (not all available a priori). The inverse of L is represented as a product of sparse triangular factors. The problem is to find a factored representation of this inverse of L with the smallest number of factors (or partitions), subject to the requirement that no new nonzero elements be created in the formation of these inverse factors. A method from an earlier reference is shown to solve this problem. This method is improved upon by constructing a permutation of the rows and columns of L that preserves triangularity and allow for the best possible such partition. A number of practical examples and algorithmic details are presented. The parallelism attainable is illustrated by means of elimination trees and clique trees.
Computer program MCAP-TOSS calculates steady-state fluid dynamics of coolant in parallel channels and temperature distribution in surrounding heat-generating solid

NASA Technical Reports Server (NTRS)

Lee, A. Y.

1967-01-01

Computer program calculates the steady state fluid distribution, temperature rise, and pressure drop of a coolant, the material temperature distribution of a heat generating solid, and the heat flux distributions at the fluid-solid interfaces. It performs the necessary iterations automatically within the computer, in one machine run.
On some Aitken-like acceleration of the Schwarz method

NASA Astrophysics Data System (ADS)

Garbey, M.; Tromeur-Dervout, D.

2002-12-01

In this paper we present a family of domain decomposition based on Aitken-like acceleration of the Schwarz method seen as an iterative procedure with a linear rate of convergence. We first present the so-called Aitken-Schwarz procedure for linear differential operators. The solver can be a direct solver when applied to the Helmholtz problem with five-point finite difference scheme on regular grids. We then introduce the Steffensen-Schwarz variant which is an iterative domain decomposition solver that can be applied to linear and nonlinear problems. We show that these solvers have reasonable numerical efficiency compared to classical fast solvers for the Poisson problem or multigrids for more general linear and nonlinear elliptic problems. However, the salient feature of our method is that our algorithm has high tolerance to slow network in the context of distributed parallel computing and is attractive, generally speaking, to use with computer architecture for which performance is limited by the memory bandwidth rather than the flop performance of the CPU. This is nowadays the case for most parallel. computer using the RISC processor architecture. We will illustrate this highly desirable property of our algorithm with large-scale computing experiments.
Development of iterative techniques for the solution of unsteady compressible viscous flows

NASA Technical Reports Server (NTRS)

Hixon, Duane; Sankar, L. N.

1993-01-01

During the past two decades, there has been significant progress in the field of numerical simulation of unsteady compressible viscous flows. At present, a variety of solution techniques exist such as the transonic small disturbance analyses (TSD), transonic full potential equation-based methods, unsteady Euler solvers, and unsteady Navier-Stokes solvers. These advances have been made possible by developments in three areas: (1) improved numerical algorithms; (2) automation of body-fitted grid generation schemes; and (3) advanced computer architectures with vector processing and massively parallel processing features. In this work, the GMRES scheme has been considered as a candidate for acceleration of a Newton iteration time marching scheme for unsteady 2-D and 3-D compressible viscous flow calculation; from preliminary calculations, this will provide up to a 65 percent reduction in the computer time requirements over the existing class of explicit and implicit time marching schemes. The proposed method has ben tested on structured grids, but is flexible enough for extension to unstructured grids. The described scheme has been tested only on the current generation of vector processor architecture of the Cray Y/MP class, but should be suitable for adaptation to massively parallel machines.
Optimizing transformations of stencil operations for parallel cache-based architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bassetti, F.; Davis, K.

This paper describes a new technique for optimizing serial and parallel stencil- and stencil-like operations for cache-based architectures. This technique takes advantage of the semantic knowledge implicity in stencil-like computations. The technique is implemented as a source-to-source program transformation; because of its specificity it could not be expected of a conventional compiler. Empirical results demonstrate a uniform factor of two speedup. The experiments clearly show the benefits of this technique to be a consequence, as intended, of the reduction in cache misses. The test codes are based on a 5-point stencil obtained by the discretization of the Poisson equation andmore » applied to a two-dimensional uniform grid using the Jacobi method as an iterative solver. Results are presented for a 1-D tiling for a single processor, and in parallel using 1-D data partition. For the parallel case both blocking and non-blocking communication are tested. The same scheme of experiments has bee n performed for the 2-D tiling case. However, for the parallel case the 2-D partitioning is not discussed here, so the parallel case handled for 2-D is 2-D tiling with 1-D data partitioning.« less
Streaming parallel GPU acceleration of large-scale filter-based spiking neural networks.

PubMed

Slażyński, Leszek; Bohte, Sander

2012-01-01

The arrival of graphics processing (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of magnitude, the challenge is to develop fine-grained parallel algorithms to fully exploit the particulars of GPUs. Computation in a neural network is inherently parallel and thus a natural match for GPU architectures: given inputs, the internal state for each neuron can be updated in parallel. We show that for filter-based spiking neurons, like the Spike Response Model, the additive nature of membrane potential dynamics enables additional update parallelism. This also reduces the accumulation of numerical errors when using single precision computation, the native precision of GPUs. We further show that optimizing simulation algorithms and data structures to the GPU's architecture has a large pay-off: for example, matching iterative neural updating to the memory architecture of the GPU speeds up this simulation step by a factor of three to five. With such optimizations, we can simulate in better-than-realtime plausible spiking neural networks of up to 50 000 neurons, processing over 35 million spiking events per second.
Towards the Experimental Assessment of the DQE in SPECT Scanners

NASA Astrophysics Data System (ADS)

Fountos, G. P.; Michail, C. M.

2017-11-01

The purpose of this work was to introduce the Detective Quantum Efficiency (DQE) in single photon emission computed tomography (SPECT) systems using a flood source. A Tc-99m-based flood source (Eγ = 140 keV) consisting of a radiopharmaceutical solution of dithiothreitol (DTT, 10-3 M)/Tc-99m(III)-DMSA, 40 mCi/40 ml bound to the grains of an Agfa MammoRay HDR Medical X-ray film) was prepared in laboratory. The source was placed between two PMMA blocks and images were obtained by using the brain tomographic acquisition protocol (DatScan-brain). The Modulation Transfer Function (MTF) was evaluated using the Iterative 2D algorithm. All imaging experiments were performed in a Siemens e-Cam gamma camera. The Normalized Noise Power spectra (NNPS) were obtained from the sagittal views of the source. The higher MTF values were obtained for the Flash Iterative 2D with 24 iterations and 20 subsets. The noise levels of the SPECT reconstructed images, in terms of the NNPS, were found to increase as the number of iterations increase. The behavior of the DQE was influenced by both MTF and NNPS. As the number of iterations was increased, higher MTF values were obtained, however with a parallel, increase of magnitude in image noise, as depicted from the NNPS results. DQE values, which were influenced by both MTF and NNPS, were found higher when the number of iterations results in resolution saturation. The method presented here is novel and easy to implement, requiring materials commonly found in clinical practice and can be useful in the quality control of SPECT scanners.
Development of a stereo analysis algorithm for generating topographic maps using interactive techniques of the MPP

NASA Technical Reports Server (NTRS)

Strong, James P.

1987-01-01

A local area matching algorithm was developed on the Massively Parallel Processor (MPP). It is an iterative technique that first matches coarse or low resolution areas and at each iteration performs matches of higher resolution. Results so far show that when good matches are possible in the two images, the MPP algorithm matches corresponding areas as well as a human observer. To aid in developing this algorithm, a control or shell program was developed for the MPP that allows interactive experimentation with various parameters and procedures to be used in the matching process. (This would not be possible without the high speed of the MPP). With the system, optimal techniques can be developed for different types of matching problems.
The effect of pre-existing islands on disruption mitigation in MHD simulations of DIII-D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Izzo, V. A.

Locked-modes are the most likely cause of disruptions in ITER, so large islands are expected to be common when the ITER disruption mitigation system is deployed. MHD modeling of disruption mitigation by massive gas injection is carried out for DIII-D plasmas with stationary, pre-existing islands. Results show that the magnetic topology at the q=2 surface can affect the parallel spreading of injected impurities, and that, in particular, the break-up of large 2/1 islands into smaller 4/2 islands chains can favorably affect mitigation metrics. The direct imposition of a 4/2 mode is found to have similar results to the case inmore » which the 4/2 harmonic grows spontaneously.« less
The effect of pre-existing islands on disruption mitigation in MHD simulations of DIII-D

DOE PAGES

Izzo, V. A.

2017-02-27

Locked-modes are the most likely cause of disruptions in ITER, so large islands are expected to be common when the ITER disruption mitigation system is deployed. MHD modeling of disruption mitigation by massive gas injection is carried out for DIII-D plasmas with stationary, pre-existing islands. Results show that the magnetic topology at the q=2 surface can affect the parallel spreading of injected impurities, and that, in particular, the break-up of large 2/1 islands into smaller 4/2 islands chains can favorably affect mitigation metrics. The direct imposition of a 4/2 mode is found to have similar results to the case inmore » which the 4/2 harmonic grows spontaneously.« less
Projection matrix acquisition for cone-beam computed tomography iterative reconstruction

NASA Astrophysics Data System (ADS)

Yang, Fuqiang; Zhang, Dinghua; Huang, Kuidong; Shi, Wenlong; Zhang, Caixin; Gao, Zongzhao

2017-02-01

Projection matrix is an essential and time-consuming part in computed tomography (CT) iterative reconstruction. In this article a novel calculation algorithm of three-dimensional (3D) projection matrix is proposed to quickly acquire the matrix for cone-beam CT (CBCT). The CT data needed to be reconstructed is considered as consisting of the three orthogonal sets of equally spaced and parallel planes, rather than the individual voxels. After getting the intersections the rays with the surfaces of the voxels, the coordinate points and vertex is compared to obtain the index value that the ray traversed. Without considering ray-slope to voxel, it just need comparing the position of two points. Finally, the computer simulation is used to verify the effectiveness of the algorithm.
A conjugate gradient method for solving the non-LTE line radiation transfer problem

NASA Astrophysics Data System (ADS)

Paletou, F.; Anterrieu, E.

2009-12-01

This study concerns the fast and accurate solution of the line radiation transfer problem, under non-LTE conditions. We propose and evaluate an alternative iterative scheme to the classical ALI-Jacobi method, and to the more recently proposed Gauss-Seidel and successive over-relaxation (GS/SOR) schemes. Our study is indeed based on applying a preconditioned bi-conjugate gradient method (BiCG-P). Standard tests, in 1D plane parallel geometry and in the frame of the two-level atom model with monochromatic scattering are discussed. Rates of convergence between the previously mentioned iterative schemes are compared, as are their respective timing properties. The smoothing capability of the BiCG-P method is also demonstrated.
Block-Parallel Data Analysis with DIY2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morozov, Dmitriy; Peterka, Tom

DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial,more » parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.« less
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vogiatzis, Konstantinos D.; Ma, Dongxia; Olsen, Jeppe

A new large-scale parallel multiconfigurational self-consistent field (MCSCF) implementation in the open-source NWChem computational chemistry code is presented. The generalized active space approach is used to partition large configuration interaction (CI) vectors and generate a sufficient number of batches that can be distributed to the available cores. Massively parallel CI calculations with large active spaces can be performed. The new parallel MCSCF implementation is tested for the chromium trimer and for an active space of 20 electrons in 20 orbitals, which can now routinely be performed. Unprecedented CI calculations with an active space of 22 electrons in 22 orbitals formore » the pentacene systems were performed and a single CI iteration calculation with an active space of 24 electrons in 24 orbitals for the chromium tetramer was possible. In conclusion, the chromium tetramer corresponds to a CI expansion of one trillion Slater determinants (914 058 513 424) and is the largest conventional CI calculation attempted up to date.« less
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations

DOE PAGES

Vogiatzis, Konstantinos D.; Ma, Dongxia; Olsen, Jeppe; ...

2017-11-14

A new large-scale parallel multiconfigurational self-consistent field (MCSCF) implementation in the open-source NWChem computational chemistry code is presented. The generalized active space approach is used to partition large configuration interaction (CI) vectors and generate a sufficient number of batches that can be distributed to the available cores. Massively parallel CI calculations with large active spaces can be performed. The new parallel MCSCF implementation is tested for the chromium trimer and for an active space of 20 electrons in 20 orbitals, which can now routinely be performed. Unprecedented CI calculations with an active space of 22 electrons in 22 orbitals formore » the pentacene systems were performed and a single CI iteration calculation with an active space of 24 electrons in 24 orbitals for the chromium tetramer was possible. In conclusion, the chromium tetramer corresponds to a CI expansion of one trillion Slater determinants (914 058 513 424) and is the largest conventional CI calculation attempted up to date.« less
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations

NASA Astrophysics Data System (ADS)

Vogiatzis, Konstantinos D.; Ma, Dongxia; Olsen, Jeppe; Gagliardi, Laura; de Jong, Wibe A.

2017-11-01

A new large-scale parallel multiconfigurational self-consistent field (MCSCF) implementation in the open-source NWChem computational chemistry code is presented. The generalized active space approach is used to partition large configuration interaction (CI) vectors and generate a sufficient number of batches that can be distributed to the available cores. Massively parallel CI calculations with large active spaces can be performed. The new parallel MCSCF implementation is tested for the chromium trimer and for an active space of 20 electrons in 20 orbitals, which can now routinely be performed. Unprecedented CI calculations with an active space of 22 electrons in 22 orbitals for the pentacene systems were performed and a single CI iteration calculation with an active space of 24 electrons in 24 orbitals for the chromium tetramer was possible. The chromium tetramer corresponds to a CI expansion of one trillion Slater determinants (914 058 513 424) and is the largest conventional CI calculation attempted up to date.

The Effects on Health Behavior and Health Outcomes of Internet-Based Asynchronous Communication Between Health Providers and Patients With a Chronic Condition: A Systematic Review

PubMed Central

Ros, Wynand JG; Schrijvers, Guus

2014-01-01

Background In support of professional practice, asynchronous communication between the patient and the provider is implemented separately or in combination with Internet-based self-management interventions. This interaction occurs primarily through electronic messaging or discussion boards. There is little evidence as to whether it is a useful tool for chronically ill patients to support their self-management and increase the effectiveness of interventions. Objective The aim of our study was to review the use and usability of patient-provider asynchronous communication for chronically ill patients and the effects of such communication on health behavior, health outcomes, and patient satisfaction. Methods A literature search was performed using PubMed and Embase. The quality of the articles was appraised according to the National Institute for Health and Clinical Excellence (NICE) criteria. The use and usability of the asynchronous communication was analyzed by examining the frequency of use and the number of users of the interventions with asynchronous communication, as well as of separate electronic messaging. The effectiveness of asynchronous communication was analyzed by examining effects on health behavior, health outcomes, and patient satisfaction. Results Patients’ knowledge concerning their chronic condition increased and they seemed to appreciate being able to communicate asynchronously with their providers. They not only had specific questions but also wanted to communicate about feeling ill. A decrease in visits to the physician was shown in two studies (P=.07, P=.07). Increases in self-management/self-efficacy for patients with back pain, dyspnea, and heart failure were found. Positive health outcomes were shown in 12 studies, where the clinical outcomes for diabetic patients (HbA1c level) and for asthmatic patients (forced expiratory volume [FEV]) improved. Physical symptoms improved in five studies. Five studies generated a variety of positive psychosocial outcomes. Conclusions The effect of asynchronous communication is not shown unequivocally in these studies. Patients seem to be interested in using email. Patients are willing to participate and are taking the initiative to discuss health issues with their providers. Additional testing of the effects of asynchronous communication on self-management in chronically ill patients is needed. PMID:24434570
The effects on health behavior and health outcomes of Internet-based asynchronous communication between health providers and patients with a chronic condition: a systematic review.

PubMed

de Jong, Catharina Carolina; Ros, Wynand Jg; Schrijvers, Guus

2014-01-16

In support of professional practice, asynchronous communication between the patient and the provider is implemented separately or in combination with Internet-based self-management interventions. This interaction occurs primarily through electronic messaging or discussion boards. There is little evidence as to whether it is a useful tool for chronically ill patients to support their self-management and increase the effectiveness of interventions. The aim of our study was to review the use and usability of patient-provider asynchronous communication for chronically ill patients and the effects of such communication on health behavior, health outcomes, and patient satisfaction. A literature search was performed using PubMed and Embase. The quality of the articles was appraised according to the National Institute for Health and Clinical Excellence (NICE) criteria. The use and usability of the asynchronous communication was analyzed by examining the frequency of use and the number of users of the interventions with asynchronous communication, as well as of separate electronic messaging. The effectiveness of asynchronous communication was analyzed by examining effects on health behavior, health outcomes, and patient satisfaction. Patients' knowledge concerning their chronic condition increased and they seemed to appreciate being able to communicate asynchronously with their providers. They not only had specific questions but also wanted to communicate about feeling ill. A decrease in visits to the physician was shown in two studies (P=.07, P=.07). Increases in self-management/self-efficacy for patients with back pain, dyspnea, and heart failure were found. Positive health outcomes were shown in 12 studies, where the clinical outcomes for diabetic patients (HbA1c level) and for asthmatic patients (forced expiratory volume [FEV]) improved. Physical symptoms improved in five studies. Five studies generated a variety of positive psychosocial outcomes. The effect of asynchronous communication is not shown unequivocally in these studies. Patients seem to be interested in using email. Patients are willing to participate and are taking the initiative to discuss health issues with their providers. Additional testing of the effects of asynchronous communication on self-management in chronically ill patients is needed.
S-HARP: A parallel dynamic spectral partitioner

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sohn, A.; Simon, H.

1998-01-01

Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. The authors present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP, fine grain loop level parallelism and coarse grain recursive parallelism. The parallel partitioner has been implemented in Messagemore » Passing Interface on Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64 processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15 fold speedup on 64 processors while ParaMeTiS1.0 gives a few fold speedup. Experimental results demonstrate that S-HARP is three to 10 times faster than the dynamic partitioners ParaMeTiS and Jostle on six computational meshes of size over 100,000 vertices.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Lusk, Ewing; Butler, Ralph; Pieper, Steven C.

Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generalitymore » and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.« less
A low delay transmission method of multi-channel video based on FPGA

NASA Astrophysics Data System (ADS)

Fu, Weijian; Wei, Baozhi; Li, Xiaobin; Wang, Quan; Hu, Xiaofei

2018-03-01

In order to guarantee the fluency of multi-channel video transmission in video monitoring scenarios, we designed a kind of video format conversion method based on FPGA and its DMA scheduling for video data, reduces the overall video transmission delay.In order to sace the time in the conversion process, the parallel ability of FPGA is used to video format conversion. In order to improve the direct memory access (DMA) writing transmission rate of PCIe bus, a DMA scheduling method based on asynchronous command buffer is proposed. The experimental results show that this paper designs a low delay transmission method based on FPGA, which increases the DMA writing transmission rate by 34% compared with the existing method, and then the video overall delay is reduced to 23.6ms.
Optical data communication: fundamentals and future directions

NASA Astrophysics Data System (ADS)

DeCusatis, Casimer M.

1998-12-01

An overview of optical data communications is provided, beginning with a brief history and discussion of the unique requirements that distinguish this subfield from related areas such as telecommunications. Each of the major datacom standards is then discussed, including the physical layer specification, distances and data rates, fiber and connector types, data frame structures, and network considerations. These standards can be categorized by their prevailing applications, either storage [Enterprise System Connection, Fiber Channel Connection, and Fiber Channel], coupling (Fiber Channel), or networking [Fiber Distributed Data Interface, Gigabit Ethernet, and asynchronous transfer mode/synchronous optical network]. We also present some emerging technologies and their applications, including parallel optical interconnects, plastic optical fiber, wavelength multiplexing, and free- space optical links. We conclude with some cost/performance trade-offs and predictions of future bandwidth trends.
A robust low-rate coding scheme for packet video

NASA Technical Reports Server (NTRS)

Chen, Y. C.; Sayood, Khalid; Nelson, D. J.; Arikan, E. (Editor)

1991-01-01

Due to the rapidly evolving field of image processing and networking, video information promises to be an important part of telecommunication systems. Although up to now video transmission has been transported mainly over circuit-switched networks, it is likely that packet-switched networks will dominate the communication world in the near future. Asynchronous transfer mode (ATM) techniques in broadband-ISDN can provide a flexible, independent and high performance environment for video communication. For this paper, the network simulator was used only as a channel in this simulation. Mixture blocking coding with progressive transmission (MBCPT) has been investigated for use over packet networks and has been found to provide high compression rate with good visual performance, robustness to packet loss, tractable integration with network mechanics and simplicity in parallel implementation.
Asynchronous vibration problem of centrifugal compressor

NASA Technical Reports Server (NTRS)

Fujikawa, T.; Ishiguro, N.; Ito, M.

1980-01-01

An unstable asynchronous vibration problem in a high pressure centrifugal compressor and the remedial actions against it are described. Asynchronous vibration of the compressor took place when the discharge pressure (Pd) was increased, after the rotor was already at full speed. The typical spectral data of the shaft vibration indicate that as the pressure Pd increases, pre-unstable vibration appears and becomes larger, and large unstable asynchronous vibration occurs suddenly (Pd = 5.49MPa). A computer program was used which calculated the logarithmic decrement and the damped natural frequency of the rotor bearing systems. The analysis of the log-decrement is concluded to be effective in preventing unstable vibration in both the design stage and remedial actions.
Optimization of parameters of special asynchronous electric drives

NASA Astrophysics Data System (ADS)

Karandey, V. Yu; Popov, B. K.; Popova, O. B.; Afanasyev, V. L.

2018-03-01

The article considers the solution of the problem of parameters optimization of special asynchronous electric drives. The solution of the problem will allow one to project and create special asynchronous electric drives for various industries. The created types of electric drives will have optimum mass-dimensional and power parameters. It will allow one to realize and fulfill the set characteristics of management of technological processes with optimum level of expenses of electric energy, time of completing the process or other set parameters. The received decision allows one not only to solve a certain optimizing problem, but also to construct dependences between the optimized parameters of special asynchronous electric drives, for example, with the change of power, current in a winding of the stator or rotor, induction in a gap or steel of magnetic conductors and other parameters. On the constructed dependences, it is possible to choose necessary optimum values of parameters of special asynchronous electric drives and their components without carrying out repeated calculations.
Asynchronous Communication of TLNS3DMB Boundary Exchange

NASA Technical Reports Server (NTRS)

Hammond, Dana P.

1997-01-01

This paper describes the recognition of implicit serialization due to coarse-grain, synchronous communication and demonstrates the conversion to asynchronous communication for the exchange of boundary condition information in the Thin-Layer Navier Stokes 3-Dimensional Multi Block (TLNS3DMB) code. The implementation details of using asynchronous communication is provided including buffer allocation, message identification, and barrier control. The IBM SP2 was used for the tests presented.
[Cost-effectiveness of Synchronous vs. Asynchronous Telepsychiatry in Prison Inmates With Depression].

PubMed

Barrera-Valencia, Camilo; Benito-Devia, Alexis Vladimir; Vélez-Álvarez, Consuelo; Figueroa-Barrera, Mario; Franco-Idárraga, Sandra Milena

Telepsychiatry is defined as the use of information and communication technology (ICT) in providing remote psychiatric services. Telepsychiatry is applied using two types of communication: synchronous (real time) and asynchronous (store and forward). To determine the cost-effectiveness of a synchronous and an asynchronous telepsychiatric model in prison inmate patients with symptoms of depression. A cost-effectiveness study was performed on a population consisting of 157 patients from the Establecimiento Penitenciario y Carcelario de Mediana Seguridad de Manizales, Colombia. The sample was determined by applying Zung self-administered surveys for depression (1965) and the Hamilton Depression Rating Scale (HDRS), the latter being the tool used for the comparison. Initial Hamilton score, arrival time, duration of system downtime, and clinical effectiveness variables had normal distributions (P>.05). There were significant differences (P<.001) between care costs for the different models, showing that the mean cost of the asynchronous model is less than synchronous model, and making the asynchronous model more cost-effective. The asynchronous model is the most cost-effective model of telepsychiatry care for patients with depression admitted to a detention centre, according to the results of clinical effectiveness, cost measurement, and patient satisfaction. Copyright © 2016 Asociación Colombiana de Psiquiatría. Publicado por Elsevier España. All rights reserved.
Features of the Asynchronous Correlation between the China Coal Price Index and Coal Mining Accidental Deaths

PubMed Central

Huang, Yuecheng; Cheng, Wuyi; Luo, Sida; Luo, Yun; Ma, Chengchen; He, Tailin

2016-01-01

The features of the asynchronous correlation between accident indices and the factors that influence accidents can provide an effective reference for warnings of coal mining accidents. However, what are the features of this correlation? To answer this question, data from the China coal price index and the number of deaths from coal mining accidents were selected as the sample data. The fluctuation modes of the asynchronous correlation between the two data sets were defined according to the asynchronous correlation coefficients, symbolization, and sliding windows. We then built several directed and weighted network models, within which the fluctuation modes and the transformations between modes were represented by nodes and edges. Then, the features of the asynchronous correlation between these two variables could be studied from a perspective of network topology. We found that the correlation between the price index and the accidental deaths was asynchronous and fluctuating. Certain aspects, such as the key fluctuation modes, the subgroups characteristics, the transmission medium, the periodicity and transmission path length in the network, were analyzed by using complex network theory, analytical methods and spectral analysis method. These results provide a scientific reference for generating warnings for coal mining accidents based on economic indices. PMID:27902748
Efficacy of an asynchronous electronic curriculum in emergency medicine education in the United States.

PubMed

Wray, Alisa; Bennett, Kathryn; Boysen-Osborn, Megan; Wiechmann, Warren; Toohey, Shannon

2017-01-01

The aim of this study was to measure the effect of an iPad-based asynchronous curriculum on emergency medicine resident performance on the in-training exam (ITE). We hypothesized that the implementation of an asynchronous curriculum (replacing 1 hour of weekly didactic time) would result in non-inferior ITE scores compared to the historical scores of residents who had participated in the traditional 5-hour weekly didactic curriculum. The study was a retrospective, non-inferiority study. conducted at the University of California, Irvine Emergency Medicine Residency Program. We compared ITE scores from 2012 and 2013, when there were 5 weekly hours of didactic content, with scores from 2014 and 2015, when 1 hour of conference was replaced with asynchro-nous content. Examination results were compared using a non-inferiority data analysis with a 10% margin of difference. Using a non-inferiority test with a 95% confidence interval, there was no difference between the 2 groups (before and after implementation of asynchronous learning), as the confidence interval for the change of the ITE was -3.5 to 2.3 points, whereas the 10% non-inferiority margin was 7.8 points. Replacing 1 hour of didactic conference with asynchronous learning showed no negative impact on resident ITE scores.
Development of iterative techniques for the solution of unsteady compressible viscous flows

NASA Technical Reports Server (NTRS)

Sankar, Lakshmi; Hixon, Duane

1993-01-01

The work done under this project was documented in detail as the Ph. D. dissertation of Dr. Duane Hixon. The objectives of the research project were evaluation of the generalized minimum residual method (GMRES) as a tool for accelerating 2-D and 3-D unsteady flows and evaluation of the suitability of the GMRES algorithm for unsteady flows, computed on parallel computer architectures.
Methods for design and evaluation of integrated hardware-software systems for concurrent computation

NASA Technical Reports Server (NTRS)

Pratt, T. W.

1985-01-01

Research activities and publications are briefly summarized. The major tasks reviewed are: (1) VAX implementation of the PISCES parallel programming environment; (2) Apollo workstation network implementation of the PISCES environment; (3) FLEX implementation of the PISCES environment; (4) sparse matrix iterative solver in PSICES Fortran; (5) image processing application of PISCES; and (6) a formal model of concurrent computation being developed.
Burning plasma regime for Fussion-Fission Research Facility

NASA Astrophysics Data System (ADS)

Zakharov, Leonid E.

2010-11-01

The basic aspects of burning plasma regimes of Fusion-Fission Research Facility (FFRF, R/a=4/1 m/m, Ipl=5 MA, Btor=4-6 T, P^DT=50-100 MW, P^fission=80-4000 MW, 1 m thick blanket), which is suggested as the next step device for Chinese fusion program, are presented. The mission of FFRF is to advance magnetic fusion to the level of a stationary neutron source and to create a technical, scientific, and technology basis for the utilization of high-energy fusion neutrons for the needs of nuclear energy and technology. FFRF will rely as much as possible on ITER design. Thus, the magnetic system, especially TFC, will take advantage of ITER experience. TFC will use the same superconductor as ITER. The plasma regimes will represent an extension of the stationary plasma regimes on HT-7 and EAST tokamaks at ASIPP. Both inductive discharges and stationary non-inductive Lower Hybrid Current Drive (LHCD) will be possible. FFRF strongly relies on new, Lithium Wall Fusion (LiWF) plasma regimes, the development of which will be done on NSTX, HT-7, EAST in parallel with the design work. This regime will eliminate a number of uncertainties, still remaining unresolved in the ITER project. Well controlled, hours long inductive current drive operation at P^DT=50-100 MW is predicted.
Functional materials for breeding blankets—status and developments

NASA Astrophysics Data System (ADS)

Konishi, S.; Enoeda, M.; Nakamichi, M.; Hoshino, T.; Ying, A.; Sharafat, S.; Smolentsev, S.

2017-09-01

The development of tritium breeder, neutron multiplier and flow channel insert materials for the breeding blanket of the DEMO reactor is reviewed. Present emphasis is on the ITER test blanket module (TBM); lithium metatitanate (Li2TiO3) and lithium orthosilicate (Li4SiO4) pebbles have been developed by leading TBM parties. Beryllium pebbles have been selected as the neutron multiplier. Good progress has been made in their fabrication; however, verification of the design by experiments is in the planning stage. Irradiation data are also limited, but the decrease in thermal conductivity of beryllium due to irradiation followed by swelling is a concern. Tests at ITER are regarded as a major milestone. For the DEMO reactor, improvement of the breeder has been attempted to obtain a higher lithium content, and Be12Ti and other beryllide intermetallic compounds that have superior chemical stability have been studied. LiPb eutectic has been considered as a DEMO blanket in the liquid breeder option and is used as a coolant to achieve a higher outlet temperature; a SiC flow channel insert is used to prevent magnetohydrodynamic pressure drop and corrosion. A significant technical gap between ITER TBM and DEMO is recognized, and the world fusion community is working on ITER TBM and DEMO blanket development in parallel.
External heating and current drive source requirements towards steady-state operation in ITER

NASA Astrophysics Data System (ADS)

Poli, F. M.; Kessel, C. E.; Bonoli, P. T.; Batchelor, D. B.; Harvey, R. W.; Snyder, P. B.

2014-07-01

Steady state scenarios envisaged for ITER aim at optimizing the bootstrap current, while maintaining sufficient confinement and stability to provide the necessary fusion yield. Non-inductive scenarios will need to operate with internal transport barriers (ITBs) in order to reach adequate fusion gain at typical currents of 9 MA. However, the large pressure gradients associated with ITBs in regions of weak or negative magnetic shear can be conducive to ideal MHD instabilities, reducing the no-wall limit. The E × B flow shear from toroidal plasma rotation is expected to be low in ITER, with a major role in the ITB dynamics being played by magnetic geometry. Combinations of heating and current drive (H/CD) sources that sustain reversed magnetic shear profiles throughout the discharge are the focus of this work. Time-dependent transport simulations indicate that a combination of electron cyclotron (EC) and lower hybrid (LH) waves is a promising route towards steady state operation in ITER. The LH forms and sustains expanded barriers and the EC deposition at mid-radius freezes the bootstrap current profile stabilizing the barrier and leading to confinement levels 50% higher than typical H-mode energy confinement times. Using LH spectra with spectrum centred on parallel refractive index of 1.75-1.85, the performance of these plasma scenarios is close to the ITER target of 9 MA non-inductive current, global confinement gain H98 = 1.6 and fusion gain Q = 5.
Asynchronous networks: modularization of dynamics theorem

NASA Astrophysics Data System (ADS)

Bick, Christian; Field, Michael

2017-02-01

Building on the first part of this paper, we develop the theory of functional asynchronous networks. We show that a large class of functional asynchronous networks can be (uniquely) represented as feedforward networks connecting events or dynamical modules. For these networks we can give a complete description of the network function in terms of the function of the events comprising the network: the modularization of dynamics theorem. We give examples to illustrate the main results.
Distributed Data-aggregation Consensus for Sensor Networks: Relaxation of Consensus Concept and Convergence Property

DTIC Science & Technology

2014-08-01

consensus algorithm called randomized gossip is more suitable [7, 8]. In asynchronous randomized gossip algorithms, pairs of neighboring nodes exchange...messages and perform updates in an asynchronous and unattended manner, and they also 1 The class of broadcast gossip algorithms [9, 10, 11, 12] are...dynamics [2] and asynchronous pairwise randomized gossip [7, 8], broadcast gossip algorithms do not require that nodes know the identities of their

Simulating fail-stop in asynchronous distributed systems

NASA Technical Reports Server (NTRS)

Sabel, Laura; Marzullo, Keith

1994-01-01

The fail-stop failure model appears frequently in the distributed systems literature. However, in an asynchronous distributed system, the fail-stop model cannot be implemented. In particular, it is impossible to reliably detect crash failures in an asynchronous system. In this paper, we show that it is possible to specify and implement a failure model that is indistinguishable from the fail-stop model from the point of view of any process within an asynchronous system. We give necessary conditions for a failure model to be indistinguishable from the fail-stop model, and derive lower bounds on the amount of process replication needed to implement such a failure model. We present a simple one-round protocol for implementing one such failure model, which we call simulated fail-stop.
Asynchronous Video Interviewing as a New Technology in Personnel Selection: The Applicant’s Point of View

PubMed Central

Brenner, Falko S.; Ortner, Tuulia M.; Fay, Doris

2016-01-01

The present study aimed to integrate findings from technology acceptance research with research on applicant reactions to new technology for the emerging selection procedure of asynchronous video interviewing. One hundred six volunteers experienced asynchronous video interviewing and filled out several questionnaires including one on the applicants’ personalities. In line with previous technology acceptance research, the data revealed that perceived usefulness and perceived ease of use predicted attitudes toward asynchronous video interviewing. Furthermore, openness revealed to moderate the relation between perceived usefulness and attitudes toward this particular selection technology. No significant effects emerged for computer self-efficacy, job interview self-efficacy, extraversion, neuroticism, and conscientiousness. Theoretical and practical implications are discussed. PMID:27378969
Scalable load balancing for massively parallel distributed Monte Carlo particle transport

DOE Office of Scientific and Technical Information (OSTI.GOV)

O'Brien, M. J.; Brantley, P. S.; Joy, K. I.

2013-07-01

In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time 0(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrencemore » Livermore National Laboratory. (authors)« less
Krylov subspace methods on supercomputers

NASA Technical Reports Server (NTRS)

Saad, Youcef

1988-01-01

A short survey of recent research on Krylov subspace methods with emphasis on implementation on vector and parallel computers is presented. Conjugate gradient methods have proven very useful on traditional scalar computers, and their popularity is likely to increase as three-dimensional models gain importance. A conservative approach to derive effective iterative techniques for supercomputers has been to find efficient parallel/vector implementations of the standard algorithms. The main source of difficulty in the incomplete factorization preconditionings is in the solution of the triangular systems at each step. A few approaches consisting of implementing efficient forward and backward triangular solutions are described in detail. Polynomial preconditioning as an alternative to standard incomplete factorization techniques is also discussed. Another efficient approach is to reorder the equations so as to improve the structure of the matrix to achieve better parallelism or vectorization. An overview of these and other ideas and their effectiveness or potential for different types of architectures is given.
Sequential and parallel image restoration: neural network implementations.

PubMed

Figueiredo, M T; Leitao, J N

1994-01-01

Sequential and parallel image restoration algorithms and their implementations on neural networks are proposed. For images degraded by linear blur and contaminated by additive white Gaussian noise, maximum a posteriori (MAP) estimation and regularization theory lead to the same high dimension convex optimization problem. The commonly adopted strategy (in using neural networks for image restoration) is to map the objective function of the optimization problem into the energy of a predefined network, taking advantage of its energy minimization properties. Departing from this approach, we propose neural implementations of iterative minimization algorithms which are first proved to converge. The developed schemes are based on modified Hopfield (1985) networks of graded elements, with both sequential and parallel updating schedules. An algorithm supported on a fully standard Hopfield network (binary elements and zero autoconnections) is also considered. Robustness with respect to finite numerical precision is studied, and examples with real images are presented.
Reducing neural network training time with parallel processing

NASA Technical Reports Server (NTRS)

Rogers, James L., Jr.; Lamarsh, William J., II

1995-01-01

Obtaining optimal solutions for engineering design problems is often expensive because the process typically requires numerous iterations involving analysis and optimization programs. Previous research has shown that a near optimum solution can be obtained in less time by simulating a slow, expensive analysis with a fast, inexpensive neural network. A new approach has been developed to further reduce this time. This approach decomposes a large neural network into many smaller neural networks that can be trained in parallel. Guidelines are developed to avoid some of the pitfalls when training smaller neural networks in parallel. These guidelines allow the engineer: to determine the number of nodes on the hidden layer of the smaller neural networks; to choose the initial training weights; and to select a network configuration that will capture the interactions among the smaller neural networks. This paper presents results describing how these guidelines are developed.
[Development and evaluation of the medical imaging distribution system with dynamic web application and clustering technology].

PubMed

Yokohama, Noriya; Tsuchimoto, Tadashi; Oishi, Masamichi; Itou, Katsuya

2007-01-20

It has been noted that the downtime of medical informatics systems is often long. Many systems encounter downtimes of hours or even days, which can have a critical effect on daily operations. Such systems remain especially weak in the areas of database and medical imaging data. The scheme design shows the three-layer architecture of the system: application, database, and storage layers. The application layer uses the DICOM protocol (Digital Imaging and Communication in Medicine) and HTTP (Hyper Text Transport Protocol) with AJAX (Asynchronous JavaScript+XML). The database is designed to decentralize in parallel using cluster technology. Consequently, restoration of the database can be done not only with ease but also with improved retrieval speed. In the storage layer, a network RAID (Redundant Array of Independent Disks) system, it is possible to construct exabyte-scale parallel file systems that exploit storage spread. Development and evaluation of the test-bed has been successful in medical information data backup and recovery in a network environment. This paper presents a schematic design of the new medical informatics system that can be accommodated from a recovery and the dynamic Web application for medical imaging distribution using AJAX.
Modified Petri net model sensitivity to workload manipulations

NASA Technical Reports Server (NTRS)

White, S. A.; Mackinnon, D. P.; Lyman, J.

1986-01-01

Modified Petri Nets (MPNs) are investigated as a workload modeling tool. The results of an exploratory study of the sensitivity of MPNs to work load manipulations in a dual task are described. Petri nets have been used to represent systems with asynchronous, concurrent and parallel activities (Peterson, 1981). These characteristics led some researchers to suggest the use of Petri nets in workload modeling where concurrent and parallel activities are common. Petri nets are represented by places and transitions. In the workload application, places represent operator activities and transitions represent events. MPNs have been used to formally represent task events and activities of a human operator in a man-machine system. Some descriptive applications demonstrate the usefulness of MPNs in the formal representation of systems. It is the general hypothesis herein that in addition to descriptive applications, MPNs may be useful for workload estimation and prediction. The results are reported of the first of a series of experiments designed to develop and test a MPN system of workload estimation and prediction. This first experiment is a screening test of MPN model general sensitivity to changes in workload. Positive results from this experiment will justify the more complicated analyses and techniques necessary for developing a workload prediction system.
String-averaging incremental subgradients for constrained convex optimization with applications to reconstruction of tomographic images

NASA Astrophysics Data System (ADS)

Massambone de Oliveira, Rafael; Salomão Helou, Elias; Fontoura Costa, Eduardo

2016-11-01

We present a method for non-smooth convex minimization which is based on subgradient directions and string-averaging techniques. In this approach, the set of available data is split into sequences (strings) and a given iterate is processed independently along each string, possibly in parallel, by an incremental subgradient method (ISM). The end-points of all strings are averaged to form the next iterate. The method is useful to solve sparse and large-scale non-smooth convex optimization problems, such as those arising in tomographic imaging. A convergence analysis is provided under realistic, standard conditions. Numerical tests are performed in a tomographic image reconstruction application, showing good performance for the convergence speed when measured as the decrease ratio of the objective function, in comparison to classical ISM.
Robust determination of the chemical potential in the pole expansion and selected inversion method for solving Kohn-Sham density functional theory

NASA Astrophysics Data System (ADS)

Jia, Weile; Lin, Lin

2017-10-01

Fermi operator expansion (FOE) methods are powerful alternatives to diagonalization type methods for solving Kohn-Sham density functional theory (KSDFT). One example is the pole expansion and selected inversion (PEXSI) method, which approximates the Fermi operator by rational matrix functions and reduces the computational complexity to at most quadratic scaling for solving KSDFT. Unlike diagonalization type methods, the chemical potential often cannot be directly read off from the result of a single step of evaluation of the Fermi operator. Hence multiple evaluations are needed to be sequentially performed to compute the chemical potential to ensure the correct number of electrons within a given tolerance. This hinders the performance of FOE methods in practice. In this paper, we develop an efficient and robust strategy to determine the chemical potential in the context of the PEXSI method. The main idea of the new method is not to find the exact chemical potential at each self-consistent-field (SCF) iteration but to dynamically and rigorously update the upper and lower bounds for the true chemical potential, so that the chemical potential reaches its convergence along the SCF iteration. Instead of evaluating the Fermi operator for multiple times sequentially, our method uses a two-level strategy that evaluates the Fermi operators in parallel. In the regime of full parallelization, the wall clock time of each SCF iteration is always close to the time for one single evaluation of the Fermi operator, even when the initial guess is far away from the converged solution. We demonstrate the effectiveness of the new method using examples with metallic and insulating characters, as well as results from ab initio molecular dynamics.
3D unstructured-mesh radiation transport codes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morel, J.

1997-12-31

Three unstructured-mesh radiation transport codes are currently being developed at Los Alamos National Laboratory. The first code is ATTILA, which uses an unstructured tetrahedral mesh in conjunction with standard Sn (discrete-ordinates) angular discretization, standard multigroup energy discretization, and linear-discontinuous spatial differencing. ATTILA solves the standard first-order form of the transport equation using source iteration in conjunction with diffusion-synthetic acceleration of the within-group source iterations. DANTE is designed to run primarily on workstations. The second code is DANTE, which uses a hybrid finite-element mesh consisting of arbitrary combinations of hexahedra, wedges, pyramids, and tetrahedra. DANTE solves several second-order self-adjoint forms of the transport equation including the even-parity equation, the odd-parity equation, and a new equation called the self-adjoint angular flux equation. DANTE also offers three angular discretization options:more » $$S{_}n$$ (discrete-ordinates), $$P{_}n$$ (spherical harmonics), and $$SP{_}n$$ (simplified spherical harmonics). DANTE is designed to run primarily on massively parallel message-passing machines, such as the ASCI-Blue machines at LANL and LLNL. The third code is PERICLES, which uses the same hybrid finite-element mesh as DANTE, but solves the standard first-order form of the transport equation rather than a second-order self-adjoint form. DANTE uses a standard $$S{_}n$$ discretization in angle in conjunction with trilinear-discontinuous spatial differencing, and diffusion-synthetic acceleration of the within-group source iterations. PERICLES was initially designed to run on workstations, but a version for massively parallel message-passing machines will be built. The three codes will be described in detail and computational results will be presented.« less
Robust determination of the chemical potential in the pole expansion and selected inversion method for solving Kohn-Sham density functional theory.

PubMed

Jia, Weile; Lin, Lin

2017-10-14

Fermi operator expansion (FOE) methods are powerful alternatives to diagonalization type methods for solving Kohn-Sham density functional theory (KSDFT). One example is the pole expansion and selected inversion (PEXSI) method, which approximates the Fermi operator by rational matrix functions and reduces the computational complexity to at most quadratic scaling for solving KSDFT. Unlike diagonalization type methods, the chemical potential often cannot be directly read off from the result of a single step of evaluation of the Fermi operator. Hence multiple evaluations are needed to be sequentially performed to compute the chemical potential to ensure the correct number of electrons within a given tolerance. This hinders the performance of FOE methods in practice. In this paper, we develop an efficient and robust strategy to determine the chemical potential in the context of the PEXSI method. The main idea of the new method is not to find the exact chemical potential at each self-consistent-field (SCF) iteration but to dynamically and rigorously update the upper and lower bounds for the true chemical potential, so that the chemical potential reaches its convergence along the SCF iteration. Instead of evaluating the Fermi operator for multiple times sequentially, our method uses a two-level strategy that evaluates the Fermi operators in parallel. In the regime of full parallelization, the wall clock time of each SCF iteration is always close to the time for one single evaluation of the Fermi operator, even when the initial guess is far away from the converged solution. We demonstrate the effectiveness of the new method using examples with metallic and insulating characters, as well as results from ab initio molecular dynamics.
A fresh look at electron cyclotron current drive power requirements for stabilization of tearing modes in ITER

DOE Office of Scientific and Technical Information (OSTI.GOV)

La Haye, R. J., E-mail: lahaye@fusion.gat.com

2015-12-10

ITER is an international project to design and build an experimental fusion reactor based on the “tokamak” concept. ITER relies upon localized electron cyclotron current drive (ECCD) at the rational safety factor q=2 to suppress or stabilize the expected poloidal mode m=2, toroidal mode n=1 neoclassical tearing mode (NTM) islands. Such islands if unmitigated degrade energy confinement, lock to the resistive wall (stop rotating), cause loss of “H-mode” and induce disruption. The International Tokamak Physics Activity (ITPA) on MHD, Disruptions and Magnetic Control joint experiment group MDC-8 on Current Drive Prevention/Stabilization of Neoclassical Tearing Modes started in 2005, after whichmore » assessments were made for the requirements for ECCD needed in ITER, particularly that of rf power and alignment on q=2 [1]. Narrow well-aligned rf current parallel to and of order of one percent of the total plasma current is needed to replace the “missing” current in the island O-points and heal or preempt (avoid destabilization by applying ECCD on q=2 in absence of the mode) the island [2-4]. This paper updates the advances in ECCD stabilization on NTMs learned in DIII-D experiments and modeling during the last 5 to 10 years as applies to stabilization by localized ECCD of tearing modes in ITER. This includes the ECCD (inside the q=1 radius) stabilization of the NTM “seeding” instability known as sawteeth (m/n=1/1) [5]. Recent measurements in DIII-D show that the ITER-similar current profile is classically unstable, curvature stabilization must not be neglected, and the small island width stabilization effect from helical ion polarization currents is stronger than was previously thought [6]. The consequences of updated assumptions in ITER modeling of the minimum well-aligned ECCD power needed are all-in-all favorable (and well-within the ITER 24 gyrotron capability) when all effects are included. However, a “wild card” may be broadening of the localized ECCD by the presence of the island; various theories predict broadening could occur and there is experimental evidence for broadening in DIII-D. Wider than now expected ECCD in ITER would make alignment easier to do but weaken the stabilization and thus require more rf power. In addition to updated modeling for ITER, advances in the ITER-relevant DIII-D ECCD gyrotron launch mirror control system hardware and real-time plasma control system have been made [7] and there are plans for application in DIII-D ITER demonstration discharges.« less
A fresh look at electron cyclotron current drive power requirements for stabilization of tearing modes in ITER

NASA Astrophysics Data System (ADS)

La Haye, R. J.

2015-12-01

ITER is an international project to design and build an experimental fusion reactor based on the "tokamak" concept. ITER relies upon localized electron cyclotron current drive (ECCD) at the rational safety factor q=2 to suppress or stabilize the expected poloidal mode m=2, toroidal mode n=1 neoclassical tearing mode (NTM) islands. Such islands if unmitigated degrade energy confinement, lock to the resistive wall (stop rotating), cause loss of "H-mode" and induce disruption. The International Tokamak Physics Activity (ITPA) on MHD, Disruptions and Magnetic Control joint experiment group MDC-8 on Current Drive Prevention/Stabilization of Neoclassical Tearing Modes started in 2005, after which assessments were made for the requirements for ECCD needed in ITER, particularly that of rf power and alignment on q=2 [1]. Narrow well-aligned rf current parallel to and of order of one percent of the total plasma current is needed to replace the "missing" current in the island O-points and heal or preempt (avoid destabilization by applying ECCD on q=2 in absence of the mode) the island [2-4]. This paper updates the advances in ECCD stabilization on NTMs learned in DIII-D experiments and modeling during the last 5 to 10 years as applies to stabilization by localized ECCD of tearing modes in ITER. This includes the ECCD (inside the q=1 radius) stabilization of the NTM "seeding" instability known as sawteeth (m/n=1/1) [5]. Recent measurements in DIII-D show that the ITER-similar current profile is classically unstable, curvature stabilization must not be neglected, and the small island width stabilization effect from helical ion polarization currents is stronger than was previously thought [6]. The consequences of updated assumptions in ITER modeling of the minimum well-aligned ECCD power needed are all-in-all favorable (and well-within the ITER 24 gyrotron capability) when all effects are included. However, a "wild card" may be broadening of the localized ECCD by the presence of the island; various theories predict broadening could occur and there is experimental evidence for broadening in DIII-D. Wider than now expected ECCD in ITER would make alignment easier to do but weaken the stabilization and thus require more rf power. In addition to updated modeling for ITER, advances in the ITER-relevant DIII-D ECCD gyrotron launch mirror control system hardware and real-time plasma control system have been made [7] and there are plans for application in DIII-D ITER demonstration discharges.
Network evolution induced by asynchronous stimuli through spike-timing-dependent plasticity.

PubMed

Yuan, Wu-Jie; Zhou, Jian-Fang; Zhou, Changsong

2013-01-01

In sensory neural system, external asynchronous stimuli play an important role in perceptual learning, associative memory and map development. However, the organization of structure and dynamics of neural networks induced by external asynchronous stimuli are not well understood. Spike-timing-dependent plasticity (STDP) is a typical synaptic plasticity that has been extensively found in the sensory systems and that has received much theoretical attention. This synaptic plasticity is highly sensitive to correlations between pre- and postsynaptic firings. Thus, STDP is expected to play an important role in response to external asynchronous stimuli, which can induce segregative pre- and postsynaptic firings. In this paper, we study the impact of external asynchronous stimuli on the organization of structure and dynamics of neural networks through STDP. We construct a two-dimensional spatial neural network model with local connectivity and sparseness, and use external currents to stimulate alternately on different spatial layers. The adopted external currents imposed alternately on spatial layers can be here regarded as external asynchronous stimuli. Through extensive numerical simulations, we focus on the effects of stimulus number and inter-stimulus timing on synaptic connecting weights and the property of propagation dynamics in the resulting network structure. Interestingly, the resulting feedforward structure induced by stimulus-dependent asynchronous firings and its propagation dynamics reflect both the underlying property of STDP. The results imply a possible important role of STDP in generating feedforward structure and collective propagation activity required for experience-dependent map plasticity in developing in vivo sensory pathways and cortices. The relevance of the results to cue-triggered recall of learned temporal sequences, an important cognitive function, is briefly discussed as well. Furthermore, this finding suggests a potential application for examining STDP by measuring neural population activity in a cultured neural network.
FAST: A fully asynchronous and status-tracking pattern for geoprocessing services orchestration

NASA Astrophysics Data System (ADS)

Wu, Huayi; You, Lan; Gui, Zhipeng; Gao, Shuang; Li, Zhenqiang; Yu, Jingmin

2014-09-01

Geoprocessing service orchestration (GSO) provides a unified and flexible way to implement cross-application, long-lived, and multi-step geoprocessing service workflows by coordinating geoprocessing services collaboratively. Usually, geoprocessing services and geoprocessing service workflows are data and/or computing intensive. The intensity feature may make the execution process of a workflow time-consuming. Since it initials an execution request without blocking other interactions on the client side, an asynchronous mechanism is especially appropriate for GSO workflows. Many critical problems remain to be solved in existing asynchronous patterns for GSO including difficulties in improving performance, status tracking, and clarifying the workflow structure. These problems are a challenge when orchestrating performance efficiency, making statuses instantly available, and constructing clearly structured GSO workflows. A Fully Asynchronous and Status-Tracking (FAST) pattern that adopts asynchronous interactions throughout the whole communication tier of a workflow is proposed for GSO. The proposed FAST pattern includes a mechanism that actively pushes the latest status to clients instantly and economically. An independent proxy was designed to isolate the status tracking logic from the geoprocessing business logic, which assists the formation of a clear GSO workflow structure. A workflow was implemented in the FAST pattern to simulate the flooding process in the Poyang Lake region. Experimental results show that the proposed FAST pattern can efficiently tackle data/computing intensive geoprocessing tasks. The performance of all collaborative partners was improved due to the asynchronous mechanism throughout communication tier. A status-tracking mechanism helps users retrieve the latest running status of a GSO workflow in an efficient and instant way. The clear structure of the GSO workflow lowers the barriers for geospatial domain experts and model designers to compose asynchronous GSO workflows. Most importantly, it provides better support for locating and diagnosing potential exceptions.
Asynchronous collision integrators: Explicit treatment of unilateral contact with friction and nodal restraints

PubMed Central

Wolff, Sebastian; Bucher, Christian

2013-01-01

This article presents asynchronous collision integrators and a simple asynchronous method treating nodal restraints. Asynchronous discretizations allow individual time step sizes for each spatial region, improving the efficiency of explicit time stepping for finite element meshes with heterogeneous element sizes. The article first introduces asynchronous variational integration being expressed by drift and kick operators. Linear nodal restraint conditions are solved by a simple projection of the forces that is shown to be equivalent to RATTLE. Unilateral contact is solved by an asynchronous variant of decomposition contact response. Therein, velocities are modified avoiding penetrations. Although decomposition contact response is solving a large system of linear equations (being critical for the numerical efficiency of explicit time stepping schemes) and is needing special treatment regarding overconstraint and linear dependency of the contact constraints (for example from double-sided node-to-surface contact or self-contact), the asynchronous strategy handles these situations efficiently and robust. Only a single constraint involving a very small number of degrees of freedom is considered at once leading to a very efficient solution. The treatment of friction is exemplified for the Coulomb model. Special care needs the contact of nodes that are subject to restraints. Together with the aforementioned projection for restraints, a novel efficient solution scheme can be presented. The collision integrator does not influence the critical time step. Hence, the time step can be chosen independently from the underlying time-stepping scheme. The time step may be fixed or time-adaptive. New demands on global collision detection are discussed exemplified by position codes and node-to-segment integration. Numerical examples illustrate convergence and efficiency of the new contact algorithm. Copyright © 2013 The Authors. International Journal for Numerical Methods in Engineering published by John Wiley & Sons, Ltd. PMID:23970806
Within-female plasticity in sex allocation is associated with a behavioural polyphenism in house wrens.

PubMed

Bowers, E K; Thompson, C F; Sakaluk, S K

2016-03-01

Sex allocation theory assumes individual plasticity in maternal strategies, but few studies have investigated within-individual changes across environments. In house wrens, differences between nests in the degree of hatching synchrony of eggs represent a behavioural polyphenism in females, and its expression varies with seasonal changes in the environment. Between-nest differences in hatching asynchrony also create different environments for offspring, and sons are more strongly affected than daughters by sibling competition when hatching occurs asynchronously over several days. Here, we examined variation in hatching asynchrony and sex allocation, and its consequences for offspring fitness. The number and condition of fledglings declined seasonally, and the frequency of asynchronous hatching increased. In broods hatched asynchronously, sons, which are over-represented in the earlier-laid eggs, were in better condition than daughters, which are over-represented in the later-laid eggs. Nonetheless, asynchronous broods were more productive later within seasons. The proportion of sons in asynchronous broods increased seasonally, whereas there was a seasonal increase in the production of daughters by mothers hatching their eggs synchronously, which was characterized by within-female changes in offspring sex and not by sex-biased mortality. As adults, sons from asynchronous broods were in better condition and produced more broods of their own than males from synchronous broods, and both males and females from asynchronous broods had higher lifetime reproductive success than those from synchronous broods. In conclusion, hatching patterns are under maternal control, representing distinct strategies for allocating offspring within broods, and are associated with offspring sex ratios and differences in offspring reproductive success. © 2015 European Society For Evolutionary Biology. Journal of Evolutionary Biology © 2015 European Society For Evolutionary Biology.
Asynchronous P300-based brain-computer interface to control a virtual environment: initial tests on end users.

PubMed

Aloise, Fabio; Schettini, Francesca; Aricò, Pietro; Salinari, Serenella; Guger, Christoph; Rinsma, Johanna; Aiello, Marco; Mattia, Donatella; Cincotti, Febo

2011-10-01

Motor disability and/or ageing can prevent individuals from fully enjoying home facilities, thus worsening their quality of life. Advances in the field of accessible user interfaces for domotic appliances can represent a valuable way to improve the independence of these persons. An asynchronous P300-based Brain-Computer Interface (BCI) system was recently validated with the participation of healthy young volunteers for environmental control. In this study, the asynchronous P300-based BCI for the interaction with a virtual home environment was tested with the participation of potential end-users (clients of a Frisian home care organization) with limited autonomy due to ageing and/or motor disabilities. System testing revealed that the minimum number of stimulation sequences needed to achieve correct classification had a higher intra-subject variability in potential end-users with respect to what was previously observed in young controls. Here we show that the asynchronous modality performed significantly better as compared to the synchronous mode in continuously adapting its speed to the users' state. Furthermore, the asynchronous system modality confirmed its reliability in avoiding misclassifications and false positives, as previously shown in young healthy subjects. The asynchronous modality may contribute to filling the usability gap between BCI systems and traditional input devices, representing an important step towards their use in the activities of daily living.
Universal filtered multi-carrier system for asynchronous uplink transmission in optical access network

NASA Astrophysics Data System (ADS)

Kang, Soo-Min; Kim, Chang-Hun; Han, Sang-Kook

2016-02-01

In passive optical network (PON), orthogonal frequency division multiplexing (OFDM) has been studied actively due to its advantages such as high spectra efficiency (SE), dynamic resource allocation in time or frequency domain, and dispersion robustness. However, orthogonal frequency division multiple access (OFDMA)-PON requires tight synchronization among multiple access signals. If not, frequency orthogonality could not be maintained. Also its sidelobe causes inter-channel interference (ICI) to adjacent channel. To prevent ICI caused by high sidelobes, guard band (GB) is usually used which degrades SE. Thus, OFDMA-PON is not suitable for asynchronous uplink transmission in optical access network. In this paper, we propose intensity modulation/direct detection (IM/DD) based universal filtered multi-carrier (UFMC) PON for asynchronous multiple access. The UFMC uses subband filtering to subsets of subcarriers. Since it reduces sidelobe of each subband by applying subband filtering, it could achieve better performance compared to OFDM. For the experimental demonstration, different sample delay was applied to subbands to implement asynchronous transmission condition. As a result, time synchronization robustness of UFMC was verified in asynchronous multiple access system.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.