parallel block multi-level: Topics by Science.gov

Sample records for parallel block multi-level

Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

NASA Astrophysics Data System (ADS)

Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin

2016-08-01

This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more notorious at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16-cores. Multi-GPU performance is highly convenient under a weak scaling setting, reaching up to 99 % efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32 , 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers

NASA Technical Reports Server (NTRS)

Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)

1997-01-01

The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from the robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely-used NASA multi-block Computational Fluid Dynamics (CFD) packages implemented in ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages are identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory accessing. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains using up to 128 wide computational nodes. Multi-block CFD simulations of complete aircraft simulations achieve 75 percent perfect load-balanced executions using data coalescing and the two levels of parallelism. SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms where the robustness of the implementation is tested. The performance behavior on the other computer platforms with a variety of realistic problems will be included as this on-going study progresses.
Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores

NASA Astrophysics Data System (ADS)

Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei

We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
Expressing Parallelism with ROOT

NASA Astrophysics Data System (ADS)

Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.

2017-10-01

The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
Expressing Parallelism with ROOT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Piparo, D.; Tejedor, E.; Guiraud, E.

The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module inmore » Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.« less
Improving Block-level Efficiency with scsi-mq

DOE Office of Scientific and Technical Information (OSTI.GOV)

Caldwell, Blake A

2015-01-01

Current generation solid-state storage devices are exposing a new bottlenecks in the SCSI and block layers of the Linux kernel, where IO throughput is limited by lock contention, inefficient interrupt handling, and poor memory locality. To address these limitations, the Linux kernel block layer underwent a major rewrite with the blk-mq project to move from a single request queue to a multi-queue model. The Linux SCSI subsystem rework to make use of this new model, known as scsi-mq, has been merged into the Linux kernel and work is underway for dm-multipath support in the upcoming Linux 4.0 kernel. These piecesmore » were necessary to make use of the multi-queue block layer in a Lustre parallel filesystem with high availability requirements. We undertook adding support of the 3.18 kernel to Lustre with scsi-mq and dm-multipath patches to evaluate the potential of these efficiency improvements. In this paper we evaluate the block-level performance of scsi-mq with backing storage hardware representative of a HPC-targerted Lustre filesystem. Our findings show that SCSI write request latency is reduced by as much as 13.6%. Additionally, when profiling the CPU usage of our prototype Lustre filesystem, we found that CPU idle time increased by a factor of 7 with Linux 3.18 and blk-mq as compared to a standard 2.6.32 Linux kernel. Our findings demonstrate increased efficiency of the multi-queue block layer even with disk-based caching storage arrays used in existing parallel filesystems.« less
Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages

DTIC Science & Technology

2013-01-02

Compilation JVM Java Virtual Machine KB Kilobyte KDT Knowledge Discovery Toolbox LAPACK Linear Algebra Package LLVM Low-Level Virtual Machine LOC Lines...different starting points. Leo Meyerovich also helped solidify some of the ideas here in discussions during Par Lab retreats. I would also like to thank...multi-timestep computations by blocking in both time and space. 88 Implementation Output Approx DSL Type Language Language Parallelism LoC Graphite
Adaptive mesh refinement and load balancing based on multi-level block-structured Cartesian mesh

NASA Astrophysics Data System (ADS)

Misaka, Takashi; Sasaki, Daisuke; Obayashi, Shigeru

2017-11-01

We developed a framework for a distributed-memory parallel computer that enables dynamic data management for adaptive mesh refinement and load balancing. We employed simple data structure of the building cube method (BCM) where a computational domain is divided into multi-level cubic domains and each cube has the same number of grid points inside, realising a multi-level block-structured Cartesian mesh. Solution adaptive mesh refinement, which works efficiently with the help of the dynamic load balancing, was implemented by dividing cubes based on mesh refinement criteria. The framework was investigated with the Laplace equation in terms of adaptive mesh refinement, load balancing and the parallel efficiency. It was then applied to the incompressible Navier-Stokes equations to simulate a turbulent flow around a sphere. We considered wall-adaptive cube refinement where a non-dimensional wall distance y+ near the sphere is used for a criterion of mesh refinement. The result showed the load imbalance due to y+ adaptive mesh refinement was corrected by the present approach. To utilise the BCM framework more effectively, we also tested a cube-wise algorithm switching where an explicit and implicit time integration schemes are switched depending on the local Courant-Friedrichs-Lewy (CFL) condition in each cube.
Adaptation of a Multi-Block Structured Solver for Effective Use in a Hybrid CPU/GPU Massively Parallel Environment

NASA Astrophysics Data System (ADS)

Gutzwiller, David; Gontier, Mathieu; Demeulenaere, Alain

2014-11-01

Multi-Block structured solvers hold many advantages over their unstructured counterparts, such as a smaller memory footprint and efficient serial performance. Historically, multi-block structured solvers have not been easily adapted for use in a High Performance Computing (HPC) environment, and the recent trend towards hybrid GPU/CPU architectures has further complicated the situation. This paper will elaborate on developments and innovations applied to the NUMECA FINE/Turbo solver that have allowed near-linear scalability with real-world problems on over 250 hybrid GPU/GPU cluster nodes. Discussion will focus on the implementation of virtual partitioning and load balancing algorithms using a novel meta-block concept. This implementation is transparent to the user, allowing all pre- and post-processing steps to be performed using a simple, unpartitioned grid topology. Additional discussion will elaborate on developments that have improved parallel performance, including fully parallel I/O with the ADIOS API and the GPU porting of the computationally heavy CPUBooster convergence acceleration module. Head of HPC and Release Management, Numeca International.
A Performance Comparison of the Parallel Preconditioners for Iterative Methods for Large Sparse Linear Systems Arising from Partial Differential Equations on Structured Grids

NASA Astrophysics Data System (ADS)

Ma, Sangback

In this paper we compare various parallel preconditioners such as Point-SSOR (Symmetric Successive OverRelaxation), ILU(0) (Incomplete LU) in the Wavefront ordering, ILU(0) in the Multi-color ordering, Multi-Color Block SOR (Successive OverRelaxation), SPAI (SParse Approximate Inverse) and pARMS (Parallel Algebraic Recursive Multilevel Solver) for solving large sparse linear systems arising from two-dimensional PDE (Partial Differential Equation)s on structured grids. Point-SSOR is well-known, and ILU(0) is one of the most popular preconditioner, but it is inherently serial. ILU(0) in the Wavefront ordering maximizes the parallelism in the natural order, but the lengths of the wave-fronts are often nonuniform. ILU(0) in the Multi-color ordering is a simple way of achieving a parallelism of the order N, where N is the order of the matrix, but its convergence rate often deteriorates as compared to that of natural ordering. We have chosen the Multi-Color Block SOR preconditioner combined with direct sparse matrix solver, since for the Laplacian matrix the SOR method is known to have a nondeteriorating rate of convergence when used with the Multi-Color ordering. By using block version we expect to minimize the interprocessor communications. SPAI computes the sparse approximate inverse directly by least squares method. Finally, ARMS is a preconditioner recursively exploiting the concept of independent sets and pARMS is the parallel version of ARMS. Experiments were conducted for the Finite Difference and Finite Element discretizations of five two-dimensional PDEs with large meshsizes up to a million on an IBM p595 machine with distributed memory. Our matrices are real positive, i. e., their real parts of the eigenvalues are positive. We have used GMRES(m) as our outer iterative method, so that the convergence of GMRES(m) for our test matrices are mathematically guaranteed. Interprocessor communications were done using MPI (Message Passing Interface) primitives. The results show that in general ILU(0) in the Multi-Color ordering ahd ILU(0) in the Wavefront ordering outperform the other methods but for symmetric and nearly symmetric 5-point matrices Multi-Color Block SOR gives the best performance, except for a few cases with a small number of processors.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.

PubMed

Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu

2015-07-01

Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
A privacy-preserving parallel and homomorphic encryption scheme

NASA Astrophysics Data System (ADS)

Min, Zhaoe; Yang, Geng; Shi, Jingqi

2017-04-01

In order to protect data privacy whilst allowing efficient access to data in multi-nodes cloud environments, a parallel homomorphic encryption (PHE) scheme is proposed based on the additive homomorphism of the Paillier encryption algorithm. In this paper we propose a PHE algorithm, in which plaintext is divided into several blocks and blocks are encrypted with a parallel mode. Experiment results demonstrate that the encryption algorithm can reach a speed-up ratio at about 7.1 in the MapReduce environment with 16 cores and 4 nodes.
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

NASA Astrophysics Data System (ADS)

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

2014-06-01

This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 106 spatial cells and 1 × 1012 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40:74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
A taxonomy and comparison of parallel block multi-level preconditioners for the incompressible Navier-Stokes equations.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shadid, John Nicolas; Elman, Howard; Shuttleworth, Robert R.

2007-04-01

In recent years, considerable effort has been placed on developing efficient and robust solution algorithms for the incompressible Navier-Stokes equations based on preconditioned Krylov methods. These include physics-based methods, such as SIMPLE, and purely algebraic preconditioners based on the approximation of the Schur complement. All these techniques can be represented as approximate block factorization (ABF) type preconditioners. The goal is to decompose the application of the preconditioner into simplified sub-systems in which scalable multi-level type solvers can be applied. In this paper we develop a taxonomy of these ideas based on an adaptation of a generalized approximate factorization of themore » Navier-Stokes system first presented in [25]. This taxonomy illuminates the similarities and differences among these preconditioners and the central role played by efficient approximation of certain Schur complement operators. We then present a parallel computational study that examines the performance of these methods and compares them to an additive Schwarz domain decomposition (DD) algorithm. Results are presented for two and three-dimensional steady state problems for enclosed domains and inflow/outflow systems on both structured and unstructured meshes. The numerical experiments are performed using MPSalsa, a stabilized finite element code.« less
Load Balancing Strategies for Multi-Block Overset Grid Applications

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak; Lopez-Benitez, Noe; Biegel, Bryan (Technical Monitor)

2002-01-01

The multi-block overset grid method is a powerful technique for high-fidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process uses a grid system that discretizes the problem domain by using separately generated but overlapping structured grids that periodically update and exchange boundary information through interpolation. For efficient high performance computations of large-scale realistic applications using this methodology, the individual grids must be properly partitioned among the parallel processors. Overall performance, therefore, largely depends on the quality of load balancing. In this paper, we present three different load balancing strategies far overset grids and analyze their effects on the parallel efficiency of a Navier-Stokes CFD application running on an SGI Origin2000 machine.
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

NASA Technical Reports Server (NTRS)

Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele

2004-01-01

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms and discuss OpenMP implementation issues which effect the performance of multi-level parallel applications.
Software Defined Radio with Parallelized Software Architecture

NASA Technical Reports Server (NTRS)

Heckler, Greg

2013-01-01

This software implements software-defined radio procession over multi-core, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to .50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
Self-balanced modulation and magnetic rebalancing method for parallel multilevel inverters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Hui; Shi, Yanjun

A self-balanced modulation method and a closed-loop magnetic flux rebalancing control method for parallel multilevel inverters. The combination of the two methods provides for balancing of the magnetic flux of the inter-cell transformers (ICTs) of the parallel multilevel inverters without deteriorating the quality of the output voltage. In various embodiments a parallel multi-level inverter modulator is provide including a multi-channel comparator to generate a multiplexed digitized ideal waveform for a parallel multi-level inverter and a finite state machine (FSM) module coupled to the parallel multi-channel comparator, the FSM module to receive the multiplexed digitized ideal waveform and to generate amore » pulse width modulated gate-drive signal for each switching device of the parallel multi-level inverter. The system and method provides for optimization of the output voltage spectrum without influence the magnetic balancing.« less
Time-dependent density-functional theory in massively parallel computer architectures: the octopus project

NASA Astrophysics Data System (ADS)

Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.

2012-06-01

Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project.

PubMed

Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L

2012-06-13

Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

An architecture of entropy decoder, inverse quantiser and predictor for multi-standard video decoding

NASA Astrophysics Data System (ADS)

Liu, Leibo; Chen, Yingjie; Yin, Shouyi; Lei, Hao; He, Guanghui; Wei, Shaojun

2014-07-01

A VLSI architecture for entropy decoder, inverse quantiser and predictor is proposed in this article. This architecture is used for decoding video streams of three standards on a single chip, i.e. H.264/AVC, AVS (China National Audio Video coding Standard) and MPEG2. The proposed scheme is called MPMP (Macro-block-Parallel based Multilevel Pipeline), which is intended to improve the decoding performance to satisfy the real-time requirements while maintaining a reasonable area and power consumption. Several techniques, such as slice level pipeline, MB (Macro-Block) level pipeline, MB level parallel, etc., are adopted. Input and output buffers for the inverse quantiser and predictor are shared by the decoding engines for H.264, AVS and MPEG2, therefore effectively reducing the implementation overhead. Simulation shows that decoding process consumes 512, 435 and 438 clock cycles per MB in H.264, AVS and MPEG2, respectively. Owing to the proposed techniques, the video decoder can support H.264 HP (High Profile) 1920 × 1088@30fps (frame per second) streams, AVS JP (Jizhun Profile) 1920 × 1088@41fps streams and MPEG2 MP (Main Profile) 1920 × 1088@39fps streams when exploiting a 200 MHz working frequency.
Carpet: Adaptive Mesh Refinement for the Cactus Framework

NASA Astrophysics Data System (ADS)

Schnetter, Erik; Hawley, Scott; Hawke, Ian

2016-11-01

Carpet is an adaptive mesh refinement and multi-patch driver for the Cactus Framework (ascl:1102.013). Cactus is a software framework for solving time-dependent partial differential equations on block-structured grids, and Carpet acts as driver layer providing adaptive mesh refinement, multi-patch capability, as well as parallelization and efficient I/O.
Scalable software architecture for on-line multi-camera video processing

NASA Astrophysics Data System (ADS)

Camplani, Massimo; Salgado, Luis

2011-03-01

In this paper we present a scalable software architecture for on-line multi-camera video processing, that guarantees a good trade off between computational power, scalability and flexibility. The software system is modular and its main blocks are the Processing Units (PUs), and the Central Unit. The Central Unit works as a supervisor of the running PUs and each PU manages the acquisition phase and the processing phase. Furthermore, an approach to easily parallelize the desired processing application has been presented. In this paper, as case study, we apply the proposed software architecture to a multi-camera system in order to efficiently manage multiple 2D object detection modules in a real-time scenario. System performance has been evaluated under different load conditions such as number of cameras and image sizes. The results show that the software architecture scales well with the number of camera and can easily works with different image formats respecting the real time constraints. Moreover, the parallelization approach can be used in order to speed up the processing tasks with a low level of overhead.
Enhancing instruction scheduling with a block-structured ISA

DOE Office of Scientific and Technical Information (OSTI.GOV)

Melvin, S.; Patt, Y.

It is now generally recognized that not enough parallelism exists within the small basic blocks of most general purpose programs to satisfy high performance processors. Thus, a wide variety of techniques have been developed to exploit instruction level parallelism across basic block boundaries. In this paper we discuss some previous techniques along with their hardware and software requirements. Then we propose a new paradigm for an instruction set architecture (ISA): block-structuring. This new paradigm is presented, its hardware and software requirements are discussed and the results from a simulation study are presented. We show that a block-structured ISA utilizes bothmore » dynamic and compile-time mechanisms for exploiting instruction level parallelism and has significant performance advantages over a conventional ISA.« less
Software Defined Radio with Parallelized Software Architecture

NASA Technical Reports Server (NTRS)

Heckler, Greg

2013-01-01

This software implements software-defined radio procession over multicore, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approx.50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

DOE PAGES

Blazewicz, Marek; Hinder, Ian; Koppelman, David M.; ...

2013-01-01

Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization ismore » based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.« less
Electro-Optic Computing Architectures. Volume I

DTIC Science & Technology

1998-02-01

The objective of the Electro - Optic Computing Architecture (EOCA) program was to develop multi-function electro - optic interfaces and optical...interconnect units to enhance the performance of parallel processor systems and form the building blocks for future electro - optic computing architectures...Specifically, three multi-function interface modules were targeted for development - an Electro - Optic Interface (EOI), an Optical Interconnection Unit (OW
Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs

NASA Astrophysics Data System (ADS)

Wang, D.; Zhang, J.; Wei, Y.

2013-12-01

As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs are getting higher, in-situ and/or post- processing such large amount of geospatial data increasingly becomes a bottleneck in scientific inquires of Earth systems and their human impacts. Existing geospatial techniques that are based on outdated computing models (e.g., serial algorithms and disk-resident systems), as have been implemented in many commercial and open source packages, are incapable of processing large-scale geospatial data and achieve desired level of performance. In this study, we have developed a set of parallel data structures and algorithms that are capable of utilizing massively data parallel computing power available on commodity Graphics Processing Units (GPUs) for a popular geospatial technique called Zonal Statistics. Given two input datasets with one representing measurements (e.g., temperature or precipitation) and the other one represent polygonal zones (e.g., ecological or administrative zones), Zonal Statistics computes major statistics (or complete distribution histograms) of the measurements in all regions. Our technique has four steps and each step can be mapped to GPU hardware by identifying its inherent data parallelisms. First, a raster is divided into blocks and per-block histograms are derived. Second, the Minimum Bounding Boxes (MBRs) of polygons are computed and are spatially matched with raster blocks; matched polygon-block pairs are tested and blocks that are either inside or intersect with polygons are identified. Third, per-block histograms are aggregated to polygons for blocks that are completely within polygons. Finally, for blocks that intersect with polygon boundaries, all the raster cells within the blocks are examined using point-in-polygon-test and cells that are within polygons are used to update corresponding histograms. As the task becomes I/O bound after applying spatial indexing and GPU hardware acceleration, we have developed a GPU-based data compression technique by reusing our previous work on Bitplane Quadtree (or BPQ-Tree) based indexing of binary bitmaps. Results have shown that our GPU-based parallel Zonal Statistic technique on 3000+ US counties over 20+ billion NASA SRTM 30 meter resolution Digital Elevation (DEM) raster cells has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds a low-end workstation equipped with a Nvidia GTX Titan GPU using cold and hot cache, respectively; and, 60-70 seconds using a single OLCF TITAN computing node and 10-15 seconds using 8 nodes. Our experiment results clearly show the potentials of using high-end computing facilities for large-scale geospatial processing.
Application of the Geophysical Scale Multi-Block Transport Modeling System to Hydrodynamic Forcing of Dredged Material Placement Sediment Transport within the James River Estuary

NASA Astrophysics Data System (ADS)

Kim, S. C.; Hayter, E. J.; Pruhs, R.; Luong, P.; Lackey, T. C.

2016-12-01

The geophysical scale circulation of the Mid Atlantic Bight and hydrologic inputs from adjacent Chesapeake Bay watersheds and tributaries influences the hydrodynamics and transport of the James River estuary. Both barotropic and baroclinic transport govern the hydrodynamics of this partially stratified estuary. Modeling the placement of dredged sediment requires accommodating this wide spectrum of atmospheric and hydrodynamic scales. The Geophysical Scale Multi-Block (GSMB) Transport Modeling System is a collection of multiple well established and USACE approved process models. Taking advantage of the parallel computing capability of multi-block modeling, we performed one year three-dimensional modeling of hydrodynamics in supporting simulation of dredged sediment placements transport and morphology changes. Model forcing includes spatially and temporally varying meteorological conditions and hydrological inputs from the watershed. Surface heat flux estimates were derived from the National Solar Radiation Database (NSRDB). The open water boundary condition for water level was obtained from an ADCIRC model application of the U. S. East Coast. Temperature-salinity boundary conditions were obtained from the Environmental Protection Agency (EPA) Chesapeake Bay Program (CBP) long-term monitoring stations database. Simulated water levels were calibrated and verified by comparison with National Oceanic and Atmospheric Administration (NOAA) tide gage locations. A harmonic analysis of the modeled tides was performed and compared with NOAA tide prediction data. In addition, project specific circulation was verified using US Army Corps of Engineers (USACE) drogue data. Salinity and temperature transport was verified at seven CBP long term monitoring stations along the navigation channel. Simulation and analysis of model results suggest that GSMB is capable of resolving the long duration, multi-scale processes inherent to practical engineering problems such as dredged material placement stability.
Parallel SOR methods with a parabolic-diffusion acceleration technique for solving an unstructured-grid Poisson equation on 3D arbitrary geometries

NASA Astrophysics Data System (ADS)

Zapata, M. A. Uh; Van Bang, D. Pham; Nguyen, K. D.

2016-05-01

This paper presents a parallel algorithm for the finite-volume discretisation of the Poisson equation on three-dimensional arbitrary geometries. The proposed method is formulated by using a 2D horizontal block domain decomposition and interprocessor data communication techniques with message passing interface. The horizontal unstructured-grid cells are reordered according to the neighbouring relations and decomposed into blocks using a load-balanced distribution to give all processors an equal amount of elements. In this algorithm, two parallel successive over-relaxation methods are presented: a multi-colour ordering technique for unstructured grids based on distributed memory and a block method using reordering index following similar ideas of the partitioning for structured grids. In all cases, the parallel algorithms are implemented with a combination of an acceleration iterative solver. This solver is based on a parabolic-diffusion equation introduced to obtain faster solutions of the linear systems arising from the discretisation. Numerical results are given to evaluate the performances of the methods showing speedups better than linear.
SCORPIO: A Scalable Two-Phase Parallel I/O Library With Application To A Large Scale Subsurface Simulator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sreepathi, Sarat; Sripathi, Vamsi; Mills, Richard T

2013-01-01

Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level where a set of designated processes participate in the I/O process by splitting themore » I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to rest of the group. This approach resulted in over 25X speedup in HDF I/O read performance and 3X speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library, SCORPIO (SCalable block-ORiented Parallel I/O) that incorporates our optimized two-phase I/O approach. The library provides a simplified higher level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5) and implements optimized I/O access patterns that can scale on larger number of processors. Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before with the added flexibility of being applicable to a wider range of I/O intensive applications.« less
Cpu/gpu Computing for AN Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

NASA Astrophysics Data System (ADS)

Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

2016-06-01

CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
Electro-Optic Computing Architectures: Volume II. Components and System Design and Analysis

DTIC Science & Technology

1998-02-01

The objective of the Electro - Optic Computing Architecture (EOCA) program was to develop multi-function electro - optic interfaces and optical...interconnect units to enhance the performance of parallel processor systems and form the building blocks for future electro - optic computing architectures...Specifically, three multi-function interface modules were targeted for development - an Electro - Optic Interface (EOI), an Optical Interconnection Unit
Algorithms for parallel flow solvers on message passing architectures

NASA Technical Reports Server (NTRS)

Vanderwijngaart, Rob F.

1995-01-01

The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those immediately adjacent to them, then the first processor in the pipeline will receive a computational load that is less than that of subsequent processors, magnifying the pipeline slowdown effect. Extra compensation is needed for grid boundary effects, even if all grid blocks are equally sized.
Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

DOEpatents

Chatterjee, Siddhartha [Yorktown Heights, NY; Gunnels, John A [Brewster, NY

2011-11-08

A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
A multi-block adaptive solving technique based on lattice Boltzmann method

NASA Astrophysics Data System (ADS)

Zhang, Yang; Xie, Jiahua; Li, Xiaoyue; Ma, Zhenghai; Zou, Jianfeng; Zheng, Yao

2018-05-01

In this paper, a CFD parallel adaptive algorithm is self-developed by combining the multi-block Lattice Boltzmann Method (LBM) with Adaptive Mesh Refinement (AMR). The mesh refinement criterion of this algorithm is based on the density, velocity and vortices of the flow field. The refined grid boundary is obtained by extending outward half a ghost cell from the coarse grid boundary, which makes the adaptive mesh more compact and the boundary treatment more convenient. Two numerical examples of the backward step flow separation and the unsteady flow around circular cylinder demonstrate the vortex structure of the cold flow field accurately and specifically.
A Multi-Modality CMOS Sensor Array for Cell-Based Assay and Drug Screening.

PubMed

Chi, Taiyun; Park, Jong Seok; Butts, Jessica C; Hookway, Tracy A; Su, Amy; Zhu, Chengjie; Styczynski, Mark P; McDevitt, Todd C; Wang, Hua

2015-12-01

In this paper, we present a fully integrated multi-modality CMOS cellular sensor array with four sensing modalities to characterize different cell physiological responses, including extracellular voltage recording, cellular impedance mapping, optical detection with shadow imaging and bioluminescence sensing, and thermal monitoring. The sensor array consists of nine parallel pixel groups and nine corresponding signal conditioning blocks. Each pixel group comprises one temperature sensor and 16 tri-modality sensor pixels, while each tri-modality sensor pixel can be independently configured for extracellular voltage recording, cellular impedance measurement (voltage excitation/current sensing), and optical detection. This sensor array supports multi-modality cellular sensing at the pixel level, which enables holistic cell characterization and joint-modality physiological monitoring on the same cellular sample with a pixel resolution of 80 μm × 100 μm. Comprehensive biological experiments with different living cell samples demonstrate the functionality and benefit of the proposed multi-modality sensing in cell-based assay and drug screening.
LAURA Users Manual: 5.3-48528

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Chirstopher O.; Kleb, Bil

2010-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multi-physics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the FUN3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
LAURA Users Manual: 5.5-64987

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, William L.

2013-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintain ability by eliminating the requirement for problem dependent recompilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multi-physics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the Fun3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
LAURA Users Manual: 5.4-54166

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2011-01-01

This users manual provides in-depth information concerning installation and execution of Laura, version 5. Laura is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 Laura code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multi-physics coupling. As a result, Laura now shares gas-physics modules, MPI modules, and other low-level modules with the Fun3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.

Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Jin, Haoqiang; VanderWijngaart, Rob F.

2003-01-01

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in bench-marks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

This paper presents a detailed performance analysis of a multi-block overset grid compu- tational fluid dynamics app!ication on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SNIP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.
Parallel efficient rate control methods for JPEG 2000

NASA Astrophysics Data System (ADS)

Martínez-del-Amor, Miguel Á.; Bruns, Volker; Sparenberg, Heiko

2017-09-01

Since the introduction of JPEG 2000, several rate control methods have been proposed. Among them, post-compression rate-distortion optimization (PCRD-Opt) is the most widely used, and the one recommended by the standard. The approach followed by this method is to first compress the entire image split in code blocks, and subsequently, optimally truncate the set of generated bit streams according to the maximum target bit rate constraint. The literature proposes various strategies on how to estimate ahead of time where a block will get truncated in order to stop the execution prematurely and save time. However, none of them have been defined bearing in mind a parallel implementation. Today, multi-core and many-core architectures are becoming popular for JPEG 2000 codecs implementations. Therefore, in this paper, we analyze how some techniques for efficient rate control can be deployed in GPUs. In order to do that, the design of our GPU-based codec is extended, allowing stopping the process at a given point. This extension also harnesses a higher level of parallelism on the GPU, leading to up to 40% of speedup with 4K test material on a Titan X. In a second step, three selected rate control methods are adapted and implemented in our parallel encoder. A comparison is then carried out, and used to select the best candidate to be deployed in a GPU encoder, which gave an extra 40% of speedup in those situations where it was really employed.
Rapid Prediction of Unsteady Three-Dimensional Viscous Flows in Turbopump Geometries

NASA Technical Reports Server (NTRS)

Dorney, Daniel J.

1998-01-01

A program is underway to improve the efficiency of a three-dimensional Navier-Stokes code and generalize it for nozzle and turbopump geometries. Code modifications will include the implementation of parallel processing software, incorporating new physical models and generalizing the multi-block capability to allow the simultaneous simulation of nozzle and turbopump configurations. The current report contains details of code modifications, numerical results of several flow simulations and the status of the parallelization effort.
A study of the parallel algorithm for large-scale DC simulation of nonlinear systems

NASA Astrophysics Data System (ADS)

Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel

Newton-Raphson DC analysis of large-scale nonlinear circuits may be an extremely time consuming process even if sparse matrix techniques and bypassing of nonlinear models calculation are used. A slight decrease in the time required for this task may be enabled on multi-core, multithread computers if the calculation of the mathematical models for the nonlinear elements as well as the stamp management of the sparse matrix entries are managed through concurrent processes. This numerical complexity can be further reduced via the circuit decomposition and parallel solution of blocks taking as a departure point the BBD matrix structure. This block-parallel approach may give a considerable profit though it is strongly dependent on the system topology and, of course, on the processor type. This contribution presents the easy-parallelizable decomposition-based algorithm for DC simulation and provides a detailed study of its effectiveness.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

NASA Astrophysics Data System (ADS)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.

2018-01-01

We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
Reliable and Efficient Parallel Processing Algorithms and Architectures for Modern Signal Processing. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Liu, Kuojuey Ray

1990-01-01

Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered.
SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

1990-01-01

Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
A multi-satellite orbit determination problem in a parallel processing environment

NASA Technical Reports Server (NTRS)

Deakyne, M. S.; Anderle, R. J.

1988-01-01

The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance and gain experience of parallel processors with a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems and the facility used to bring a realism to the study which would highlight the problems which may not otherwise be anticipated. Secondary objectives were to gain experience of a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, to explore the granularity (coarse vs. fine grain) to discover the granularity limit above which there would be a risk of starvation where the majority of nodes would be idle or under the limit where the overhead associated with splitting the problem may require more work and communication time than is useful.
A scalable parallel black oil simulator on distributed memory parallel computers

NASA Astrophysics Data System (ADS)

Wang, Kun; Liu, Hui; Chen, Zhangxin

2015-11-01

This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Performance Enhancement Strategies for Multi-Block Overset Grid CFD Applications

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

The overset grid methodology has significantly reduced time-to-solution of highfidelity computational fluid dynamics (CFD) simulations about complex aerospace configurations. The solution process resolves the geometrical complexity of the problem domain by using separately generated but overlapping structured discretization grids that periodically exchange information through interpolation. However, high performance computations of such large-scale realistic applications must be handled efficiently on state-of-the-art parallel supercomputers. This paper analyzes the effects of various performance enhancement strategies on the parallel efficiency of an overset grid Navier-Stokes CFD application running on an SGI Origin2000 machinc. Specifically, the role of asynchronous communication, grid splitting, and grid grouping strategies are presented and discussed. Details of a sophisticated graph partitioning technique for grid grouping are also provided. Results indicate that performance depends critically on the level of latency hiding and the quality of load balancing across the processors.
Multi-stage decoding for multi-level block modulation codes

NASA Technical Reports Server (NTRS)

Lin, Shu

1991-01-01

In this paper, we investigate various types of multi-stage decoding for multi-level block modulation codes, in which the decoding of a component code at each stage can be either soft-decision or hard-decision, maximum likelihood or bounded-distance. Error performance of codes is analyzed for a memoryless additive channel based on various types of multi-stage decoding, and upper bounds on the probability of an incorrect decoding are derived. Based on our study and computation results, we find that, if component codes of a multi-level modulation code and types of decoding at various stages are chosen properly, high spectral efficiency and large coding gain can be achieved with reduced decoding complexity. In particular, we find that the difference in performance between the suboptimum multi-stage soft-decision maximum likelihood decoding of a modulation code and the single-stage optimum decoding of the overall code is very small: only a fraction of dB loss in SNR at the probability of an incorrect decoding for a block of 10(exp -6). Multi-stage decoding of multi-level modulation codes really offers a way to achieve the best of three worlds, bandwidth efficiency, coding gain, and decoding complexity.
Numerical study of supersonic combustors by multi-block grids with mismatched interfaces

NASA Technical Reports Server (NTRS)

Moon, Young J.

1990-01-01

A three dimensional, finite rate chemistry, Navier-Stokes code was extended to a multi-block code with mismatched interface for practical calculations of supersonic combustors. To ensure global conservation, a conservative algorithm was used for the treatment of mismatched interfaces. The extended code was checked against one test case, i.e., a generic supersonic combustor with transverse fuel injection, examining solution accuracy, convergence, and local mass flux error. After testing, the code was used to simulate the chemically reacting flow fields in a scramjet combustor with parallel fuel injectors (unswept and swept ramps). Computational results were compared with experimental shadowgraph and pressure measurements. Fuel-air mixing characteristics of the unswept and swept ramps were compared and investigated.
Research on Multi - Person Parallel Modeling Method Based on Integrated Model Persistent Storage

NASA Astrophysics Data System (ADS)

Qu, MingCheng; Wu, XiangHu; Tao, YongChao; Liu, Ying

2018-03-01

This paper mainly studies the multi-person parallel modeling method based on the integrated model persistence storage. The integrated model refers to a set of MDDT modeling graphics system, which can carry out multi-angle, multi-level and multi-stage description of aerospace general embedded software. Persistent storage refers to converting the data model in memory into a storage model and converting the storage model into a data model in memory, where the data model refers to the object model and the storage model is a binary stream. And multi-person parallel modeling refers to the need for multi-person collaboration, the role of separation, and even real-time remote synchronization modeling.
MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

NASA Technical Reports Server (NTRS)

Taft, James R.

1999-01-01

Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.
The design of multi-core DSP parallel model based on message passing and multi-level pipeline

NASA Astrophysics Data System (ADS)

Niu, Jingyu; Hu, Jian; He, Wenjing; Meng, Fanrong; Li, Chuanrong

2017-10-01

Currently, the design of embedded signal processing system is often based on a specific application, but this idea is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on multi-core DSP platform is designed, and it is mainly suitable for the complex algorithms which are composed of different modules. This model combines the ideas of multi-level pipeline parallelism and message passing, and summarizes the advantages of the mainstream model of multi-core DSP (the Master-Slave model and the Data Flow model), so that it has better performance. This paper uses three-dimensional image generation algorithm to validate the efficiency of the proposed model by comparing with the effectiveness of the Master-Slave and the Data Flow model.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE PAGES

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; ...

2017-09-14

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

NASA Astrophysics Data System (ADS)

Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua

2014-12-01

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn; Deng, Xiaogang; Zhang, Lilun

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations formore » high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU–GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.« less

Coarse mesh and one-cell block inversion based diffusion synthetic acceleration

NASA Astrophysics Data System (ADS)

Kim, Kang-Seog

DSA (Diffusion Synthetic Acceleration) has been developed to accelerate the SN transport iteration. We have developed solution techniques for the diffusion equations of FLBLD (Fully Lumped Bilinear Discontinuous), SCB (Simple Comer Balance) and UCB (Upstream Corner Balance) modified 4-step DSA in x-y geometry. Our first multi-level method includes a block Gauss-Seidel iteration for the discontinuous diffusion equation, uses the continuous diffusion equation derived from the asymptotic analysis, and avoids void cell calculation. We implemented this multi-level procedure and performed model problem calculations. The results showed that the FLBLD, SCB and UCB modified 4-step DSA schemes with this multi-level technique are unconditionally stable and rapidly convergent. We suggested a simplified multi-level technique for FLBLD, SCB and UCB modified 4-step DSA. This new procedure does not include iterations on the diffusion calculation or the residual calculation. Fourier analysis results showed that this new procedure was as rapidly convergent as conventional modified 4-step DSA. We developed new DSA procedures coupled with 1-CI (Cell Block Inversion) transport which can be easily parallelized. We showed that 1-CI based DSA schemes preceded by SI (Source Iteration) are efficient and rapidly convergent for LD (Linear Discontinuous) and LLD (Lumped Linear Discontinuous) in slab geometry and for BLD (Bilinear Discontinuous) and FLBLD in x-y geometry. For 1-CI based DSA without SI in slab geometry, the results showed that this procedure is very efficient and effective for all cases. We also showed that 1-CI based DSA in x-y geometry was not effective for thin mesh spacings, but is effective and rapidly convergent for intermediate and thick mesh spacings. We demonstrated that the diffusion equation discretized on a coarse mesh could be employed to accelerate the transport equation. Our results showed that coarse mesh DSA is unconditionally stable and is as rapidly convergent as fine mesh DSA in slab geometry. For x-y geometry our coarse mesh DSA is very effective for thin and intermediate mesh spacings independent of the scattering ratio, but is not effective for purely scattering problems and high aspect ratio zoning. However, if the scattering ratio is less than about 0.95, this procedure is very effective for all mesh spacing.
Multi-stage decoding for multi-level block modulation codes

NASA Technical Reports Server (NTRS)

Lin, Shu; Kasami, Tadao

1991-01-01

Various types of multistage decoding for multilevel block modulation codes, in which the decoding of a component code at each stage can be either soft decision or hard decision, maximum likelihood or bounded distance are discussed. Error performance for codes is analyzed for a memoryless additive channel based on various types of multi-stage decoding, and upper bounds on the probability of an incorrect decoding are derived. It was found that, if component codes of a multi-level modulation code and types of decoding at various stages are chosen properly, high spectral efficiency and large coding gain can be achieved with reduced decoding complexity. It was found that the difference in performance between the suboptimum multi-stage soft decision maximum likelihood decoding of a modulation code and the single stage optimum decoding of the overall code is very small, only a fraction of dB loss in SNR at the probability of an incorrect decoding for a block of 10(exp -6). Multi-stage decoding of multi-level modulation codes really offers a way to achieve the best of three worlds, bandwidth efficiency, coding gain, and decoding complexity.
Numerical grid generation in computational field simulations. Volume 1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Soni, B.K.; Thompson, J.F.; Haeuser, J.

1996-12-31

To enhance the CFS technology to its next level of applicability (i.e., to create acceptance of CFS in an integrated product and process development involving multidisciplinary optimization) the basic requirements are: rapid turn-around time, reliable and accurate simulation, affordability and appropriate linkage to other engineering disciplines. In response to this demand, there has been a considerable growth in the grid generation related research activities involving automization, parallel processing, linkage with the CAD-CAM systems, CFS with dynamic motion and moving boundaries, strategies and algorithms associated with multi-block structured, unstructured, hybrid, hexahedral, and Cartesian grids, along with its applicability to various disciplinesmore » including biomedical, semiconductor, geophysical, ocean modeling, and multidisciplinary optimization.« less
Multi-cylinder hot gas engine

DOEpatents

Corey, John A.

1985-01-01

A multi-cylinder hot gas engine having an equal angle, V-shaped engine block in which two banks of parallel, equal length, equally sized cylinders are formed together with annular regenerator/cooler units surrounding each cylinder, and wherein the pistons are connected to a single crankshaft. The hot gas engine further includes an annular heater head disposed around a central circular combustor volume having a new balanced-flow hot-working-fluid manifold assembly that provides optimum balanced flow of the working fluid through the heater head working fluid passageways which are connected between each of the cylinders and their respective associated annular regenerator units. This balanced flow provides even heater head temperatures and, therefore, maximum average working fluid temperature for best operating efficiency with the use of a single crankshaft V-shaped engine block.
Investigation of obstacle effect to improve conjugate heat transfer in backward facing step channel using fast simulation of incompressible flow

NASA Astrophysics Data System (ADS)

Nouri-Borujerdi, Ali; Moazezi, Arash

2018-01-01

The current study investigates the conjugate heat transfer characteristics for laminar flow in backward facing step channel. All of the channel walls are insulated except the lower thick wall under a constant temperature. The upper wall includes a insulated obstacle perpendicular to flow direction. The effect of obstacle height and location on the fluid flow and heat transfer are numerically explored for the Reynolds number in the range of 10 ≤ Re ≤ 300. Incompressible Navier-Stokes and thermal energy equations are solved simultaneously in fluid region by the upwind compact finite difference scheme based on flux-difference splitting in conjunction with artificial compressibility method. In the thick wall, the energy equation is obtained by Laplace equation. A multi-block approach is used to perform parallel computing to reduce the CPU time. Each block is modeled separately by sharing boundary conditions with neighbors. The developed program for modeling was written in FORTRAN language with OpenMP API. The obtained results showed that using of the multi-block parallel computing method is a simple robust scheme with high performance and high-order accurate. Moreover, the obtained results demonstrated that the increment of Reynolds number and obstacle height as well as decrement of horizontal distance between the obstacle and the step improve the heat transfer.
Design and Verification of Remote Sensing Image Data Center Storage Architecture Based on Hadoop

NASA Astrophysics Data System (ADS)

Tang, D.; Zhou, X.; Jing, Y.; Cong, W.; Li, C.

2018-04-01

The data center is a new concept of data processing and application proposed in recent years. It is a new method of processing technologies based on data, parallel computing, and compatibility with different hardware clusters. While optimizing the data storage management structure, it fully utilizes cluster resource computing nodes and improves the efficiency of data parallel application. This paper used mature Hadoop technology to build a large-scale distributed image management architecture for remote sensing imagery. Using MapReduce parallel processing technology, it called many computing nodes to process image storage blocks and pyramids in the background to improve the efficiency of image reading and application and sovled the need for concurrent multi-user high-speed access to remotely sensed data. It verified the rationality, reliability and superiority of the system design by testing the storage efficiency of different image data and multi-users and analyzing the distributed storage architecture to improve the application efficiency of remote sensing images through building an actual Hadoop service system.
CFD Analysis and Design Optimization Using Parallel Computers

NASA Technical Reports Server (NTRS)

Martinelli, Luigi; Alonso, Juan Jose; Jameson, Antony; Reuther, James

1997-01-01

A versatile and efficient multi-block method is presented for the simulation of both steady and unsteady flow, as well as aerodynamic design optimization of complete aircraft configurations. The compressible Euler and Reynolds Averaged Navier-Stokes (RANS) equations are discretized using a high resolution scheme on body-fitted structured meshes. An efficient multigrid implicit scheme is implemented for time-accurate flow calculations. Optimum aerodynamic shape design is achieved at very low cost using an adjoint formulation. The method is implemented on parallel computing systems using the MPI message passing interface standard to ensure portability. The results demonstrate that, by combining highly efficient algorithms with parallel computing, it is possible to perform detailed steady and unsteady analysis as well as automatic design for complex configurations using the present generation of parallel computers.
Bio-inspired multi-mode optic flow sensors for micro air vehicles

NASA Astrophysics Data System (ADS)

Park, Seokjun; Choi, Jaehyuk; Cho, Jihyun; Yoon, Euisik

2013-06-01

Monitoring wide-field surrounding information is essential for vision-based autonomous navigation in micro-air-vehicles (MAV). Our image-cube (iCube) module, which consists of multiple sensors that are facing different angles in 3-D space, can be applied to the wide-field of view optic flows estimation (μ-Compound eyes) and to attitude control (μ- Ocelli) in the Micro Autonomous Systems and Technology (MAST) platforms. In this paper, we report an analog/digital (A/D) mixed-mode optic-flow sensor, which generates both optic flows and normal images in different modes for μ- Compound eyes and μ-Ocelli applications. The sensor employs a time-stamp based optic flow algorithm which is modified from the conventional EMD (Elementary Motion Detector) algorithm to give an optimum partitioning of hardware blocks in analog and digital domains as well as adequate allocation of pixel-level, column-parallel, and chip-level signal processing. Temporal filtering, which may require huge hardware resources if implemented in digital domain, is remained in a pixel-level analog processing unit. The rest of the blocks, including feature detection and timestamp latching, are implemented using digital circuits in a column-parallel processing unit. Finally, time-stamp information is decoded into velocity from look-up tables, multiplications, and simple subtraction circuits in a chip-level processing unit, thus significantly reducing core digital processing power consumption. In the normal image mode, the sensor generates 8-b digital images using single slope ADCs in the column unit. In the optic flow mode, the sensor estimates 8-b 1-D optic flows from the integrated mixed-mode algorithm core and 2-D optic flows with an external timestamp processing, respectively.
Parallel processing architecture for H.264 deblocking filter on multi-core platforms

NASA Astrophysics Data System (ADS)

Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

2012-03-01

Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi core platforms such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub blocks, and pixel row level are examined in this work. The deblocking architecture consists of a basic cell called deblocking filter unit (DFU) and dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs the DFM serves the data required for the different number of DFUs, and also manages all the neighboring data required for future data processing of DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.
A Framework for Parallel Unstructured Grid Generation for Complex Aerodynamic Simulations

NASA Technical Reports Server (NTRS)

Zagaris, George; Pirzadeh, Shahyar Z.; Chrisochoides, Nikos

2009-01-01

A framework for parallel unstructured grid generation targeting both shared memory multi-processors and distributed memory architectures is presented. The two fundamental building-blocks of the framework consist of: (1) the Advancing-Partition (AP) method used for domain decomposition and (2) the Advancing Front (AF) method used for mesh generation. Starting from the surface mesh of the computational domain, the AP method is applied recursively to generate a set of sub-domains. Next, the sub-domains are meshed in parallel using the AF method. The recursive nature of domain decomposition naturally maps to a divide-and-conquer algorithm which exhibits inherent parallelism. For the parallel implementation, the Master/Worker pattern is employed to dynamically balance the varying workloads of each task on the set of available CPUs. Performance results by this approach are presented and discussed in detail as well as future work and improvements.
Cooperative storage of shared files in a parallel computing system with dynamic block size

DOEpatents

Bent, John M.; Faibish, Sorin; Grider, Gary

2015-11-10

Improved techniques are provided for parallel writing of data to a shared object in a parallel computing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallel computing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).
Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Howison, Mark; Bethel, E. Wes; Childs, Hank

2012-01-01

With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large numbers of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells.more » The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.« less
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors

NASA Astrophysics Data System (ADS)

Yi, Hongsuk

2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gosink, Luke; Wu, Kesheng; Bethel, E. Wes

2009-06-02

The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitionsmore » and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).« less
Algorithm and Application of Gcp-Independent Block Adjustment for Super Large-Scale Domestic High Resolution Optical Satellite Imagery

NASA Astrophysics Data System (ADS)

Sun, Y. S.; Zhang, L.; Xu, B.; Zhang, Y.

2018-04-01

The accurate positioning of optical satellite image without control is the precondition for remote sensing application and small/medium scale mapping in large abroad areas or with large-scale images. In this paper, aiming at the geometric features of optical satellite image, based on a widely used optimization method of constraint problem which is called Alternating Direction Method of Multipliers (ADMM) and RFM least-squares block adjustment, we propose a GCP independent block adjustment method for the large-scale domestic high resolution optical satellite image - GISIBA (GCP-Independent Satellite Imagery Block Adjustment), which is easy to parallelize and highly efficient. In this method, the virtual "average" control points are built to solve the rank defect problem and qualitative and quantitative analysis in block adjustment without control. The test results prove that the horizontal and vertical accuracy of multi-covered and multi-temporal satellite images are better than 10 m and 6 m. Meanwhile the mosaic problem of the adjacent areas in large area DOM production can be solved if the public geographic information data is introduced as horizontal and vertical constraints in the block adjustment process. Finally, through the experiments by using GF-1 and ZY-3 satellite images over several typical test areas, the reliability, accuracy and performance of our developed procedure will be presented and studied in this paper.
Revisiting Parallel Cyclic Reduction and Parallel Prefix-Based Algorithms for Block Tridiagonal System of Equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Seal, Sudip K; Perumalla, Kalyan S; Hirshman, Steven Paul

2013-01-01

Simulations that require solutions of block tridiagonal systems of equations rely on fast parallel solvers for runtime efficiency. Leading parallel solvers that are highly effective for general systems of equations, dense or sparse, are limited in scalability when applied to block tridiagonal systems. This paper presents scalability results as well as detailed analyses of two parallel solvers that exploit the special structure of block tridiagonal matrices to deliver superior performance, often by orders of magnitude. A rigorous analysis of their relative parallel runtimes is shown to reveal the existence of a critical block size that separates the parameter space spannedmore » by the number of block rows, the block size and the processor count, into distinct regions that favor one or the other of the two solvers. Dependence of this critical block size on the above parameters as well as on machine-specific constants is established. These formal insights are supported by empirical results on up to 2,048 cores of a Cray XT4 system. To the best of our knowledge, this is the highest reported scalability for parallel block tridiagonal solvers to date.« less
Parallel Monte Carlo transport modeling in the context of a time-dependent, three-dimensional multi-physics code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Procassini, R.J.

1997-12-31

The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution ofmore » particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.« less
GPU COMPUTING FOR PARTICLE TRACKING

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nishimura, Hiroshi; Song, Kai; Muriki, Krishna

2011-03-25

This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) bring massive parallel computing capabilities to numerical calculation. However, the unique architecture of GPU requires a comprehensive understanding of the hardware and programming model to be able to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculationmore » of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU which consists of 14 multi-processois (MP) with 32 cores on each MP, therefore a total of 448 cores, to host thousands ot threads dynamically. Thread is a logical execution unit of the program on GPU. In the GPU programming model, threads are grouped into a collection of blocks Within each block, multiple threads share the same code, and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.« less
LAURA Users Manual: 5.2-43231

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2009-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multiphysics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the FUN3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.
Laura Users Manual: 5.1-41601

NASA Technical Reports Server (NTRS)

Mazaheri, Alireza; Gnoffo, Peter A.; Johnston, Christopher O.; Kleb, Bil

2009-01-01

This users manual provides in-depth information concerning installation and execution of LAURA, version 5. LAURA is a structured, multi-block, computational aerothermodynamic simulation code. Version 5 represents a major refactoring of the original Fortran 77 LAURA code toward a modular structure afforded by Fortran 95. The refactoring improved usability and maintainability by eliminating the requirement for problem-dependent re-compilations, providing more intuitive distribution of functionality, and simplifying interfaces required for multiphysics coupling. As a result, LAURA now shares gas-physics modules, MPI modules, and other low-level modules with the FUN3D unstructured-grid code. In addition to internal refactoring, several new features and capabilities have been added, e.g., a GNU-standard installation process, parallel load balancing, automatic trajectory point sequencing, free-energy minimization, and coupled ablation and flowfield radiation.

Parallel block schemes for large scale least squares computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Golub, G.H.; Plemmons, R.J.; Sameh, A.

1986-04-01

Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment ofmore » the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.« less
KNBD: A Remote Kernel Block Server for Linux

NASA Technical Reports Server (NTRS)

Becker, Jeff

1999-01-01

I am developing a prototype of a Linux remote disk block server whose purpose is to serve as a lower level component of a parallel file system. Parallel file systems are an important component of high performance supercomputers and clusters. Although supercomputer vendors such as SGI and IBM have their own custom solutions, there has been a void and hence a demand for such a system on Beowulf-type PC Clusters. Recently, the Parallel Virtual File System (PVFS) project at Clemson University has begun to address this need (1). Although their system provides much of the functionality of (and indeed was inspired by) the equivalent file systems in the commercial supercomputer market, their system is all in user-space. Migrating their 10 services to the kernel could provide a performance boost, by obviating the need for expensive system calls. Thanks to Pavel Machek, the Linux kernel has provided the network block device (2) with kernels 2.1.101 and later. You can configure this block device to redirect reads and writes to a remote machine's disk. This can be used as a building block for constructing a striped file system across several nodes.
Concurrent Probabilistic Simulation of High Temperature Composite Structural Response

NASA Technical Reports Server (NTRS)

Abdi, Frank

1996-01-01

A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data, (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of Phase-2 effort provides a good basis for continuation and warrants Phase-3 government, and industry partnership.
Array-based Hierarchical Mesh Generation in Parallel

DOE PAGES

Ray, Navamita; Grindeanu, Iulian; Zhao, Xinglin; ...

2015-11-03

In this paper, we describe an array-based hierarchical mesh generation capability through uniform refinement of unstructured meshes for efficient solution of PDE's using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial mesh that can be used for a number of purposes such as multi-level methods to generating large meshes. The capability is developed under the parallel mesh framework “Mesh Oriented dAtaBase” a.k.a MOAB. We describe the underlying data structures and algorithms to generate such hierarchies and present numerical results for computational efficiency and mesh quality. Inmore » conclusion, we also present results to demonstrate the applicability of the developed capability to a multigrid finite-element solver.« less
Method of forming oriented block copolymer line patterns, block copolymer line patterns formed thereby, and their use to form patterned articles

DOEpatents

Russell, Thomas P.; Hong, Sung Woo; Lee, Doug Hyun; Park, Soojin; Xu, Ting

2015-10-13

A block copolymer film having a line pattern with a high degree of long-range order is formed by a method that includes forming a block copolymer film on a substrate surface with parallel facets, and annealing the block copolymer film to form an annealed block copolymer film having linear microdomains parallel to the substrate surface and orthogonal to the parallel facets of the substrate. The line-patterned block copolymer films are useful for the fabrication of magnetic storage media, polarizing devices, and arrays of nanowires.
Method of forming oriented block copolymer line patterns, block copolymer line patterns formed thereby, and their use to form patterned articles

DOE Office of Scientific and Technical Information (OSTI.GOV)

Russell, Thomas P.; Hong, Sung Woo; Lee, Dong Hyun

A block copolymer film having a line pattern with a high degree of long-range order is formed by a method that includes forming a block copolymer film on a substrate surface with parallel facets, and annealing the block copolymer film to form an annealed block copolymer film having linear microdomains parallel to the substrate surface and orthogonal to the parallel facets of the substrate. The line-patterned block copolymer films are useful for the fabrication of magnetic storage media, polarizing devices, and arrays of nanowires.
Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.

In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphic Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different level of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread levelmore » parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and only relies on the high memory bandwidth of the GPU. The first and second solution only support partial pivoting, the third one easily supports partial and full pivoting, making it attractive to problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).« less
A distributed parallel storage architecture and its potential application within EOSDIS

NASA Technical Reports Server (NTRS)

Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony

1994-01-01

We describe the architecture, implementation, use of a scalable, high performance, distributed-parallel data storage system developed in the ARPA funded MAGIC gigabit testbed. A collection of wide area distributed disk servers operate in parallel to provide logical block level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
Multisensory architectures for action-oriented perception

NASA Astrophysics Data System (ADS)

Alba, L.; Arena, P.; De Fiore, S.; Listán, J.; Patané, L.; Salem, A.; Scordino, G.; Webb, B.

2007-05-01

In order to solve the navigation problem of a mobile robot in an unstructured environment a versatile sensory system and efficient locomotion control algorithms are necessary. In this paper an innovative sensory system for action-oriented perception applied to a legged robot is presented. An important problem we address is how to utilize a large variety and number of sensors, while having systems that can operate in real time. Our solution is to use sensory systems that incorporate analog and parallel processing, inspired by biological systems, to reduce the required data exchange with the motor control layer. In particular, as concerns the visual system, we use the Eye-RIS v1.1 board made by Anafocus, which is based on a fully parallel mixed-signal array sensor-processor chip. The hearing sensor is inspired by the cricket hearing system and allows efficient localization of a specific sound source with a very simple analog circuit. Our robot utilizes additional sensors for touch, posture, load, distance, and heading, and thus requires customized and parallel processing for concurrent acquisition. Therefore a Field Programmable Gate Array (FPGA) based hardware was used to manage the multi-sensory acquisition and processing. This choice was made because FPGAs permit the implementation of customized digital logic blocks that can operate in parallel allowing the sensors to be driven simultaneously. With this approach the multi-sensory architecture proposed can achieve real time capabilities.
Multi-Block Parallel Navier-Stokes Simulation of Unsteady Wind Tunnel and Ground Interference Effects

DTIC Science & Technology

2001-09-01

coefficient and propulsive efficiency showed that these parameters are virtually the same for both TE conditions (cT 0 40 and η 0 21). As a conclusion...difference in the way the two codes work, they yielded virtually the same solution. This shows that, for a reasonably small time step, whether the boundary... Biblioteca Sao Jose dos Campos - SP - Brazil iab@bibl.ita.cta.br 5. Prof. Max F. Platzer Chair, Department of Aeronautics & Astronautics - Naval
Block-Parallel Data Analysis with DIY2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morozov, Dmitriy; Peterka, Tom

DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial,more » parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.« less
Malleable architecture generator for FPGA computing

NASA Astrophysics Data System (ADS)

Gokhale, Maya; Kaba, James; Marks, Aaron; Kim, Jang

1996-10-01

The malleable architecture generator (MARGE) is a tool set that translates high-level parallel C to configuration bit streams for field-programmable logic based computing systems. MARGE creates an application-specific instruction set and generates the custom hardware components required to perform exactly those computations specified by the C program. In contrast to traditional fixed-instruction processors, MARGE's dynamic instruction set creation provides for efficient use of hardware resources. MARGE processes intermediate code in which each operation is annotated by the bit lengths of the operands. Each basic block (sequence of straight line code) is mapped into a single custom instruction which contains all the operations and logic inherent in the block. A synthesis phase maps the operations comprising the instructions into register transfer level structural components and control logic which have been optimized to exploit functional parallelism and function unit reuse. As a final stage, commercial technology-specific tools are used to generate configuration bit streams for the desired target hardware. Technology- specific pre-placed, pre-routed macro blocks are utilized to implement as much of the hardware as possible. MARGE currently supports the Xilinx-based Splash-2 reconfigurable accelerator and National Semiconductor's CLAy-based parallel accelerator, MAPA. The MARGE approach has been demonstrated on systolic applications such as DNA sequence comparison.
CMS event processing multi-core efficiency status

NASA Astrophysics Data System (ADS)

Jones, C. D.; CMS Collaboration

2017-10-01

In 2015, CMS was the first LHC experiment to begin using a multi-threaded framework for doing event processing. This new framework utilizes Intel’s Thread Building Block library to manage concurrency via a task based processing model. During the 2015 LHC run period, CMS only ran reconstruction jobs using multiple threads because only those jobs were sufficiently thread efficient. Recent work now allows simulation and digitization to be thread efficient. In addition, during 2015 the multi-threaded framework could run events in parallel but could only use one thread per event. Work done in 2016 now allows multiple threads to be used while processing one event. In this presentation we will show how these recent changes have improved CMS’s overall threading and memory efficiency and we will discuss work to be done to further increase those efficiencies.
An OpenACC-Based Unified Programming Model for Multi-accelerator Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S

2015-01-01

This paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
BCYCLIC: A parallel block tridiagonal matrix cyclic solver

NASA Astrophysics Data System (ADS)

Hirshman, S. P.; Perumalla, K. S.; Lynch, V. E.; Sanchez, R.

2010-09-01

A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as well as its ability to efficiently handle arbitrary (non-powers-of-2) block row and processor numbers. Comparison with a state-of-the art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to optimally use the parallel resources on current supercomputers. Example usage of the solver in magneto-hydrodynamic (MHD), three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited.
A new implementation of full resolution SBAS-DInSAR processing chain for the effective monitoring of structures and infrastructures

NASA Astrophysics Data System (ADS)

Bonano, Manuela; Buonanno, Sabatino; Ojha, Chandrakanta; Berardino, Paolo; Lanari, Riccardo; Zeni, Giovanni; Manunta, Michele

2017-04-01

The advanced DInSAR technique referred to as Small BAseline Subset (SBAS) algorithm has already largely demonstrated its effectiveness to carry out multi-scale and multi-platform surface deformation analyses relevant to both natural and man-made hazards. Thanks to its capability to generate displacement maps and long-term deformation time series at both regional (low resolution analysis) and local (full resolution analysis) spatial scales, it allows to get more insights on the spatial and temporal patterns of localized displacements relevant to single buildings and infrastructures over extended urban areas, with a key role in supporting risk mitigation and preservation activities. The extensive application of the multi-scale SBAS-DInSAR approach in many scientific contexts has gone hand in hand with new SAR satellite mission development, characterized by different frequency bands, spatial resolution, revisit times and ground coverage. This brought to the generation of huge DInSAR data stacks to be efficiently handled, processed and archived, with a strong impact on both the data storage and the computational requirements needed for generating the full resolution SBAS-DInSAR results. Accordingly, innovative and effective solutions for the automatic processing of massive SAR data archives and for the operational management of the derived SBAS-DInSAR products need to be designed and implemented, by exploiting the high efficiency (in terms of portability, scalability and computing performances) of the new ICT methodologies. In this work, we present a novel parallel implementation of the full resolution SBAS-DInSAR processing chain, aimed at investigating localized displacements affecting single buildings and infrastructures relevant to very large urban areas, relying on different granularity level parallelization strategies. The image granularity level is applied in most steps of the SBAS-DInSAR processing chain and exploits the multiprocessor systems with distributed memory. Moreover, in some processing steps very heavy from the computational point of view, the Graphical Processing Units (GPU) are exploited for the processing of blocks working on a pixel-by-pixel basis, requiring strong modifications on some key parts of the sequential full resolution SBAS-DInSAR processing chain. GPU processing is implemented by efficiently exploiting parallel processing architectures (as CUDA) for increasing the computing performances, in terms of optimization of the available GPU memory, as well as reduction of the Input/Output operations on the GPU and of the whole processing time for specific blocks w.r.t. the corresponding sequential implementation, particularly critical in presence of huge DInSAR datasets. Moreover, to efficiently handle the massive amount of DInSAR measurements provided by the new generation SAR constellations (CSK and Sentinel-1), we perform a proper re-design strategy aimed at the robust assimilation of the full resolution SBAS-DInSAR results into the web-based Geonode platform of the Spatial Data Infrastructure, thus allowing the efficient management, analysis and integration of the interferometric results with different data sources.
Single-step reinitialization and extending algorithms for level-set based multi-phase flow simulations

NASA Astrophysics Data System (ADS)

Fu, Lin; Hu, Xiangyu Y.; Adams, Nikolaus A.

2017-12-01

We propose efficient single-step formulations for reinitialization and extending algorithms, which are critical components of level-set based interface-tracking methods. The level-set field is reinitialized with a single-step (non iterative) "forward tracing" algorithm. A minimum set of cells is defined that describes the interface, and reinitialization employs only data from these cells. Fluid states are extrapolated or extended across the interface by a single-step "backward tracing" algorithm. Both algorithms, which are motivated by analogy to ray-tracing, avoid multiple block-boundary data exchanges that are inevitable for iterative reinitialization and extending approaches within a parallel-computing environment. The single-step algorithms are combined with a multi-resolution conservative sharp-interface method and validated by a wide range of benchmark test cases. We demonstrate that the proposed reinitialization method achieves second-order accuracy in conserving the volume of each phase. The interface location is invariant to reapplication of the single-step reinitialization. Generally, we observe smaller absolute errors than for standard iterative reinitialization on the same grid. The computational efficiency is higher than for the standard and typical high-order iterative reinitialization methods. We observe a 2- to 6-times efficiency improvement over the standard method for serial execution. The proposed single-step extending algorithm, which is commonly employed for assigning data to ghost cells with ghost-fluid or conservative interface interaction methods, shows about 10-times efficiency improvement over the standard method while maintaining same accuracy. Despite their simplicity, the proposed algorithms offer an efficient and robust alternative to iterative reinitialization and extending methods for level-set based multi-phase simulations.
On decoding of multi-level MPSK modulation codes

NASA Technical Reports Server (NTRS)

Lin, Shu; Gupta, Alok Kumar

1990-01-01

The decoding problem of multi-level block modulation codes is investigated. The hardware design of soft-decision Viterbi decoder for some short length 8-PSK block modulation codes is presented. An effective way to reduce the hardware complexity of the decoder by reducing the branch metric and path metric, using a non-uniform floating-point to integer mapping scheme, is proposed and discussed. The simulation results of the design are presented. The multi-stage decoding (MSD) of multi-level modulation codes is also investigated. The cases of soft-decision and hard-decision MSD are considered and their performance are evaluated for several codes of different lengths and different minimum squared Euclidean distances. It is shown that the soft-decision MSD reduces the decoding complexity drastically and it is suboptimum. The hard-decision MSD further simplifies the decoding while still maintaining a reasonable coding gain over the uncoded system, if the component codes are chosen properly. Finally, some basic 3-level 8-PSK modulation codes using BCH codes as component codes are constructed and their coding gains are found for hard decision multistage decoding.
A mixed parallel strategy for the solution of coupled multi-scale problems at finite strains

NASA Astrophysics Data System (ADS)

Lopes, I. A. Rodrigues; Pires, F. M. Andrade; Reis, F. J. P.

2018-02-01

A mixed parallel strategy for the solution of homogenization-based multi-scale constitutive problems undergoing finite strains is proposed. The approach aims to reduce the computational time and memory requirements of non-linear coupled simulations that use finite element discretization at both scales (FE^2). In the first level of the algorithm, a non-conforming domain decomposition technique, based on the FETI method combined with a mortar discretization at the interface of macroscopic subdomains, is employed. A master-slave scheme, which distributes tasks by macroscopic element and adopts dynamic scheduling, is then used for each macroscopic subdomain composing the second level of the algorithm. This strategy allows the parallelization of FE^2 simulations in computers with either shared memory or distributed memory architectures. The proposed strategy preserves the quadratic rates of asymptotic convergence that characterize the Newton-Raphson scheme. Several examples are presented to demonstrate the robustness and efficiency of the proposed parallel strategy.
Facing competition: Neural mechanisms underlying parallel programming of antisaccades and prosaccades.

PubMed

Talanow, Tobias; Kasparbauer, Anna-Maria; Steffens, Maria; Meyhöfer, Inga; Weber, Bernd; Smyrnis, Nikolaos; Ettinger, Ulrich

2016-08-01

The antisaccade task is a prominent tool to investigate the response inhibition component of cognitive control. Recent theoretical accounts explain performance in terms of parallel programming of exogenous and endogenous saccades, linked to the horse race metaphor. Previous studies have tested the hypothesis of competing saccade signals at the behavioral level by selectively slowing the programming of endogenous or exogenous processes e.g. by manipulating the probability of antisaccades in an experimental block. To gain a better understanding of inhibitory control processes in parallel saccade programming, we analyzed task-related eye movements and blood oxygenation level dependent (BOLD) responses obtained using functional magnetic resonance imaging (fMRI) at 3T from 16 healthy participants in a mixed antisaccade and prosaccade task. The frequency of antisaccade trials was manipulated across blocks of high (75%) and low (25%) antisaccade frequency. In blocks with high antisaccade frequency, antisaccade latencies were shorter and error rates lower whilst prosaccade latencies were longer and error rates were higher. At the level of BOLD, activations in the task-related saccade network (left inferior parietal lobe, right inferior parietal sulcus, left precentral gyrus reaching into left middle frontal gyrus and inferior frontal junction) and deactivations in components of the default mode network (bilateral temporal cortex, ventromedial prefrontal cortex) compensated increased cognitive control demands. These findings illustrate context dependent mechanisms underlying the coordination of competing decision signals in volitional gaze control. Copyright © 2016 Elsevier Inc. All rights reserved.

Block-Level Added Redundancy Explicit Authentication for Parallelized Encryption and Integrity Checking of Processor-Memory Transactions

NASA Astrophysics Data System (ADS)

Elbaz, Reouven; Torres, Lionel; Sassatelli, Gilles; Guillemin, Pierre; Bardouillet, Michel; Martinez, Albert

The bus between the System on Chip (SoC) and the external memory is one of the weakest points of computer systems: an adversary can easily probe this bus in order to read private data (data confidentiality concern) or to inject data (data integrity concern). The conventional way to protect data against such attacks and to ensure data confidentiality and integrity is to implement two dedicated engines: one performing data encryption and another data authentication. This approach, while secure, prevents parallelizability of the underlying computations. In this paper, we introduce the concept of Block-Level Added Redundancy Explicit Authentication (BL-AREA) and we describe a Parallelized Encryption and Integrity Checking Engine (PE-ICE) based on this concept. BL-AREA and PE-ICE have been designed to provide an effective solution to ensure both security services while allowing for full parallelization on processor read and write operations and optimizing the hardware resources. Compared to standard encryption which ensures only confidentiality, we show that PE-ICE additionally guarantees code and data integrity for less than 4% of run-time performance overhead.
When global rule reversal meets local task switching: The neural mechanisms of coordinated behavioral adaptation to instructed multi-level demand changes.

PubMed

Shi, Yiquan; Wolfensteller, Uta; Schubert, Torsten; Ruge, Hannes

2018-02-01

Cognitive flexibility is essential to cope with changing task demands and often it is necessary to adapt to combined changes in a coordinated manner. The present fMRI study examined how the brain implements such multi-level adaptation processes. Specifically, on a "local," hierarchically lower level, switching between two tasks was required across trials while the rules of each task remained unchanged for blocks of trials. On a "global" level regarding blocks of twelve trials, the task rules could reverse or remain the same. The current task was cued at the start of each trial while the current task rules were instructed before the start of a new block. We found that partly overlapping and partly segregated neural networks play different roles when coping with the combination of global rule reversal and local task switching. The fronto-parietal control network (FPN) supported the encoding of reversed rules at the time of explicit rule instruction. The same regions subsequently supported local task switching processes during actual implementation trials, irrespective of rule reversal condition. By contrast, a cortico-striatal network (CSN) including supplementary motor area and putamen was increasingly engaged across implementation trials and more so for rule reversal than for nonreversal blocks, irrespective of task switching condition. Together, these findings suggest that the brain accomplishes the coordinated adaptation to multi-level demand changes by distributing processing resources either across time (FPN for reversed rule encoding and later for task switching) or across regions (CSN for reversed rule implementation and FPN for concurrent task switching). © 2017 Wiley Periodicals, Inc.
Multi-petascale highly efficient parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five dimensional torus network that optimally maximize the throughput of packet communications between nodes and minimize latency. The network implements collective network and a global asynchronous network that provides global barrier and notification functions. Integrated in the node design include a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and multiversioning cache that improves soft error rate at the same time andmore » supports DMA functionality allowing for parallel processing message-passing.« less
Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R

Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single onemore » of the endpoints in the geometry, an instruction for the collective operation.« less
Multilevel Concatenated Block Modulation Codes for the Frequency Non-selective Rayleigh Fading Channel

NASA Technical Reports Server (NTRS)

Lin, Shu; Rhee, Dojun

1996-01-01

This paper is concerned with construction of multilevel concatenated block modulation codes using a multi-level concatenation scheme for the frequency non-selective Rayleigh fading channel. In the construction of multilevel concatenated modulation code, block modulation codes are used as the inner codes. Various types of codes (block or convolutional, binary or nonbinary) are being considered as the outer codes. In particular, we focus on the special case for which Reed-Solomon (RS) codes are used as the outer codes. For this special case, a systematic algebraic technique for constructing q-level concatenated block modulation codes is proposed. Codes have been constructed for certain specific values of q and compared with the single-level concatenated block modulation codes using the same inner codes. A multilevel closest coset decoding scheme for these codes is proposed.
Parallelization and checkpointing of GPU applications through program transformation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Solano-Quinde, Lizandro Damian

2012-01-01

GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solvemore » the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and to develop support for application-level fault tolerance in applications using multiple GPUs. Our techniques reduce the burden of enhancing single-GPU applications to support these features. To achieve our goal, this work designs and implements a framework for enhancing a single-GPU OpenCL application through application transformation.« less
Morphing the feature-based multi-blocks of normative/healthy vertebral geometries to scoliosis vertebral geometries: development of personalized finite element models.

PubMed

Hadagali, Prasannaah; Peters, James R; Balasubramanian, Sriram

2018-03-01

Personalized Finite Element (FE) models and hexahedral elements are preferred for biomechanical investigations. Feature-based multi-block methods are used to develop anatomically accurate personalized FE models with hexahedral mesh. It is tedious to manually construct multi-blocks for large number of geometries on an individual basis to develop personalized FE models. Mesh-morphing method mitigates the aforementioned tediousness in meshing personalized geometries every time, but leads to element warping and loss of geometrical data. Such issues increase in magnitude when normative spine FE model is morphed to scoliosis-affected spinal geometry. The only way to bypass the issue of hex-mesh distortion or loss of geometry as a result of morphing is to rely on manually constructing the multi-blocks for scoliosis-affected spine geometry of each individual, which is time intensive. A method to semi-automate the construction of multi-blocks on the geometry of scoliosis vertebrae from the existing multi-blocks of normative vertebrae is demonstrated in this paper. High-quality hexahedral elements were generated on the scoliosis vertebrae from the morphed multi-blocks of normative vertebrae. Time taken was 3 months to construct the multi-blocks for normative spine and less than a day for scoliosis. Efforts taken to construct multi-blocks on personalized scoliosis spinal geometries are significantly reduced by morphing existing multi-blocks.
On the utility of the multi-level algorithm for the solution of nearly completely decomposable Markov chains

NASA Technical Reports Server (NTRS)

Leutenegger, Scott T.; Horton, Graham

1994-01-01

Recently the Multi-Level algorithm was introduced as a general purpose solver for the solution of steady state Markov chains. In this paper, we consider the performance of the Multi-Level algorithm for solving Nearly Completely Decomposable (NCD) Markov chains, for which special-purpose iteractive aggregation/disaggregation algorithms such as the Koury-McAllister-Stewart (KMS) method have been developed that can exploit the decomposability of the the Markov chain. We present experimental results indicating that the general-purpose Multi-Level algorithm is competitive, and can be significantly faster than the special-purpose KMS algorithm when Gauss-Seidel and Gaussian Elimination are used for solving the individual blocks.
Coarse-grained component concurrency in Earth system modeling: parallelizing atmospheric radiative transfer in the GFDL AM3 model using the Flexible Modeling System coupling framework

NASA Astrophysics Data System (ADS)

Balaji, V.; Benson, Rusty; Wyman, Bruce; Held, Isaac

2016-10-01

Climate models represent a large variety of processes on a variety of timescales and space scales, a canonical example of multi-physics multi-scale modeling. Current hardware trends, such as Graphical Processing Units (GPUs) and Many Integrated Core (MIC) chips, are based on, at best, marginal increases in clock speed, coupled with vast increases in concurrency, particularly at the fine grain. Multi-physics codes face particular challenges in achieving fine-grained concurrency, as different physics and dynamics components have different computational profiles, and universal solutions are hard to come by. We propose here one approach for multi-physics codes. These codes are typically structured as components interacting via software frameworks. The component structure of a typical Earth system model consists of a hierarchical and recursive tree of components, each representing a different climate process or dynamical system. This recursive structure generally encompasses a modest level of concurrency at the highest level (e.g., atmosphere and ocean on different processor sets) with serial organization underneath. We propose to extend concurrency much further by running more and more lower- and higher-level components in parallel with each other. Each component can further be parallelized on the fine grain, potentially offering a major increase in the scalability of Earth system models. We present here first results from this approach, called coarse-grained component concurrency, or CCC. Within the Geophysical Fluid Dynamics Laboratory (GFDL) Flexible Modeling System (FMS), the atmospheric radiative transfer component has been configured to run in parallel with a composite component consisting of every other atmospheric component, including the atmospheric dynamics and all other atmospheric physics components. We will explore the algorithmic challenges involved in such an approach, and present results from such simulations. Plans to achieve even greater levels of coarse-grained concurrency by extending this approach within other components, such as the ocean, will be discussed.
Toward Transparent Data Management in Multi-layer Storage Hierarchy for HPC Systems

DOE PAGES

Wadhwa, Bharti; Byna, Suren; Butt, Ali R.

2018-04-17

Upcoming exascale high performance computing (HPC) systems are expected to comprise multi-tier storage hierarchy, and thus will necessitate innovative storage and I/O mechanisms. Traditional disk and block-based interfaces and file systems face severe challenges in utilizing capabilities of storage hierarchies due to the lack of hierarchy support and semantic interfaces. Object-based and semantically-rich data abstractions for scientific data management on large scale systems offer a sustainable solution to these challenges. Such data abstractions can also simplify users involvement in data movement. Here, we take the first steps of realizing such an object abstraction and explore storage mechanisms for these objectsmore » to enhance I/O performance, especially for scientific applications. We explore how an object-based interface can facilitate next generation scalable computing systems by presenting the mapping of data I/O from two real world HPC scientific use cases: a plasma physics simulation code (VPIC) and a cosmology simulation code (HACC). Our storage model stores data objects in different physical organizations to support data movement across layers of memory/storage hierarchy. Our implementation sclaes well to 16K parallel processes, and compared to the state of the art, such as MPI-IO and HDF5, our object-based data abstractions and data placement strategy in multi-level storage hierarchy achieves up to 7 X I/O performance improvement for scientific data.« less
Toward Transparent Data Management in Multi-layer Storage Hierarchy for HPC Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wadhwa, Bharti; Byna, Suren; Butt, Ali R.

Upcoming exascale high performance computing (HPC) systems are expected to comprise multi-tier storage hierarchy, and thus will necessitate innovative storage and I/O mechanisms. Traditional disk and block-based interfaces and file systems face severe challenges in utilizing capabilities of storage hierarchies due to the lack of hierarchy support and semantic interfaces. Object-based and semantically-rich data abstractions for scientific data management on large scale systems offer a sustainable solution to these challenges. Such data abstractions can also simplify users involvement in data movement. Here, we take the first steps of realizing such an object abstraction and explore storage mechanisms for these objectsmore » to enhance I/O performance, especially for scientific applications. We explore how an object-based interface can facilitate next generation scalable computing systems by presenting the mapping of data I/O from two real world HPC scientific use cases: a plasma physics simulation code (VPIC) and a cosmology simulation code (HACC). Our storage model stores data objects in different physical organizations to support data movement across layers of memory/storage hierarchy. Our implementation sclaes well to 16K parallel processes, and compared to the state of the art, such as MPI-IO and HDF5, our object-based data abstractions and data placement strategy in multi-level storage hierarchy achieves up to 7 X I/O performance improvement for scientific data.« less
Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers

NASA Technical Reports Server (NTRS)

Su, Ernesto; Lain, Antonio; Ramaswamy, Shankar; Palermo, Daniel J.; Hodges, Eugene W., IV; Banerjee, Prithviraj

1995-01-01

The PARADIGM compiler project provides an automated means to parallelize programs, written in a serial programming model, for efficient execution on distributed-memory multicomputers. .A previous implementation of the compiler based on the PTD representation allowed symbolic array sizes, affine loop bounds and array subscripts, and variable number of processors, provided that arrays were single or multi-dimensionally block distributed. The techniques presented here extend the compiler to also accept multidimensional cyclic and block-cyclic distributions within a uniform symbolic framework. These extensions demand more sophisticated symbolic manipulation capabilities. A novel aspect of our approach is to meet this demand by interfacing PARADIGM with a powerful off-the-shelf symbolic package, Mathematica. This paper describes some of the Mathematica routines that performs various transformations, shows how they are invoked and used by the compiler to overcome the new challenges, and presents experimental results for code involving cyclic and block-cyclic arrays as evidence of the feasibility of the approach.
Optimisation of insect cell growth in deep-well blocks: development of a high-throughput insect cell expression screen.

PubMed

Bahia, Daljit; Cheung, Robert; Buchs, Mirjam; Geisse, Sabine; Hunt, Ian

2005-01-01

This report describes a method to culture insects cells in 24 deep-well blocks for the routine small-scale optimisation of baculovirus-mediated protein expression experiments. Miniaturisation of this process provides the necessary reduction in terms of resource allocation, reagents, and labour to allow extensive and rapid optimisation of expression conditions, with the concomitant reduction in lead-time before commencement of large-scale bioreactor experiments. This therefore greatly simplifies the optimisation process and allows the use of liquid handling robotics in much of the initial optimisation stages of the process, thereby greatly increasing the throughput of the laboratory. We present several examples of the use of deep-well block expression studies in the optimisation of therapeutically relevant protein targets. We also discuss how the enhanced throughput offered by this approach can be adapted to robotic handling systems and the implications this has on the capacity to conduct multi-parallel protein expression studies.
SiC Multi-Chip Power Modules as Power-System Building Blocks

NASA Technical Reports Server (NTRS)

Lostetter, Alexander; Franks, Steven

2007-01-01

The term "SiC MCPMs" (wherein "MCPM" signifies "multi-chip power module") denotes electronic power-supply modules containing multiple silicon carbide power devices and silicon-on-insulator (SOI) control integrated-circuit chips. SiC MCPMs are being developed as building blocks of advanced expandable, reconfigurable, fault-tolerant power-supply systems. Exploiting the ability of SiC semiconductor devices to operate at temperatures, breakdown voltages, and current densities significantly greater than those of conventional Si devices, the designs of SiC MCPMs and of systems comprising multiple SiC MCPMs are expected to afford a greater degree of miniaturization through stacking of modules with reduced requirements for heat sinking. Moreover, the higher-temperature capabilities of SiC MCPMs could enable operation in environments hotter than Si-based power systems can withstand. The stacked SiC MCPMs in a given system can be electrically connected in series, parallel, or a series/parallel combination to increase the overall power-handling capability of the system. In addition to power connections, the modules have communication connections. The SOI controllers in the modules communicate with each other as nodes of a decentralized control network, in which no single controller exerts overall command of the system. Control functions effected via the network include synchronization of switching of power devices and rapid reconfiguration of power connections to enable the power system to continue to supply power to a load in the event of failure of one of the modules. In addition to serving as building blocks of reliable power-supply systems, SiC MCPMs could be augmented with external control circuitry to make them perform additional power-handling functions as needed for specific applications: typical functions could include regulating voltages, storing energy, and driving motors. Because identical SiC MCPM building blocks could be utilized in a variety of ways, the cost and difficulty of designing new, highly reliable power systems would be reduced considerably. Several prototype DC-to-DC power-converter modules containing SiC power-switching devices were designed and built to demonstrate the feasibility of the SiC MCPM concept. In anticipation of a future need for operation at high temperature, the circuitry in the modules includes high-temperature inductors and capacitors. These modules were designed to be stacked to construct a system of four modules electrically connected in series and/or parallel. The packaging of the modules is designed to satisfy requirements for series and parallel interconnection among modules, high power density, high thermal efficiency, small size, and light weight. Each module includes four output power connectors two for serial and two for parallel output power connections among the modules. Each module also includes two signal connectors, electrically isolated from the power connectors, that afford four zones for signal interconnections among the SOI controllers. Finally, each module includes two input power connectors, through which it receives power from an in-line power bus. This design feature is included in anticipation of a custom-designed power bus incorporating sockets compatible with snap-on type connectors to enable rapid replacement of failed modules.
Exploiting Vector and Multicore Parallelsim for Recursive, Data- and Task-Parallel Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal

Modern hardware contains parallel execution resources that are well-suited for data-parallelism-vector units-and task parallelism-multicores. However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently as multicores. Our framework allows us to define schedulers that can dynamically select between executing task- blocks on vector units or multicores. We show that thesemore » schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel pro- grams into task block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.« less
Study on parallel and distributed management of RS data based on spatial database

NASA Astrophysics Data System (ADS)

Chen, Yingbiao; Qian, Qinglan; Wu, Hongqiao; Liu, Shijin

2009-10-01

With the rapid development of current earth-observing technology, RS image data storage, management and information publication become a bottle-neck for its appliance and popularization. There are two prominent problems in RS image data storage and management system. First, background server hardly handle the heavy process of great capacity of RS data which stored at different nodes in a distributing environment. A tough burden has put on the background server. Second, there is no unique, standard and rational organization of Multi-sensor RS data for its storage and management. And lots of information is lost or not included at storage. Faced at the above two problems, the paper has put forward a framework for RS image data parallel and distributed management and storage system. This system aims at RS data information system based on parallel background server and a distributed data management system. Aiming at the above two goals, this paper has studied the following key techniques and elicited some revelatory conclusions. The paper has put forward a solid index of "Pyramid, Block, Layer, Epoch" according to the properties of RS image data. With the solid index mechanism, a rational organization for different resolution, different area, different band and different period of Multi-sensor RS image data is completed. In data storage, RS data is not divided into binary large objects to be stored at current relational database system, while it is reconstructed through the above solid index mechanism. A logical image database for the RS image data file is constructed. In system architecture, this paper has set up a framework based on a parallel server of several common computers. Under the framework, the background process is divided into two parts, the common WEB process and parallel process.
Study on parallel and distributed management of RS data based on spatial data base

NASA Astrophysics Data System (ADS)

Chen, Yingbiao; Qian, Qinglan; Liu, Shijin

2006-12-01

With the rapid development of current earth-observing technology, RS image data storage, management and information publication become a bottle-neck for its appliance and popularization. There are two prominent problems in RS image data storage and management system. First, background server hardly handle the heavy process of great capacity of RS data which stored at different nodes in a distributing environment. A tough burden has put on the background server. Second, there is no unique, standard and rational organization of Multi-sensor RS data for its storage and management. And lots of information is lost or not included at storage. Faced at the above two problems, the paper has put forward a framework for RS image data parallel and distributed management and storage system. This system aims at RS data information system based on parallel background server and a distributed data management system. Aiming at the above two goals, this paper has studied the following key techniques and elicited some revelatory conclusions. The paper has put forward a solid index of "Pyramid, Block, Layer, Epoch" according to the properties of RS image data. With the solid index mechanism, a rational organization for different resolution, different area, different band and different period of Multi-sensor RS image data is completed. In data storage, RS data is not divided into binary large objects to be stored at current relational database system, while it is reconstructed through the above solid index mechanism. A logical image database for the RS image data file is constructed. In system architecture, this paper has set up a framework based on a parallel server of several common computers. Under the framework, the background process is divided into two parts, the common WEB process and parallel process.
Calculating Path-Dependent Travel Time Prediction Variance and Covariance for the SALSA3D Global Tomographic P-Velocity Model with a Distributed Parallel Multi-Core Computer

NASA Astrophysics Data System (ADS)

Hipp, J. R.; Encarnacao, A.; Ballard, S.; Young, C. J.; Phillips, W. S.; Begnaud, M. L.

2011-12-01

Recently our combined SNL-LANL research team has succeeded in developing a global, seamless 3D tomographic P-velocity model (SALSA3D) that provides superior first P travel time predictions at both regional and teleseismic distances. However, given the variable data quality and uneven data sampling associated with this type of model, it is essential that there be a means to calculate high-quality estimates of the path-dependent variance and covariance associated with the predicted travel times of ray paths through the model. In this paper, we show a methodology for accomplishing this by exploiting the full model covariance matrix. Our model has on the order of 1/2 million nodes, so the challenge in calculating the covariance matrix is formidable: 0.9 TB storage for 1/2 of a symmetric matrix, necessitating an Out-Of-Core (OOC) blocked matrix solution technique. With our approach the tomography matrix (G which includes Tikhonov regularization terms) is multiplied by its transpose (GTG) and written in a blocked sub-matrix fashion. We employ a distributed parallel solution paradigm that solves for (GTG)-1 by assigning blocks to individual processing nodes for matrix decomposition update and scaling operations. We first find the Cholesky decomposition of GTG which is subsequently inverted. Next, we employ OOC matrix multiply methods to calculate the model covariance matrix from (GTG)-1 and an assumed data covariance matrix. Given the model covariance matrix we solve for the travel-time covariance associated with arbitrary ray-paths by integrating the model covariance along both ray paths. Setting the paths equal gives variance for that path. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Massively parallel algorithm and implementation of RI-MP2 energy calculation for peta-scale many-core supercomputers.

PubMed

Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito

2016-11-15

A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Microwave switching power divider. [antenna feeds

NASA Technical Reports Server (NTRS)

Stockton, R. J.; Johnson, R. W. (Inventor)

1981-01-01

A pair of parallel, spaced-apart circular ground planes define a microwave cavity with multi-port microwave power distributing switching circuitry formed on opposite sides of a thin circular dielectric substrate disposed between the ground planes. The power distributing circuitry includes a conductive disk located at the center of the substrate and connected to a source of microwave energy. A high speed, low insertion loss switching diode and a dc blocking capacitor are connected in series between the outer end of a transmission line and an output port. A high impedance, microwave blocking dc bias choke is connected between each switching diode and a source of switching current. The switching source forward biases the diodes to couple microwave energy from the conductive disk to selected output ports and, to associated antenna elements connected to the output ports to form a synthesized antenna pattern.

Multi-threading: A new dimension to massively parallel scientific computation

NASA Astrophysics Data System (ADS)

Nielsen, Ida M. B.; Janssen, Curtis L.

2000-06-01

Multi-threading is becoming widely available for Unix-like operating systems, and the application of multi-threading opens new ways for performing parallel computations with greater efficiency. We here briefly discuss the principles of multi-threading and illustrate the application of multi-threading for a massively parallel direct four-index transformation of electron repulsion integrals. Finally, other potential applications of multi-threading in scientific computing are outlined.
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

DOE PAGES

Abraham, Mark James; Murtola, Teemu; Schulz, Roland; ...

2015-07-15

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights, through several new and enhanced parallelization algorithms. This work on every level; SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. Finally, the latest best-in-class compressed trajectory storage format is supported.
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Abraham, Mark James; Murtola, Teemu; Schulz, Roland

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights, through several new and enhanced parallelization algorithms. This work on every level; SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. Finally, the latest best-in-class compressed trajectory storage format is supported.
A Parallel Framework with Block Matrices of a Discrete Fourier Transform for Vector-Valued Discrete-Time Signals.

PubMed

Soto-Quiros, Pablo

2015-01-01

This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework with a set of block matrix operations. These block matrix operations contribute to analysis, design, and implementation of parallel algorithms in multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is advantage to use multicore processors and a parallel computing environment to minimize the high execution time. Additionally, speedup increases when the number of logical processors and length of the signal increase.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about factor 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
Parallel computation of GA search for the artery shape determinants with CFD

NASA Astrophysics Data System (ADS)

Himeno, M.; Noda, S.; Fukasaku, K.; Himeno, R.

2010-06-01

We studied which factors play important role to determine the shape of arteries at the carotid artery bifurcation by performing multi-objective optimization with computation fluid dynamics (CFD) and the genetic algorithm (GA). To perform it, the most difficult problem is how to reduce turn-around time of the GA optimization with 3D unsteady computation of blood flow. We devised two levels of parallel computation method with the following features: level 1: parallel CFD computation with appropriate number of cores; level 2: parallel jobs generated by "master", which finds quickly available job cue and dispatches jobs, to reduce turn-around time. As a result, the turn-around time of one GA trial, which would have taken 462 days with one core, was reduced to less than two days on RIKEN supercomputer system, RICC, with 8192 cores. We performed a multi-objective optimization to minimize the maximum mean WSS and to minimize the sum of circumference for four different shapes and obtained a set of trade-off solutions for each shape. In addition, we found that the carotid bulb has the feature of the minimum local mean WSS and minimum local radius. We confirmed that our method is effective for examining determinants of artery shapes.
Language Classification using N-grams Accelerated by FPGA-based Bloom Filters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jacob, A; Gokhale, M

N-Gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom Filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x comparable software and 1.45x the competing hardware design.
Practical quantum private query of blocks based on unbalanced-state Bennett-Brassard-1984 quantum-key-distribution protocol

NASA Astrophysics Data System (ADS)

Wei, Chun-Yan; Gao, Fei; Wen, Qiao-Yan; Wang, Tian-Yin

2014-12-01

Until now, the only kind of practical quantum private query (QPQ), quantum-key-distribution (QKD)-based QPQ, focuses on the retrieval of a single bit. In fact, meaningful message is generally composed of multiple adjacent bits (i.e., a multi-bit block). To obtain a message from database, the user Alice has to query l times to get each ai. In this condition, the server Bob could gain Alice's privacy once he obtains the address she queried in any of the l queries, since each ai contributes to the message Alice retrieves. Apparently, the longer the retrieved message is, the worse the user privacy becomes. To solve this problem, via an unbalanced-state technique and based on a variant of multi-level BB84 protocol, we present a protocol for QPQ of blocks, which allows the user to retrieve a multi-bit block from database in one query. Our protocol is somewhat like the high-dimension version of the first QKD-based QPQ protocol proposed by Jacobi et al., but some nontrivial modifications are necessary.
Practical quantum private query of blocks based on unbalanced-state Bennett-Brassard-1984 quantum-key-distribution protocol

PubMed Central

Wei, Chun-Yan; Gao, Fei; Wen, Qiao-Yan; Wang, Tian-Yin

2014-01-01

Until now, the only kind of practical quantum private query (QPQ), quantum-key-distribution (QKD)-based QPQ, focuses on the retrieval of a single bit. In fact, meaningful message is generally composed of multiple adjacent bits (i.e., a multi-bit block). To obtain a message from database, the user Alice has to query l times to get each ai. In this condition, the server Bob could gain Alice's privacy once he obtains the address she queried in any of the l queries, since each ai contributes to the message Alice retrieves. Apparently, the longer the retrieved message is, the worse the user privacy becomes. To solve this problem, via an unbalanced-state technique and based on a variant of multi-level BB84 protocol, we present a protocol for QPQ of blocks, which allows the user to retrieve a multi-bit block from database in one query. Our protocol is somewhat like the high-dimension version of the first QKD-based QPQ protocol proposed by Jacobi et al., but some nontrivial modifications are necessary. PMID:25518810
Jet formation and equatorial superrotation in Jupiter's atmosphere: Numerical modelling using a new efficient parallel code

NASA Astrophysics Data System (ADS)

Rivier, Leonard Gilles

Using an efficient parallel code solving the primitive equations of atmospheric dynamics, the jet structure of a Jupiter like atmosphere is modeled. In the first part of this thesis, a parallel spectral code solving both the shallow water equations and the multi-level primitive equations of atmospheric dynamics is built. The implementation of this code called BOB is done so that it runs effectively on an inexpensive cluster of workstations. A one dimensional decomposition and transposition method insuring load balancing among processes is used. The Legendre transform is cache-blocked. A "compute on the fly" of the Legendre polynomials used in the spectral method produces a lower memory footprint and enables high resolution runs on relatively small memory machines. Performance studies are done using a cluster of workstations located at the National Center for Atmospheric Research (NCAR). BOB performances are compared to the parallel benchmark code PSTSWM and the dynamical core of NCAR's CCM3.6.6. In both cases, the comparison favors BOB. In the second part of this thesis, the primitive equation version of the code described in part I is used to study the formation of organized zonal jets and equatorial superrotation in a planetary atmosphere where the parameters are chosen to best model the upper atmosphere of Jupiter. Two levels are used in the vertical and only large scale forcing is present. The model is forced towards a baroclinically unstable flow, so that eddies are generated by baroclinic instability. We consider several types of forcing, acting on either the temperature or the momentum field. We show that only under very specific parametric conditions, zonally elongated structures form and persist resembling the jet structure observed near the cloud level top (1 bar) on Jupiter. We also study the effect of an equatorial heat source, meant to be a crude representation of the effect of the deep convective planetary interior onto the outer atmospheric layer. We show that such heat forcing is able to produce strong equatorial superrotating winds, one of the most striking feature of the Jovian circulation.
Multi-level Hierarchical Poly Tree computer architectures

NASA Technical Reports Server (NTRS)

Padovan, Joe; Gute, Doug

1990-01-01

Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.
A general parallel sparse-blocked matrix multiply for linear scaling SCF theory

NASA Astrophysics Data System (ADS)

Challacombe, Matt

2000-06-01

A general approach to the parallel sparse-blocked matrix-matrix multiply is developed in the context of linear scaling self-consistent-field (SCF) theory. The data-parallel message passing method uses non-blocking communication to overlap computation and communication. The space filling curve heuristic is used to achieve data locality for sparse matrix elements that decay with “separation”. Load balance is achieved by solving the bin packing problem for blocks with variable size.With this new method as the kernel, parallel performance of the simplified density matrix minimization (SDMM) for solution of the SCF equations is investigated for RHF/6-31G ∗∗ water clusters and RHF/3-21G estane globules. Sustained rates above 5.7 GFLOPS for the SDMM have been achieved for (H 2 O) 200 with 95 Origin 2000 processors. Scalability is found to be limited by load imbalance, which increases with decreasing granularity, due primarily to the inhomogeneous distribution of variable block sizes.
Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R

Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with themore » endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.« less
Cellular mechanisms of desynchronizing effects of hypothermia in an in vitro epilepsy model.

PubMed

Motamedi, Gholam K; Gonzalez-Sulser, Alfredo; Dzakpasu, Rhonda; Vicini, Stefano

2012-01-01

Hypothermia can terminate epileptiform discharges in vitro and in vivo epilepsy models. Hypothermia is becoming a standard treatment for brain injury in infants with perinatal hypoxic ischemic encephalopathy, and it is gaining ground as a potential treatment in patients with drug resistant epilepsy. However, the exact mechanism of action of cooling the brain tissue is unclear. We have studied the 4-aminopyridine model of epilepsy in mice using single- and dual-patch clamp and perforated multi-electrode array recordings from the hippocampus and cortex. Cooling consistently terminated 4-aminopyridine induced epileptiform-like discharges in hippocampal neurons and increased input resistance that was not mimicked by transient receptor potential channel antagonists. Dual-patch clamp recordings showed significant synchrony between distant CA1 and CA3 pyramidal neurons, but less so between the pyramidal neurons and interneurons. In CA1 and CA3 neurons, hypothermia blocked rhythmic action potential discharges and disrupted their synchrony; however, in interneurons, hypothermia blocked rhythmic discharges without abolishing action potentials. In parallel, multi-electrode array recordings showed that synchronized discharges were disrupted by hypothermia, whereas multi-unit activity was unaffected. The differential effect of cooling on transmitting or secreting γ-aminobutyric acid interneurons might disrupt normal network synchrony, aborting the epileptiform discharges. Moreover, the persistence of action potential firing in interneurons would have additional antiepileptic effects through tonic γ-aminobutyric acid release.
Politics of innovation in multi-level water governance systems

NASA Astrophysics Data System (ADS)

Daniell, Katherine A.; Coombes, Peter J.; White, Ian

2014-11-01

Innovations are being proposed in many countries in order to support change towards more sustainable and water secure futures. However, the extent to which they can be implemented is subject to complex politics and powerful coalitions across multi-level governance systems and scales of interest. Exactly how innovation uptake can be best facilitated or blocked in these complex systems is thus a matter of important practical and research interest in water cycle management. From intervention research studies in Australia, China and Bulgaria, this paper seeks to describe and analyse the behind-the-scenes struggles and coalition-building that occurs between water utility providers, private companies, experts, communities and all levels of government in an effort to support or block specific innovations. The research findings suggest that in order to ensure successful passage of the proposed innovations, champions for it are required from at least two administrative levels, including one with innovation implementation capacity, as part of a larger supportive coalition. Higher governance levels can play an important enabling role in facilitating the passage of certain types of innovations that may be in competition with currently entrenched systems of water management. Due to a range of natural biases, experts on certain innovations and disciplines may form part of supporting or blocking coalitions but their evaluations of worth for water system sustainability and security are likely to be subject to competing claims based on different values and expertise, so may not necessarily be of use in resolving questions of "best courses of action". This remains a political values-based decision to be negotiated through the receiving multi-level water governance system.
Vertical Scan (V-SCAN) for 3-D Grid Adaptive Mesh Refinement for an atmospheric Model Dynamical Core

NASA Astrophysics Data System (ADS)

Andronova, N. G.; Vandenberg, D.; Oehmke, R.; Stout, Q. F.; Penner, J. E.

2009-12-01

One of the major building blocks of a rigorous representation of cloud evolution in global atmospheric models is a parallel adaptive grid MPI-based communication library (an Adaptive Blocks for Locally Cartesian Topologies library -- ABLCarT), which manages the block-structured data layout, handles ghost cell updates among neighboring blocks and splits a block as refinements occur. The library has several modules that provide a layer of abstraction for adaptive refinement: blocks, which contain individual cells of user data; shells - the global geometry for the problem, including a sphere, reduced sphere, and now a 3D sphere; a load balancer for placement of blocks onto processors; and a communication support layer which encapsulates all data movement. A major performance concern with adaptive mesh refinement is how to represent calculations that have need to be sequenced in a particular order in a direction, such as calculating integrals along a specific path (e.g. atmospheric pressure or geopotential in the vertical dimension). This concern is compounded if the blocks have varying levels of refinement, or are scattered across different processors, as can be the case in parallel computing. In this paper we describe an implementation in ABLCarT of a vertical scan operation, which allows computing along vertical paths in the correct order across blocks transparent to their resolution and processor location. We test this functionality on a 2D and a 3D advection problem, which tests the performance of the model’s dynamics (transport) and physics (sources and sinks) for different model resolutions needed for inclusion of cloud formation.
Least Reliable Bits Coding (LRBC) for high data rate satellite communications

NASA Technical Reports Server (NTRS)

Vanderaar, Mark; Wagner, Paul; Budinger, James

1992-01-01

An analysis and discussion of a bandwidth efficient multi-level/multi-stage block coded modulation technique called Least Reliable Bits Coding (LRBC) is presented. LRBC uses simple multi-level component codes that provide increased error protection on increasingly unreliable modulated bits in order to maintain an overall high code rate that increases spectral efficiency. Further, soft-decision multi-stage decoding is used to make decisions on unprotected bits through corrections made on more protected bits. Using analytical expressions and tight performance bounds it is shown that LRBC can achieve increased spectral efficiency and maintain equivalent or better power efficiency compared to that of Binary Phase Shift Keying (BPSK). Bit error rates (BER) vs. channel bit energy with Additive White Gaussian Noise (AWGN) are given for a set of LRB Reed-Solomon (RS) encoded 8PSK modulation formats with an ensemble rate of 8/9. All formats exhibit a spectral efficiency of 2.67 = (log2(8))(8/9) information bps/Hz. Bit by bit coded and uncoded error probabilities with soft-decision information are determined. These are traded with with code rate to determine parameters that achieve good performance. The relative simplicity of Galois field algebra vs. the Viterbi algorithm and the availability of high speed commercial Very Large Scale Integration (VLSI) for block codes indicates that LRBC using block codes is a desirable method for high data rate implementations.
Neural networks within multi-core optic fibers

PubMed Central

Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

2016-01-01

Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks. PMID:27383911
Neural networks within multi-core optic fibers.

PubMed

Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

2016-07-07

Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.
Architecture and design of a 500-MHz gallium-arsenide processing element for a parallel supercomputer

NASA Technical Reports Server (NTRS)

Fouts, Douglas J.; Butner, Steven E.

1991-01-01

The design of the processing element of GASP, a GaAs supercomputer with a 500-MHz instruction issue rate and 1-GHz subsystem clocks, is presented. The novel, functionally modular, block data flow architecture of GASP is described. The architecture and design of a GASP processing element is then presented. The processing element (PE) is implemented in a hybrid semiconductor module with 152 custom GaAs ICs of eight different types. The effects of the implementation technology on both the system-level architecture and the PE design are discussed. SPICE simulations indicate that parts of the PE are capable of being clocked at 1 GHz, while the rest of the PE uses a 500-MHz clock. The architecture utilizes data flow techniques at a program block level, which allows efficient execution of parallel programs while maintaining reasonably good performance on sequential programs. A simulation study of the architecture indicates that an instruction execution rate of over 30,000 MIPS can be attained with 65 PEs.

A novel modular ANN architecture for efficient monitoring of gases/odours in real-time

NASA Astrophysics Data System (ADS)

Mishra, A.; Rajput, N. S.

2018-04-01

Data pre-processing is tremendously used for enhanced classification of gases. However, it suppresses the concentration variances of different gas samples. A classical solution of using single artificial neural network (ANN) architecture is also inefficient and renders degraded quantification. In this paper, a novel modular ANN design has been proposed to provide an efficient and scalable solution in real–time. Here, two separate ANN blocks viz. classifier block and quantifier block have been used to provide efficient and scalable gas monitoring in real—time. The classifier ANN consists of two stages. In the first stage, the Net 1-NDSRT has been trained to transform raw sensor responses into corresponding virtual multi-sensor responses using normalized difference sensor response transformation (NDSRT). These responses have been fed to the second stage (i.e., Net 2-classifier ). The Net 2-classifier has been trained to classify various gas samples to their respective class. Further, the quantifier block has parallel ANN modules, multiplexed to quantify each gas. Therefore, the classifier ANN decides class and quantifier ANN decides the exact quantity of the gas/odor present in the respective sample of that class.
Adaptive Numerical Algorithms in Space Weather Modeling

NASA Technical Reports Server (NTRS)

Toth, Gabor; vanderHolst, Bart; Sokolov, Igor V.; DeZeeuw, Darren; Gombosi, Tamas I.; Fang, Fang; Manchester, Ward B.; Meng, Xing; Nakib, Dalal; Powell, Kenneth G.;

2010-01-01

Space weather describes the various processes in the Sun-Earth system that present danger to human health and technology. The goal of space weather forecasting is to provide an opportunity to mitigate these negative effects. Physics-based space weather modeling is characterized by disparate temporal and spatial scales as well as by different physics in different domains. A multi-physics system can be modeled by a software framework comprising of several components. Each component corresponds to a physics domain, and each component is represented by one or more numerical models. The publicly available Space Weather Modeling Framework (SWMF) can execute and couple together several components distributed over a parallel machine in a flexible and efficient manner. The framework also allows resolving disparate spatial and temporal scales with independent spatial and temporal discretizations in the various models. Several of the computationally most expensive domains of the framework are modeled by the Block-Adaptive Tree Solar wind Roe Upwind Scheme (BATS-R-US) code that can solve various forms of the magnetohydrodynamics (MHD) equations, including Hall, semi-relativistic, multi-species and multi-fluid MHD, anisotropic pressure, radiative transport and heat conduction. Modeling disparate scales within BATS-R-US is achieved by a block-adaptive mesh both in Cartesian and generalized coordinates. Most recently we have created a new core for BATS-R-US: the Block-Adaptive Tree Library (BATL) that provides a general toolkit for creating, load balancing and message passing in a 1, 2 or 3 dimensional block-adaptive grid. We describe the algorithms of BATL and demonstrate its efficiency and scaling properties for various problems. BATS-R-US uses several time-integration schemes to address multiple time-scales: explicit time stepping with fixed or local time steps, partially steady-state evolution, point-implicit, semi-implicit, explicit/implicit, and fully implicit numerical schemes. Depending on the application, we find that different time stepping methods are optimal. Several of the time integration schemes exploit the block-based granularity of the grid structure. The framework and the adaptive algorithms enable physics based space weather modeling and even forecasting.

Parallel architectures for iterative methods on adaptive, block structured grids

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1983-01-01

A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one to one mapping of grids to systolic style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be a regular global structure to the grids constructed, there will be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.
Update on Development of SiC Multi-Chip Power Modules

NASA Technical Reports Server (NTRS)

Lostetter, Alexander; Cilio, Edgar; Mitchell, Gavin; Schupbach, Roberto

2008-01-01

Progress has been made in a continuing effort to develop multi-chip power modules (SiC MCPMs). This effort at an earlier stage was reported in 'SiC Multi-Chip Power Modules as Power-System Building Blocks' (LEW-18008-1), NASA Tech Briefs, Vol. 31, No. 2 (February 2007), page 28. The following recapitulation of information from the cited prior article is prerequisite to a meaningful summary of the progress made since then: 1) SiC MCPMs are, more specifically, electronic power-supply modules containing multiple silicon carbide power integrated-circuit chips and silicon-on-insulator (SOI) control integrated-circuit chips. SiC MCPMs are being developed as building blocks of advanced expandable, reconfigurable, fault-tolerant power-supply systems. Exploiting the ability of SiC semiconductor devices to operate at temperatures, breakdown voltages, and current densities significantly greater than those of conventional Si devices, the designs of SiC MCPMs and of systems comprising multiple SiC MCPMs are expected to afford a greater degree of miniaturization through stacking of modules with reduced requirements for heat sinking; 2) The stacked SiC MCPMs in a given system can be electrically connected in series, parallel, or a series/parallel combination to increase the overall power-handling capability of the system. In addition to power connections, the modules have communication connections. The SOI controllers in the modules communicate with each other as nodes of a decentralized control network, in which no single controller exerts overall command of the system. Control functions effected via the network include synchronization of switching of power devices and rapid reconfiguration of power connections to enable the power system to continue to supply power to a load in the event of failure of one of the modules; and, 3) In addition to serving as building blocks of reliable power-supply systems, SiC MCPMs could be augmented with external control circuitry to make them perform additional power-handling functions as needed for specific applications. Because identical SiC MCPM building blocks could be utilized in such a variety of ways, the cost and difficulty of designing new, highly reliable power systems would be reduced considerably. This concludes the information from the cited prior article. The main activity since the previously reported stage of development was the design, fabrication, and testing a 120- VDC-to-28-VDC modular power-converter system composed of eight SiC MCPMs in a 4 (parallel)-by-2 (series) matrix configuration, with normally-off controllable power switches. The SiC MCPM power modules include closed-loop control subsystems and are capable of operating at high power density or high temperature. The system was tested under various configurations, load conditions, load-transient conditions, and failure-recovery conditions. Planned future work includes refinement of the demonstrated modular system concept and development of a new converter hardware topology that would enable sharing of currents without the need for communication among modules. Toward these ends, it is also planned to develop a new converter control algorithm that would provide for improved sharing of current and power under all conditions, and to implement advanced packaging concepts that would enable operation at higher power density.
On nonlinear finite element analysis in single-, multi- and parallel-processors

NASA Technical Reports Server (NTRS)

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda A [Rochester, MN; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

2012-01-10

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application

DOEpatents

Archer, Charles J [Rochester, MN; Blocksome, Michael A [Rochester, MN; Peters, Amanda E [Cambridge, MA; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN

2012-04-17

Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation.
Portable multi-node LQCD Monte Carlo simulations using OpenACC

NASA Astrophysics Data System (ADS)

Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele

This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to be adopted to maximize performances, we then describe selected relevant details of the code, and finally measure the level of performance and scaling-performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performances for this application, but also compares with results measured on other processors.
Parallel Gaussian elimination of a block tridiagonal matrix using multiple microcomputers

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1989-01-01

The solution of a block tridiagonal matrix using parallel processing is demonstrated. The multiprocessor system on which results were obtained and the software environment used to program that system are described. Theoretical partitioning and resource allocation for the Gaussian elimination method used to solve the matrix are discussed. The results obtained from running 1, 2 and 3 processor versions of the block tridiagonal solver are presented. The PASCAL source code for these solvers is given in the appendix, and may be transportable to other shared memory parallel processors provided that the synchronization outlines are reproduced on the target system.
Efficient implementation of parallel three-dimensional FFT on clusters of PCs

NASA Astrophysics Data System (ADS)

Takahashi, Daisuke

2003-05-01

In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
Non-image forming effects of illuminance level: Exploring parallel effects on physiological arousal and task performance.

PubMed

Huiberts, Laura M; Smolders, Karin C H J; de Kort, Yvonne A W

2016-10-01

This study investigated diurnal non-image forming (NIF) effects of illuminance level on physiological arousal in parallel to NIF effects on vigilance and working memory performance. We employed a counterbalanced within-subjects design in which thirty-nine participants (mean age=21.2; SD=2.1; 11 male) completed three 90-min sessions (165 vs. 600lx vs. 1700lx at eye level) either in the morning (N=18) or afternoon (N=21). During each session, participants completed four measurement blocks (incl. one baseline block) each consisting of a 10-min Psychomotor Vigilance Task (PVT) and a Backwards Digit-Span Task (BDST) including easy trials (4-6 digits) and difficult trials (7-8 digits). Heart rate (HR), skin conductance level (SCL) and systolic blood pressure (SBP) were measured continuously. The results revealed significant improvements in performance on the BDST difficult trials under 1700lx vs. 165lx (p=0.01), while illuminance level did not affect performance on the PVT and BDST easy trials. Illuminance level impacted HR and SCL, but not SBP. In the afternoon sessions, HR was significantly higher under 1700lx vs. 165lx during PVT performance (p=0.05), while during BDST performance, HR was only slightly higher under 600 vs. 165lx (p=0.06). SCL was significantly higher under 1700lx vs. 165lx during performance on BDST easy trials (p=0.02) and showed similar, but nonsignificant trends during the PVT and BDST difficult trials. Although both physiology and performance were affected by illuminance level, no consistent pattern emerged with respect to parallel changes in physiology and performance. Rather, physiology and performance seemed to be affected independently, via unique pathways. Copyright © 2016. Published by Elsevier Inc.
Addressable multi-nozzle electrohydrodynamic jet printing with high consistency by multi-level voltage method

NASA Astrophysics Data System (ADS)

Pan, Yanqiao; Huang, YongAn; Guo, Lei; Ding, Yajiang; Yin, Zhouping

2015-04-01

It is critical and challenging to achieve the individual jetting ability and high consistency in multi-nozzle electrohydrodynamic jet printing (E-jet printing). We proposed multi-level voltage method (MVM) to implement the addressable E-jet printing using multiple parallel nozzles with high consistency. The fabricated multi-nozzle printhead for MVM consists of three parts: PMMA holder, stainless steel capillaries (27G, outer diameter 400 μm) and FR-4 extractor layer. The key of MVM is to control the maximum meniscus electric field on each nozzle. The individual jetting control can be implemented when the rings under the jetting nozzles are 0 kV and the other rings are 0.5 kV. The onset electric field for each nozzle is ˜3.4 kV/mm by numerical simulation. Furthermore, a series of printing experiments are performed to show the advantage of MVM in printing consistency than the "one-voltage method" and "improved E-jet method", by combination with finite element analyses. The good dimension consistency (274μm, 276μm, 280μm) and position consistency of the droplet array on the hydrophobic Si substrate verified the enhancements. It shows that MVM is an effective technique to implement the addressable E-jet printing with multiple parallel nozzles in high consistency.
Parallel Large-Scale Molecular Dynamics Simulation Opens New Perspective to Clarify the Effect of a Porous Structure on the Sintering Process of Ni/YSZ Multiparticles.

PubMed

Xu, Jingxiang; Higuchi, Yuji; Ozawa, Nobuki; Sato, Kazuhisa; Hashida, Toshiyuki; Kubo, Momoji

2017-09-20

Ni sintering in the Ni/YSZ porous anode of a solid oxide fuel cell changes the porous structure, leading to degradation. Preventing sintering and degradation during operation is a great challenge. Usually, a sintering molecular dynamics (MD) simulation model consisting of two particles on a substrate is used; however, the model cannot reflect the porous structure effect on sintering. In our previous study, a multi-nanoparticle sintering modeling method with tens of thousands of atoms revealed the effect of the particle framework and porosity on sintering. However, the method cannot reveal the effect of the particle size on sintering and the effect of sintering on the change in the porous structure. In the present study, we report a strategy to reveal them in the porous structure by using our multi-nanoparticle modeling method and a parallel large-scale multimillion-atom MD simulator. We used this method to investigate the effect of YSZ particle size and tortuosity on sintering and degradation in the Ni/YSZ anodes. Our parallel large-scale MD simulation showed that the sintering degree decreased as the YSZ particle size decreased. The gas fuel diffusion path, which reflects the overpotential, was blocked by pore coalescence during sintering. The degradation of gas diffusion performance increased as the YSZ particle size increased. Furthermore, the gas diffusion performance was quantified by a tortuosity parameter and an optimal YSZ particle size, which is equal to that of Ni, was found for good diffusion after sintering. These findings cannot be obtained by previous MD sintering studies with tens of thousands of atoms. The present parallel large-scale multimillion-atom MD simulation makes it possible to clarify the effects of the particle size and tortuosity on sintering and degradation.
Practical quantum private query of blocks based on unbalanced-state Bennett-Brassard-1984 quantum-key-distribution protocol.

PubMed

Wei, Chun-Yan; Gao, Fei; Wen, Qiao-Yan; Wang, Tian-Yin

2014-12-18

Until now, the only kind of practical quantum private query (QPQ), quantum-key-distribution (QKD)-based QPQ, focuses on the retrieval of a single bit. In fact, meaningful message is generally composed of multiple adjacent bits (i.e., a multi-bit block). To obtain a message a1a2···al from database, the user Alice has to query l times to get each ai. In this condition, the server Bob could gain Alice's privacy once he obtains the address she queried in any of the l queries, since each a(i) contributes to the message Alice retrieves. Apparently, the longer the retrieved message is, the worse the user privacy becomes. To solve this problem, via an unbalanced-state technique and based on a variant of multi-level BB84 protocol, we present a protocol for QPQ of blocks, which allows the user to retrieve a multi-bit block from database in one query. Our protocol is somewhat like the high-dimension version of the first QKD-based QPQ protocol proposed by Jacobi et al., but some nontrivial modifications are necessary.
CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres

NASA Astrophysics Data System (ADS)

Egel, Amos; Pattelli, Lorenzo; Mazzamuto, Giacomo; Wiersma, Diederik S.; Lemmer, Uli

2017-09-01

CELES is a freely available MATLAB toolbox to simulate light scattering by many spherical particles. Aiming at high computational performance, CELES leverages block-diagonal preconditioning, a lookup-table approach to evaluate costly functions and massively parallel execution on NVIDIA graphics processing units using the CUDA computing platform. The combination of these techniques allows to efficiently address large electrodynamic problems (>104 scatterers) on inexpensive consumer hardware. In this paper, we validate near- and far-field distributions against the well-established multi-sphere T-matrix (MSTM) code and discuss the convergence behavior for ensembles of different sizes, including an exemplary system comprising 105 particles.
Real-time multi-mode neutron multiplicity counter

DOEpatents

Rowland, Mark S; Alvarez, Raymond A

2013-02-26

Embodiments are directed to a digital data acquisition method that collects data regarding nuclear fission at high rates and performs real-time preprocessing of large volumes of data into directly useable forms for use in a system that performs non-destructive assaying of nuclear material and assemblies for mass and multiplication of special nuclear material (SNM). Pulses from a multi-detector array are fed in parallel to individual inputs that are tied to individual bits in a digital word. Data is collected by loading a word at the individual bit level in parallel, to reduce the latency associated with current shift-register systems. The word is read at regular intervals, all bits simultaneously, with no manipulation. The word is passed to a number of storage locations for subsequent processing, thereby removing the front-end problem of pulse pileup. The word is used simultaneously in several internal processing schemes that assemble the data in a number of more directly useable forms. The detector includes a multi-mode counter that executes a number of different count algorithms in parallel to determine different attributes of the count data.
Compliant Robot Wrist

NASA Technical Reports Server (NTRS)

Voellmer, George

1992-01-01

Compliant element for robot wrist accepts small displacements in one direction only (to first approximation). Three such elements combined to obtain translational compliance along three orthogonal directions, without rotational compliance along any of them. Element is double-blade flexure joint in which two sheets of spring steel attached between opposing blocks, forming rectangle. Blocks moved parallel to each other in one direction only. Sheets act as double cantilever beams deforming in S-shape, keeping blocks parallel.
Parallelized implicit propagators for the finite-difference Schrödinger equation

NASA Astrophysics Data System (ADS)

Parker, Jonathan; Taylor, K. T.

1995-08-01

We describe the application of block Gauss-Seidel and block Jacobi iterative methods to the design of implicit propagators for finite-difference models of the time-dependent Schrödinger equation. The block-wise iterative methods discussed here are mixed direct-iterative methods for solving simultaneous equations, in the sense that direct methods (e.g. LU decomposition) are used to invert certain block sub-matrices, and iterative methods are used to complete the solution. We describe parallel variants of the basic algorithm that are well suited to the medium- to coarse-grained parallelism of work-station clusters, and MIMD supercomputers, and we show that under a wide range of conditions, fine-grained parallelism of the computation can be achieved. Numerical tests are conducted on a typical one-electron atom Hamiltonian. The methods converge robustly to machine precision (15 significant figures), in some cases in as few as 6 or 7 iterations. The rate of convergence is nearly independent of the finite-difference grid-point separations.
Toward a Model Framework of Generalized Parallel Componential Processing of Multi-Symbol Numbers

ERIC Educational Resources Information Center

Huber, Stefan; Cornelsen, Sonja; Moeller, Korbinian; Nuerk, Hans-Christoph

2015-01-01

In this article, we propose and evaluate a new model framework of parallel componential multi-symbol number processing, generalizing the idea of parallel componential processing of multi-digit numbers to the case of negative numbers by considering the polarity signs similar to single digits. In a first step, we evaluated this account by defining…
Real-time multiplicity counter

DOEpatents

Rowland, Mark S [Alamo, CA; Alvarez, Raymond A [Berkeley, CA

2010-07-13

A neutron multi-detector array feeds pulses in parallel to individual inputs that are tied to individual bits in a digital word. Data is collected by loading a word at the individual bit level in parallel. The word is read at regular intervals, all bits simultaneously, to minimize latency. The electronics then pass the word to a number of storage locations for subsequent processing, thereby removing the front-end problem of pulse pileup.

Bi-level multi-source learning for heterogeneous block-wise missing data.

PubMed

Xiang, Shuo; Yuan, Lei; Fan, Wei; Wang, Yalin; Thompson, Paul M; Ye, Jieping

2014-11-15

Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc., are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feature-pruning and data source selection are critical to learn interpretable models from high-dimensional data. Often, the data collected has block-wise missing entries. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), most subjects have MRI and genetic information, but only half have cerebrospinal fluid (CSF) measures, a different half has FDG-PET; only some have proteomic data. Here we propose how to effectively integrate information from multiple heterogeneous data sources when data is block-wise missing. We present a unified "bi-level" learning model for complete multi-source data, and extend it to incomplete data. Our major contributions are: (1) our proposed models unify feature-level and source-level analysis, including several existing feature learning approaches as special cases; (2) the model for incomplete data avoids imputing missing data and offers superior performance; it generalizes to other applications with block-wise missing data sources; (3) we present efficient optimization algorithms for modeling complete and incomplete data. We comprehensively evaluate the proposed models including all ADNI subjects with at least one of four data types at baseline: MRI, FDG-PET, CSF and proteomics. Our proposed models compare favorably with existing approaches. © 2013 Elsevier Inc. All rights reserved.
Some fast elliptic solvers on parallel architectures and their complexities

NASA Technical Reports Server (NTRS)

Gallopoulos, E.; Saad, Y.

1989-01-01

The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. In this paper, the mapping of BCR on distributed memory architectures is discussed, and its complexity is compared with that of other approaches including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational compelxity lower than that of parallel BCR.
Some fast elliptic solvers on parallel architectures and their complexities

NASA Technical Reports Server (NTRS)

Gallopoulos, E.; Saad, Youcef

1989-01-01

The discretization of separable elliptic partial differential equations leads to linear systems with special block triangular matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm which handles equations with nonconsistant coefficients. A method was recently proposed to parallelize and vectorize BCR. Here, the mapping of BCR on distributed memory architectures is discussed, and its complexity is compared with that of other approaches, including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational complexity lower than that of parallel BCR.
Facilitating higher-fidelity simulations of axial compressor instability and other turbomachinery flow conditions

NASA Astrophysics Data System (ADS)

Herrick, Gregory Paul

The quest to accurately capture flow phenomena with length-scales both short and long and to accurately represent complex flow phenomena within disparately sized geometry inspires a need for an efficient, high-fidelity, multi-block structured computational fluid dynamics (CFD) parallel computational scheme. This research presents and demonstrates a more efficient computational method by which to perform multi-block structured CFD parallel computational simulations, thus facilitating higher-fidelity solutions of complicated geometries (due to the inclusion of grids for "small'' flow areas which are often merely modeled) and their associated flows. This computational framework offers greater flexibility and user-control in allocating the resource balance between process count and wall-clock computation time. The principal modifications implemented in this revision consist of a "multiple grid block per processing core'' software infrastructure and an analytic computation of viscous flux Jacobians. The development of this scheme is largely motivated by the desire to simulate axial compressor stall inception with more complete gridding of the flow passages (including rotor tip clearance regions) than has been previously done while maintaining high computational efficiency (i.e., minimal consumption of computational resources), and thus this paradigm shall be demonstrated with an examination of instability in a transonic axial compressor. However, the paradigm presented herein facilitates CFD simulation of myriad previously impractical geometries and flows and is not limited to detailed analyses of axial compressor flows. While the simulations presented herein were technically possible under the previous structure of the subject software, they were much less computationally efficient and thus not pragmatically feasible; the previous research using this software to perform three-dimensional, full-annulus, time-accurate, unsteady, full-stage (with sliding-interface) simulations of rotating stall inception in axial compressors utilized tip clearance periodic models, while the scheme here is demonstrated by a simulation of axial compressor stall inception utilizing gridded rotor tip clearance regions. As will be discussed, much previous research---experimental, theoretical, and computational---has suggested that understanding clearance flow behavior is critical to understanding stall inception, and previous computational research efforts which have used tip clearance models have begged the question, "What about the clearance flows?''. This research begins to address that question.
Seamless contiguity method for parallel segmentation of remote sensing image

NASA Astrophysics Data System (ADS)

Wang, Geng; Wang, Guanghui; Yu, Mei; Cui, Chengling

2015-12-01

Seamless contiguity is the key technology for parallel segmentation of remote sensing data with large quantities. It can be effectively integrate fragments of the parallel processing into reasonable results for subsequent processes. There are numerous methods reported in the literature for seamless contiguity, such as establishing buffer, area boundary merging and data sewing. et. We proposed a new method which was also based on building buffers. The seamless contiguity processes we adopt are based on the principle: ensuring the accuracy of the boundary, ensuring the correctness of topology. Firstly, block number is computed based on data processing ability, unlike establishing buffer on both sides of block line, buffer is established just on the right side and underside of the line. Each block of data is segmented respectively and then gets the segmentation objects and their label value. Secondly, choose one block(called master block) and do stitching on the adjacent blocks(called slave block), process the rest of the block in sequence. Through the above processing, topological relationship and boundaries of master block are guaranteed. Thirdly, if the master block polygons boundaries intersect with buffer boundary and the slave blocks polygons boundaries intersect with block line, we adopt certain rules to merge and trade-offs them. Fourthly, check the topology and boundary in the buffer area. Finally, a set of experiments were conducted and prove the feasibility of this method. This novel seamless contiguity algorithm provides an applicable and practical solution for efficient segmentation of massive remote sensing image.
Verification of Electromagnetic Physics Models for Parallel Computing Architectures in the GeantV Project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Amadio, G.; et al.

An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physicsmore » models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.« less
An electrostatic Particle-In-Cell code on multi-block structured meshes

NASA Astrophysics Data System (ADS)

Meierbachtol, Collin S.; Svyatskiy, Daniil; Delzanno, Gian Luca; Vernon, Louis J.; Moulton, J. David

2017-12-01

We present an electrostatic Particle-In-Cell (PIC) code on multi-block, locally structured, curvilinear meshes called Curvilinear PIC (CPIC). Multi-block meshes are essential to capture complex geometries accurately and with good mesh quality, something that would not be possible with single-block structured meshes that are often used in PIC and for which CPIC was initially developed. Despite the structured nature of the individual blocks, multi-block meshes resemble unstructured meshes in a global sense and introduce several new challenges, such as the presence of discontinuities in the mesh properties and coordinate orientation changes across adjacent blocks, and polyjunction points where an arbitrary number of blocks meet. In CPIC, these challenges have been met by an approach that features: (1) a curvilinear formulation of the PIC method: each mesh block is mapped from the physical space, where the mesh is curvilinear and arbitrarily distorted, to the logical space, where the mesh is uniform and Cartesian on the unit cube; (2) a mimetic discretization of Poisson's equation suitable for multi-block meshes; and (3) a hybrid (logical-space position/physical-space velocity), asynchronous particle mover that mitigates the performance degradation created by the necessity to track particles as they move across blocks. The numerical accuracy of CPIC was verified using two standard plasma-material interaction tests, which demonstrate good agreement with the corresponding analytic solutions. Compared to PIC codes on unstructured meshes, which have also been used for their flexibility in handling complex geometries but whose performance suffers from issues associated with data locality and indirect data access patterns, PIC codes on multi-block structured meshes may offer the best compromise for capturing complex geometries while also maintaining solution accuracy and computational efficiency.
An electrostatic Particle-In-Cell code on multi-block structured meshes

DOE PAGES

Meierbachtol, Collin S.; Svyatskiy, Daniil; Delzanno, Gian Luca; ...

2017-09-14

We present an electrostatic Particle-In-Cell (PIC) code on multi-block, locally structured, curvilinear meshes called Curvilinear PIC (CPIC). Multi-block meshes are essential to capture complex geometries accurately and with good mesh quality, something that would not be possible with single-block structured meshes that are often used in PIC and for which CPIC was initially developed. In spite of the structured nature of the individual blocks, multi-block meshes resemble unstructured meshes in a global sense and introduce several new challenges, such as the presence of discontinuities in the mesh properties and coordinate orientation changes across adjacent blocks, and polyjunction points where anmore » arbitrary number of blocks meet. In CPIC, these challenges have been met by an approach that features: (1) a curvilinear formulation of the PIC method: each mesh block is mapped from the physical space, where the mesh is curvilinear and arbitrarily distorted, to the logical space, where the mesh is uniform and Cartesian on the unit cube; (2) a mimetic discretization of Poisson's equation suitable for multi-block meshes; and (3) a hybrid (logical-space position/physical-space velocity), asynchronous particle mover that mitigates the performance degradation created by the necessity to track particles as they move across blocks. The numerical accuracy of CPIC was verified using two standard plasma–material interaction tests, which demonstrate good agreement with the corresponding analytic solutions. And compared to PIC codes on unstructured meshes, which have also been used for their flexibility in handling complex geometries but whose performance suffers from issues associated with data locality and indirect data access patterns, PIC codes on multi-block structured meshes may offer the best compromise for capturing complex geometries while also maintaining solution accuracy and computational efficiency.« less
An electrostatic Particle-In-Cell code on multi-block structured meshes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Meierbachtol, Collin S.; Svyatskiy, Daniil; Delzanno, Gian Luca

We present an electrostatic Particle-In-Cell (PIC) code on multi-block, locally structured, curvilinear meshes called Curvilinear PIC (CPIC). Multi-block meshes are essential to capture complex geometries accurately and with good mesh quality, something that would not be possible with single-block structured meshes that are often used in PIC and for which CPIC was initially developed. In spite of the structured nature of the individual blocks, multi-block meshes resemble unstructured meshes in a global sense and introduce several new challenges, such as the presence of discontinuities in the mesh properties and coordinate orientation changes across adjacent blocks, and polyjunction points where anmore » arbitrary number of blocks meet. In CPIC, these challenges have been met by an approach that features: (1) a curvilinear formulation of the PIC method: each mesh block is mapped from the physical space, where the mesh is curvilinear and arbitrarily distorted, to the logical space, where the mesh is uniform and Cartesian on the unit cube; (2) a mimetic discretization of Poisson's equation suitable for multi-block meshes; and (3) a hybrid (logical-space position/physical-space velocity), asynchronous particle mover that mitigates the performance degradation created by the necessity to track particles as they move across blocks. The numerical accuracy of CPIC was verified using two standard plasma–material interaction tests, which demonstrate good agreement with the corresponding analytic solutions. And compared to PIC codes on unstructured meshes, which have also been used for their flexibility in handling complex geometries but whose performance suffers from issues associated with data locality and indirect data access patterns, PIC codes on multi-block structured meshes may offer the best compromise for capturing complex geometries while also maintaining solution accuracy and computational efficiency.« less
Duct flow nonuniformities: Effect of struts in SSME HGM II(+)

NASA Technical Reports Server (NTRS)

Burke, Roger

1988-01-01

A numerical study, using the INS3D flow solver, of laminar and turbulent flow around a two dimensional strut, and three dimensional flow around a strut in an annulus is presented. A multi-block procedure was used to calculate two dimensional laminar flow around two struts in parallel, with each strut represented by one computational block. Single block calculations were performed for turbulent flow around a two dimensional strut, using a Baldwin-Lomax turbulence model to parameterize the turbulent shear stresses. A modified Baldwin-Lomax model was applied to the case of a three dimensional strut in an annulus. The results displayed the essential features of wing-body flows, including the presence of a horseshoe vortex system at the junction of the strut and the lower annulus surface. A similar system was observed at the upper annulus surface. The test geometries discussed were useful in developing the capability to perform multiblock calculations, and to simulate turbulent flow around obstructions located between curved walls. Both of these skills will be necessary to model the three dimensional flow in the strut assembly of the SSME. Work is now in progress on performing a three dimensional two block turbulent calculation of the flow in the turnaround duct (TAD) and strut/fuel bowl juncture region.
Integrated Task and Data Parallel Programming

NASA Technical Reports Server (NTRS)

Grimshaw, A. S.

1998-01-01

This research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities During the fall I collaborated with Andrew Grimshaw and Adam Ferrari to write a book chapter which will be included in Parallel Processing in C++ edited by Gregory Wilson. I also finished two courses, Compilers and Advanced Compilers, in 1995. These courses complete my class requirements at the University of Virginia. I have only my dissertation research and defense to complete.
Integrated Task And Data Parallel Programming: Language Design

NASA Technical Reports Server (NTRS)

Grimshaw, Andrew S.; West, Emily A.

1998-01-01

his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated with Andrew Grimshaw and Adam Ferrari to write a book chapter which will be included in Parallel Processing in C++ edited by Gregory Wilson. I also finished two courses, Compilers and Advanced Compilers, in 1995. These courses complete my class requirements at the University of Virginia. I have only my dissertation research and defense to complete.
"Science SQL" as a Building Block for Flexible, Standards-based Data Infrastructures

NASA Astrophysics Data System (ADS)

Baumann, Peter

2016-04-01

We have learnt to live with the pain of separating data and metadata into non-interoperable silos. For metadata, we enjoy the flexibility of databases, be they relational, graph, or some other NoSQL. Contrasting this, users still "drown in files" as an unstructured, low-level archiving paradigm. It is time to bridge this chasm which once was technologically induced, but today can be overcome. One building block towards a common re-integrated information space is to support massive multi-dimensional spatio-temporal arrays. These "datacubes" appear as sensor, image, simulation, and statistics data in all science and engineering domains, and beyond. For example, 2-D satellilte imagery, 2-D x/y/t image timeseries and x/y/z geophysical voxel data, and 4-D x/y/z/t climate data contribute to today's data deluge in the Earth sciences. Virtual observatories in the Space sciences routinely generate Petabytes of such data. Life sciences deal with microarray data, confocal microscopy, human brain data, which all fall into the same category. The ISO SQL/MDA (Multi-Dimensional Arrays) candidate standard is extending SQL with modelling and query support for n-D arrays ("datacubes") in a flexible, domain-neutral way. This heralds a new generation of services with new quality parameters, such as flexibility, ease of access, embedding into well-known user tools, and scalability mechanisms that remain completely transparent to users. Technology like the EU rasdaman ("raster data manager") Array Database system can support all of the above examples simultaneously, with one technology. This is practically proven: As of today, rasdaman is in operational use on hundreds of Terabytes of satellite image timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Therefore, Array Databases offering SQL/MDA constitute a natural common building block for next-generation data infrastructures. Being initiator and editor of the standard we present principles, implementation facets, and application examples as a basis for further discussion. Further, we highlight recent implementation progress in parallelization, data distribution, and query optimization showing their effects on real-life use cases.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
A novel method for a multi-level hierarchical composite with brick-and-mortar structure

PubMed Central

Brandt, Kristina; Wolff, Michael F. H.; Salikov, Vitalij; Heinrich, Stefan; Schneider, Gerold A.

2013-01-01

The fascination for hierarchically structured hard tissues such as enamel or nacre arises from their unique structure-properties-relationship. During the last decades this numerously motivated the synthesis of composites, mimicking the brick-and-mortar structure of nacre. However, there is still a lack in synthetic engineering materials displaying a true hierarchical structure. Here, we present a novel multi-step processing route for anisotropic 2-level hierarchical composites by combining different coating techniques on different length scales. It comprises polymer-encapsulated ceramic particles as building blocks for the first level, followed by spouted bed spray granulation for a second level, and finally directional hot pressing to anisotropically consolidate the composite. The microstructure achieved reveals a brick-and-mortar hierarchical structure with distinct, however not yet optimized mechanical properties on each level. It opens up a completely new processing route for the synthesis of multi-level hierarchically structured composites, giving prospects to multi-functional structure-properties relationships. PMID:23900554
A novel method for a multi-level hierarchical composite with brick-and-mortar structure.

PubMed

Brandt, Kristina; Wolff, Michael F H; Salikov, Vitalij; Heinrich, Stefan; Schneider, Gerold A

2013-01-01

The fascination for hierarchically structured hard tissues such as enamel or nacre arises from their unique structure-properties-relationship. During the last decades this numerously motivated the synthesis of composites, mimicking the brick-and-mortar structure of nacre. However, there is still a lack in synthetic engineering materials displaying a true hierarchical structure. Here, we present a novel multi-step processing route for anisotropic 2-level hierarchical composites by combining different coating techniques on different length scales. It comprises polymer-encapsulated ceramic particles as building blocks for the first level, followed by spouted bed spray granulation for a second level, and finally directional hot pressing to anisotropically consolidate the composite. The microstructure achieved reveals a brick-and-mortar hierarchical structure with distinct, however not yet optimized mechanical properties on each level. It opens up a completely new processing route for the synthesis of multi-level hierarchically structured composites, giving prospects to multi-functional structure-properties relationships.
A novel method for a multi-level hierarchical composite with brick-and-mortar structure

NASA Astrophysics Data System (ADS)

Brandt, Kristina; Wolff, Michael F. H.; Salikov, Vitalij; Heinrich, Stefan; Schneider, Gerold A.

2013-07-01

The fascination for hierarchically structured hard tissues such as enamel or nacre arises from their unique structure-properties-relationship. During the last decades this numerously motivated the synthesis of composites, mimicking the brick-and-mortar structure of nacre. However, there is still a lack in synthetic engineering materials displaying a true hierarchical structure. Here, we present a novel multi-step processing route for anisotropic 2-level hierarchical composites by combining different coating techniques on different length scales. It comprises polymer-encapsulated ceramic particles as building blocks for the first level, followed by spouted bed spray granulation for a second level, and finally directional hot pressing to anisotropically consolidate the composite. The microstructure achieved reveals a brick-and-mortar hierarchical structure with distinct, however not yet optimized mechanical properties on each level. It opens up a completely new processing route for the synthesis of multi-level hierarchically structured composites, giving prospects to multi-functional structure-properties relationships.
Flexible feature-space-construction architecture and its VLSI implementation for multi-scale object detection

NASA Astrophysics Data System (ADS)

Luo, Aiwen; An, Fengwei; Zhang, Xiangyu; Chen, Lei; Huang, Zunkai; Jürgen Mattausch, Hans

2018-04-01

Feature extraction techniques are a cornerstone of object detection in computer-vision-based applications. The detection performance of vison-based detection systems is often degraded by, e.g., changes in the illumination intensity of the light source, foreground-background contrast variations or automatic gain control from the camera. In order to avoid such degradation effects, we present a block-based L1-norm-circuit architecture which is configurable for different image-cell sizes, cell-based feature descriptors and image resolutions according to customization parameters from the circuit input. The incorporated flexibility in both the image resolution and the cell size for multi-scale image pyramids leads to lower computational complexity and power consumption. Additionally, an object-detection prototype for performance evaluation in 65 nm CMOS implements the proposed L1-norm circuit together with a histogram of oriented gradients (HOG) descriptor and a support vector machine (SVM) classifier. The proposed parallel architecture with high hardware efficiency enables real-time processing, high detection robustness, small chip-core area as well as low power consumption for multi-scale object detection.
A review on battery thermal management in electric vehicle application

NASA Astrophysics Data System (ADS)

Xia, Guodong; Cao, Lei; Bi, Guanglong

2017-11-01

The global issues of energy crisis and air pollution have offered a great opportunity to develop electric vehicles. However, so far, cycle life of power battery, environment adaptability, driving range and charging time seems far to compare with the level of traditional vehicles with internal combustion engine. Effective battery thermal management (BTM) is absolutely essential to relieve this situation. This paper reviews the existing literature from two levels that are cell level and battery module level. For single battery, specific attention is paid to three important processes which are heat generation, heat transport, and heat dissipation. For large format cell, multi-scale multi-dimensional coupled models have been developed. This will facilitate the investigation on factors, such as local irreversible heat generation, thermal resistance, current distribution, etc., that account for intrinsic temperature gradients existing in cell. For battery module based on air and liquid cooling, series, series-parallel and parallel cooling configurations are discussed. Liquid cooling strategies, especially direct liquid cooling strategies, are reviewed and they may advance the battery thermal management system to a new generation.
Multi-variants synthesis of Petri nets for FPGA devices

NASA Astrophysics Data System (ADS)

Bukowiec, Arkadiusz; Doligalski, Michał

2015-09-01

There is presented new method of synthesis of application specific logic controllers for FPGA devices. The specification of control algorithm is made with use of control interpreted Petri net (PT type). It allows specifying parallel processes in easy way. The Petri net is decomposed into state-machine type subnets. In this case, each subnet represents one parallel process. For this purpose there are applied algorithms of coloring of Petri nets. There are presented two approaches of such decomposition: with doublers of macroplaces or with one global wait place. Next, subnets are implemented into two-level logic circuit of the controller. The levels of logic circuit are obtained as a result of its architectural decomposition. The first level combinational circuit is responsible for generation of next places and second level decoder is responsible for generation output symbols. There are worked out two variants of such circuits: with one shared operational memory or with many flexible distributed memories as a decoder. Variants of Petri net decomposition and structures of logic circuits can be combined together without any restrictions. It leads to existence of four variants of multi-variants synthesis.

Self-assembly assisted polymerization (SAAP): approaching long multi-block copolymers with an ordered chain sequence and controllable block length.

PubMed

Wu, Chi; Xie, Zuowei; Zhang, Guangzhao; Zi, Guofu; Tu, Yingfeng; Yang, Yali; Cai, Ping; Nie, Ting

2002-12-07

A combination of polymer physics and synthetic chemistry has enabled us to develop self-assembly assisted polymerization (SAAP), leading to the preparation of long multi-block copolymers with an ordered chain sequence and controllable block lengths.
Parallel Computing for the Computed-Tomography Imaging Spectrometer

NASA Technical Reports Server (NTRS)

Lee, Seungwon

2008-01-01

This software computes the tomographic reconstruction of spatial-spectral data from raw detector images of the Computed-Tomography Imaging Spectrometer (CTIS), which enables transient-level, multi-spectral imaging by capturing spatial and spectral information in a single snapshot.
Gaussian curvature analysis allows for automatic block placement in multi-block hexahedral meshing.

PubMed

Ramme, Austin J; Shivanna, Kiran H; Magnotta, Vincent A; Grosland, Nicole M

2011-10-01

Musculoskeletal finite element analysis (FEA) has been essential to research in orthopaedic biomechanics. The generation of a volumetric mesh is often the most challenging step in a FEA. Hexahedral meshing tools that are based on a multi-block approach rely on the manual placement of building blocks for their mesh generation scheme. We hypothesise that Gaussian curvature analysis could be used to automatically develop a building block structure for multi-block hexahedral mesh generation. The Automated Building Block Algorithm incorporates principles from differential geometry, combinatorics, statistical analysis and computer science to automatically generate a building block structure to represent a given surface without prior information. We have applied this algorithm to 29 bones of varying geometries and successfully generated a usable mesh in all cases. This work represents a significant advancement in automating the definition of building blocks.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.

2003-01-01

Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Homogeneous-oxide stack in IGZO thin-film transistors for multi-level-cell NAND memory application

NASA Astrophysics Data System (ADS)

Ji, Hao; Wei, Yehui; Zhang, Xinlei; Jiang, Ran

2017-11-01

A nonvolatile charge-trap-flash memory that is based on amorphous indium-gallium-zinc-oxide thin film transistors was fabricated with a homogeneous-oxide structure for a multi-level-cell application. All oxide layers, i.e., tunneling layer, charge trapping layer, and blocking layer, were fabricated with Al2O3 films. The fabrication condition (including temperature and deposition method) of the charge trapping layer was different from those of the other oxide layers. This device demonstrated a considerable large memory window of 4 V between the states fully erased and programmed with the operation voltage less than 14 V. This kind of device shows a good prospect for multi-level-cell memory applications.
Nuclide Depletion Capabilities in the Shift Monte Carlo Code

DOE PAGES

Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...

2017-12-21

A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.
Scaling Irregular Applications through Data Aggregation and Software Multithreading

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morari, Alessandro; Tumeo, Antonino; Chavarría-Miranda, Daniel

Bioinformatics, data analytics, semantic databases, knowledge discovery are emerging high performance application areas that exploit dynamic, linked data structures such as graphs, unbalanced trees or unstructured grids. These data structures usually are very large, requiring significantly more memory than available on single shared memory systems. Additionally, these data structures are difficult to partition on distributed memory systems. They also present poor spatial and temporal locality, thus generating unpredictable memory and network accesses. The Partitioned Global Address Space (PGAS) programming model seems suitable for these applications, because it allows using a shared memory abstraction across distributed-memory clusters. However, current PGAS languagesmore » and libraries are built to target regular remote data accesses and block transfers. Furthermore, they usually rely on the Single Program Multiple Data (SPMD) parallel control model, which is not well suited to the fine grained, dynamic and unbalanced parallelism of irregular applications. In this paper we present {\\bf GMT} (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple fork/join parallelism and provides automatic load balancing on a per node basis. It implements multi-level aggregation and lightweight multithreading to maximize memory and network bandwidth with fine-grained data accesses and tolerate long data access latencies. A key innovation in the GMT runtime is its thread specialization (workers, helpers and communication threads) that realize the overall functionality. We compare our approach with other PGAS models, such as UPC running using GASNet, and hand-optimized MPI code on a set of typical large-scale irregular applications, demonstrating speedups of an order of magnitude.« less
The ability of multi-site, multi-depth sacral lateral branch blocks to anesthetize the sacroiliac joint complex.

PubMed

Dreyfuss, Paul; Henning, Troy; Malladi, Niriksha; Goldstein, Barry; Bogduk, Nikolai

2009-01-01

To determine the physiologic effectiveness of multi-site, multi-depth sacral lateral branch injections. Double-blind, randomized, placebo-controlled study. Outpatient pain management center. Twenty asymptomatic volunteers. The dorsal innervation to the sacroiliac joint (SIJ) is from the L5 dorsal ramus and the S1-3 lateral branches. Multi-site, multi-depth lateral branch blocks were developed to compensate for the complex regional anatomy that limited the effectiveness of single-site, single-depth lateral branch injections. Bilateral multi-site, multi-depth lateral branch green dye injections and subsequent dissection on two cadavers revealed a 91% accuracy with this technique. Session 1: 20 asymptomatic subjects had a 25-g spinal needle probe their interosseous (IO) and dorsal sacroiliac (DSI) ligaments. The inferior dorsal SIJ was entered and capsular distension with contrast medium was performed. Discomfort had to occur with each provocation maneuver and a contained arthrogram was necessary to continue in the study. Session 2: 1 week later; computer randomized, double-blind multi-site, multi-depth lateral branch blocks injections were performed. Ten subjects received active (bupivicaine 0.75%) and 10 subjects received sham (normal saline) multi-site, multi-depth lateral branch injections. Thirty minutes later, provocation testing was repeated with identical methodology used in session 1. Presence or absence of pain for ligamentous probing and SIJ capsular distension. Seventy percent of the active group had an insensate IO and DSI ligaments, and inferior dorsal SIJ vs 0-10% of the sham group. Twenty percent of the active vs 10% of the sham group did not feel repeat capsular distension. Six of seven subjects (86%) retained the ability to feel repeat capsular distension despite an insensate dorsal SIJ complex. Multi-site, multi-depth lateral branch blocks are physiologically effective at a rate of 70%. Multi-site, multi-depth lateral branch blocks do not effectively block the intra-articular portion of the SIJ. There is physiological evidence that the intra-articular portion of the SIJ is innervated from both ventral and dorsal sources. Comparative multi-site, multi-depth lateral branch blocks should be considered a potentially valuable tool to diagnose extra-articular SIJ pain and determine if lateral branch radiofrequency neurotomy may assist one with SIJ pain.
Multi-channel metabolic imaging, with SENSE reconstruction, of hyperpolarized [1- 13C] pyruvate in a live rat at 3.0 tesla on a clinical MR scanner

NASA Astrophysics Data System (ADS)

Tropp, James; Lupo, Janine M.; Chen, Albert; Calderon, Paul; McCune, Don; Grafendorfer, Thomas; Ozturk-Isik, Esin; Larson, Peder E. Z.; Hu, Simon; Yen, Yi-Fen; Robb, Fraser; Bok, Robert; Schulte, Rolf; Xu, Duan; Hurd, Ralph; Vigneron, Daniel; Nelson, Sarah

2011-01-01

We report metabolic images of 13C, following injection of a bolus of hyperpolarized [1-13C] pyruvate in a live rat. The data were acquired on a clinical scanner, using custom coils for volume transmission and array reception. Proton blocking of all carbon resonators enabled proton anatomic imaging with the system body coil, to allow for registration of anatomic and metabolic images, for which good correlation was achieved, with some anatomic features (kidney and heart) clearly visible in a carbon image, without reference to the corresponding proton image. Parallel imaging with sensitivity encoding was used to increase the spatial resolution in the SI direction of the rat. The signal to noise ratio in was in some instances unexpectedly high in the parallel images; variability of the polarization among different trials, plus partial volume effects, are noted as a possible cause of this.
Real-time simulation of large-scale neural architectures for visual features computation based on GPU.

PubMed

Chessa, Manuela; Bianchi, Valentina; Zampetti, Massimo; Sabatini, Silvio P; Solari, Fabio

2012-01-01

The intrinsic parallelism of visual neural architectures based on distributed hierarchical layers is well suited to be implemented on the multi-core architectures of modern graphics cards. The design strategies that allow us to optimally take advantage of such parallelism, in order to efficiently map on GPU the hierarchy of layers and the canonical neural computations, are proposed. Speciﬁcally, the advantages of a cortical map-like representation of the data are exploited. Moreover, a GPU implementation of a novel neural architecture for the computation of binocular disparity from stereo image pairs, based on populations of binocular energy neurons, is presented. The implemented neural model achieves good performances in terms of reliability of the disparity estimates and a near real-time execution speed, thus demonstrating the effectiveness of the devised design strategies. The proposed approach is valid in general, since the neural building blocks we implemented are a common basis for the modeling of visual neural functionalities.
Parallel Program Systems for the Analysis of Wave Processes in Elastic-Plastic, Granular, Porous and Multi-Blocky Media

NASA Astrophysics Data System (ADS)

Sadovskaya, Oxana; Sadovskii, Vladimir

2017-04-01

Under modeling the wave propagation processes in geomaterials (granular and porous media, soils and rocks) it is necessary to take into account the structural inhomogeneity of these materials. Parallel program systems for numerical solution of 2D and 3D problems of the dynamics of deformable media with constitutive relationships of rather general form on the basis of universal mathematical model describing small strains of elastic, elastic-plastic, granular and porous materials are worked out. In the case of an elastic material, the model is reduced to the system of equations, hyperbolic by Friedrichs, written in terms of velocities and stresses in a symmetric form. In the case of an elastic-plastic material, the model is a special formulation of the Prandtl-Reuss theory in the form of variational inequality with one-sided constraints on the stress tensor. Generalization of the model to describe granularity and the collapse of pores is obtained by means of the rheological approach, taking into account different resistance of materials to tension and compression. Rotational motion of particles in the material microstructure is considered within the framework of a mathematical model of the Cosserat continuum. Computational domain may have a blocky structure, composed of an arbitrary number of layers, strips in a layer and blocks in a strip from different materials with self-consistent curvilinear interfaces. At the external boundaries of computational domain the main types of dissipative boundary conditions in terms of velocities, stresses or mixed boundary conditions can be given. Shock-capturing algorithm is proposed for implementation of the model on supercomputers with cluster architecture. It is based on the two-cyclic splitting method with respect to spatial variables and the special procedures of the stresses correction to take into account plasticity, granularity or porosity of a material. An explicit monotone ENO-scheme is applied for solving one-dimensional systems of equations at the stages of splitting method. The parallelizing of computations is carried out using the MPI library and the SPMD technology. The data exchange between processors occurs at step "predictor" of the finite-difference scheme. Program systems allow simulate the propagation of waves produced by external mechanical effects in a medium, aggregated of arbitrary number of heterogeneous blocks. Some computations of dynamic problems with and without taking into account the moment properties of a material were performed on clusters of ICM SB RAS (Krasnoyarsk) and JSCC RAS (Moscow). Parallel program systems 2Dyn_Granular, 3Dyn_Granular, 2Dyn_Cosserat, 3Dyn_Cosserat and 2Dyn_Blocks_MPI for numerical solution of 2D and 3D elastic-plastic problems of the dynamics of granular media and problems of the Cosserat elasticity theory, as well as for modeling of the dynamic processes in multi-blocky media with pliant viscoelastic, porous and fluid-saturated interlayers on cluster systems were registered by Rospatent.
Multi-block sulfonated poly(phenylene) copolymer proton exchange membranes

DOEpatents

Fujimoto, Cy H [Albuquerque, NM; Hibbs, Michael [Albuquerque, NM; Ambrosini, Andrea [Albuquerque, NM

2012-02-07

Improved multi-block sulfonated poly(phenylene) copolymer compositions, methods of making the same, and their use as proton exchange membranes (PEM) in hydrogen fuel cells, direct methanol fuel cells, in electrode casting solutions and electrodes. The multi-block architecture has defined, controllable hydrophobic and hydrophilic segments. These improved membranes have better ion transport (proton conductivity) and water swelling properties.
Optimized collectives using a DMA on a parallel computer

DOEpatents

Chen, Dong [Croton On Hudson, NY; Gabor, Dozsa [Ardsley, NY; Giampapa, Mark E [Irvington, NY; Heidelberger,; Phillip, [Cortlandt Manor, NY

2011-02-08

Optimizing collective operations using direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. The byte counter includes at least a base address of memory and a byte count associated with a submessage. A byte counter associated with a submessage is monitored to determine whether at least a block of data of the submessage has been received. The block of data has a predetermined size, for example, a number of bytes. The block is processed when the block has been fully received, for example, when the byte count indicates all bytes of the block have been received. The monitoring and processing may continue for all blocks in all submessages in the message.
Array-based, parallel hierarchical mesh refinement algorithms for unstructured meshes

DOE PAGES

Ray, Navamita; Grindeanu, Iulian; Zhao, Xinglin; ...

2016-08-18

In this paper, we describe an array-based hierarchical mesh refinement capability through uniform refinement of unstructured meshes for efficient solution of PDE's using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial coarse mesh that can be used for a variety of purposes such as in multigrid solvers/preconditioners, to do solution convergence and verification studies and to improve overall parallel efficiency by decreasing I/O bandwidth requirements (by loading smaller meshes and in memory refinement). We also describe a high-order boundary reconstruction capability that can be used tomore » project the new points after refinement using high-order approximations instead of linear projection in order to minimize and provide more control on geometrical errors introduced by curved boundaries.The capability is developed under the parallel unstructured mesh framework "Mesh Oriented dAtaBase" (MOAB Tautges et al. (2004)). We describe the underlying data structures and algorithms to generate such hierarchies in parallel and present numerical results for computational efficiency and effect on mesh quality. Furthermore, we also present results to demonstrate the applicability of the developed capability to study convergence properties of different point projection schemes for various mesh hierarchies and to a multigrid finite-element solver for elliptic problems.« less
pyPaSWAS: Python-based multi-core CPU and GPU sequence alignment.

PubMed

Warris, Sven; Timal, N Roshan N; Kempenaar, Marcel; Poortinga, Arne M; van de Geest, Henri; Varbanescu, Ana L; Nap, Jan-Peter

2018-01-01

Our previously published CUDA-only application PaSWAS for Smith-Waterman (SW) sequence alignment of any type of sequence on NVIDIA-based GPUs is platform-specific and therefore adopted less than could be. The OpenCL language is supported more widely and allows use on a variety of hardware platforms. Moreover, there is a need to promote the adoption of parallel computing in bioinformatics by making its use and extension more simple through more and better application of high-level languages commonly used in bioinformatics, such as Python. The novel application pyPaSWAS presents the parallel SW sequence alignment code fully packed in Python. It is a generic SW implementation running on several hardware platforms with multi-core systems and/or GPUs that provides accurate sequence alignments that also can be inspected for alignment details. Additionally, pyPaSWAS support the affine gap penalty. Python libraries are used for automated system configuration, I/O and logging. This way, the Python environment will stimulate further extension and use of pyPaSWAS. pyPaSWAS presents an easy Python-based environment for accurate and retrievable parallel SW sequence alignments on GPUs and multi-core systems. The strategy of integrating Python with high-performance parallel compute languages to create a developer- and user-friendly environment should be considered for other computationally intensive bioinformatics algorithms.
Bi-level Multi-Source Learning for Heterogeneous Block-wise Missing Data

PubMed Central

Xiang, Shuo; Yuan, Lei; Fan, Wei; Wang, Yalin; Thompson, Paul M.; Ye, Jieping

2013-01-01

Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc., are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feature-pruning and data source selection are critical to learn interpretable models from high-dimensional data. Often, the data collected has block-wise missing entries. In the Alzheimer’s Disease Neuroimaging Initiative (ADNI), most subjects have MRI and genetic information, but only half have cerebrospinal fluid (CSF) measures, a different half has FDG-PET; only some have proteomic data. Here we propose how to effectively integrate information from multiple heterogeneous data sources when data is block-wise missing. We present a unified “bi-level” learning model for complete multi-source data, and extend it to incomplete data. Our major contributions are: (1) our proposed models unify feature-level and source-level analysis, including several existing feature learning approaches as special cases; (2) the model for incomplete data avoids imputing missing data and offers superior performance; it generalizes to other applications with block-wise missing data sources; (3) we present efficient optimization algorithms for modeling complete and incomplete data. We comprehensively evaluate the proposed models including all ADNI subjects with at least one of four data types at baseline: MRI, FDG-PET, CSF and proteomics. Our proposed models compare favorably with existing approaches. PMID:23988272
Code Optimization Techniques

DOE Office of Scientific and Technical Information (OSTI.GOV)

MAGEE,GLEN I.

Computers transfer data in a number of different ways. Whether through a serial port, a parallel port, over a modem, over an ethernet cable, or internally from a hard disk to memory, some data will be lost. To compensate for that loss, numerous error detection and correction algorithms have been developed. One of the most common error correction codes is the Reed-Solomon code, which is a special subset of BCH (Bose-Chaudhuri-Hocquenghem) linear cyclic block codes. In the AURA project, an unmanned aircraft sends the data it collects back to earth so it can be analyzed during flight and possible flightmore » modifications made. To counter possible data corruption during transmission, the data is encoded using a multi-block Reed-Solomon implementation with a possibly shortened final block. In order to maximize the amount of data transmitted, it was necessary to reduce the computation time of a Reed-Solomon encoding to three percent of the processor's time. To achieve such a reduction, many code optimization techniques were employed. This paper outlines the steps taken to reduce the processing time of a Reed-Solomon encoding and the insight into modern optimization techniques gained from the experience.« less
The characteristics and controlling factor of the multi-stage eogenetic karst in the Longwangmiao Foramtion in Lower Cambrian, western Central Yangtze Block, SW China

NASA Astrophysics Data System (ADS)

Lu, C.

2017-12-01

This study utilized field outcrops, thin sections, geochemical data, and GR logging curves to investigate the development model of paleokarst within the Longwangmiao Formation in the Lower Cambrian, western Central Yangtze Block, SW China. The Longwangmiao Formation, which belongs to a third-order sequence, consists of four forth-order sequences and is located in the uppermost part of the Lower Cambrian. The vertical variations of the δ13C and δ18O values indicate the existence of multi-stage eogenetic karst events. The eogenetic karst event in the uppermost part of the Longwangmiao Formation is recognized by the dripstones developed within paleocaves, vertical paleoweathering crust with four zones (bedrock, a weak weathering zone, an intense weathering zone and a solution collapsed zone), two generations of calcsparite cement showing bright luminescence and a zonation from nonluminescent to bright to nonluminescent, two types breccias (matrix-rich clast-supported chaotic breccia and matrix-supported chaotic breccia) and rundkarren. The episodic variations of stratiform dissolution vugs and breccias in vertical, and facies-controlled dissolution and filling features indicated the development of multi-stages eogenetic karst. The development of the paleokarst model is controlled by multi-level sea-level changes. The long eccentricity cycle dictates the fluctuations of the forth-order sea-level, generating multi-stage eogenetic karst events. The paleokarst model is an important step towards better understanding the link between the probably orbitally forced sea-level oscillations and eogenetic karst in the Lower Cambrian. According to this paleokarst model, hydrocarbon exploration should focus on both the karst highlands and the karst transitional zone.
Design of convolutional tornado code

NASA Astrophysics Data System (ADS)

Zhou, Hui; Yang, Yao; Gao, Hongmin; Tan, Lu

2017-09-01

As a linear block code, the traditional tornado (tTN) code is inefficient in burst-erasure environment and its multi-level structure may lead to high encoding/decoding complexity. This paper presents a convolutional tornado (cTN) code which is able to improve the burst-erasure protection capability by applying the convolution property to the tTN code, and reduce computational complexity by abrogating the multi-level structure. The simulation results show that cTN code can provide a better packet loss protection performance with lower computation complexity than tTN code.
Research on multi-level decision game strategy of electricity sales market considering ETS and block chain

NASA Astrophysics Data System (ADS)

Liu, Jinjie

2017-08-01

In order to fully consider the impact of future policies and technologies on the electricity sales market, improve the efficiency of electricity market operation, realize the dual goal of power reform and energy saving and emission reduction, this paper uses multi-level decision theory to put forward the double-layer game model under the consideration of ETS and block chain. We set the maximization of electricity sales profit as upper level objective and establish a game strategy model of electricity purchase; while we set maximization of user satisfaction as lower level objective and build a choice behavior model based on customer satisfaction. This paper applies the strategy to the simulation of a sales company's transaction, and makes a horizontal comparison of the same industry competitors as well as a longitudinal comparison of game strategies considering different factors. The results show that Double-layer game model is reasonable and effective, it can significantly improve the efficiency of the electricity sales companies and user satisfaction, while promoting new energy consumption and achieving energy-saving emission reduction.

Design and synthesis of diverse functional kinked nanowire structures for nanoelectronic bioprobes.

PubMed

Xu, Lin; Jiang, Zhe; Qing, Quan; Mai, Liqiang; Zhang, Qingjie; Lieber, Charles M

2013-02-13

Functional kinked nanowires (KNWs) represent a new class of nanowire building blocks, in which functional devices, for example, nanoscale field-effect transistors (nanoFETs), are encoded in geometrically controlled nanowire superstructures during synthesis. The bottom-up control of both structure and function of KNWs enables construction of spatially isolated point-like nanoelectronic probes that are especially useful for monitoring biological systems where finely tuned feature size and structure are highly desired. Here we present three new types of functional KNWs including (1) the zero-degree KNW structures with two parallel heavily doped arms of U-shaped structures with a nanoFET at the tip of the "U", (2) series multiplexed functional KNW integrating multi-nanoFETs along the arm and at the tips of V-shaped structures, and (3) parallel multiplexed KNWs integrating nanoFETs at the two tips of W-shaped structures. First, U-shaped KNWs were synthesized with separations as small as 650 nm between the parallel arms and used to fabricate three-dimensional nanoFET probes at least 3 times smaller than previous V-shaped designs. In addition, multiple nanoFETs were encoded during synthesis in one of the arms/tip of V-shaped and distinct arms/tips of W-shaped KNWs. These new multiplexed KNW structures were structurally verified by optical and electron microscopy of dopant-selective etched samples and electrically characterized using scanning gate microscopy and transport measurements. The facile design and bottom-up synthesis of these diverse functional KNWs provides a growing toolbox of building blocks for fabricating highly compact and multiplexed three-dimensional nanoprobes for applications in life sciences, including intracellular and deep tissue/cell recordings.
Efficient partitioning and assignment on programs for multiprocessor execution

NASA Technical Reports Server (NTRS)

Standley, Hilda M.

1993-01-01

The general problem studied is that of segmenting or partitioning programs for distribution across a multiprocessor system. Efficient partitioning and the assignment of program elements are of great importance since the time consumed in this overhead activity may easily dominate the computation, effectively eliminating any gains made by the use of the parallelism. In this study, the partitioning of sequentially structured programs (written in FORTRAN) is evaluated. Heuristics, developed for similar applications are examined. Finally, a model for queueing networks with finite queues is developed which may be used to analyze multiprocessor system architectures with a shared memory approach to the problem of partitioning. The properties of sequentially written programs form obstacles to large scale (at the procedure or subroutine level) parallelization. Data dependencies of even the minutest nature, reflecting the sequential development of the program, severely limit parallelism. The design of heuristic algorithms is tied to the experience gained in the parallel splitting. Parallelism obtained through the physical separation of data has seen some success, especially at the data element level. Data parallelism on a grander scale requires models that accurately reflect the effects of blocking caused by finite queues. A model for the approximation of the performance of finite queueing networks is developed. This model makes use of the decomposition approach combined with the efficiency of product form solutions.
Dissipation, Voltage Profile and Levy Dragon in a Special Ladder Network

ERIC Educational Resources Information Center

Ucak, C.

2009-01-01

A ladder network constructed by an elementary two-terminal network consisting of a parallel resistor-inductor block in series with a parallel resistor-capacitor block sometimes is said to have a non-dispersive dissipative response. This special ladder network is created iteratively by replacing the elementary two-terminal network in place of the…
Multi-threaded parallel simulation of non-local non-linear problems in ultrashort laser pulse propagation in the presence of plasma

NASA Astrophysics Data System (ADS)

Baregheh, Mandana; Mezentsev, Vladimir; Schmitz, Holger

2011-06-01

We describe a parallel multi-threaded approach for high performance modelling of wide class of phenomena in ultrafast nonlinear optics. Specific implementation has been performed using the highly parallel capabilities of a programmable graphics processor.
Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages

NASA Astrophysics Data System (ADS)

Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel

2018-01-01

This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
What is adaptive about adaptive decision making? A parallel constraint satisfaction account.

PubMed

Glöckner, Andreas; Hilbig, Benjamin E; Jekel, Marc

2014-12-01

There is broad consensus that human cognition is adaptive. However, the vital question of how exactly this adaptivity is achieved has remained largely open. Herein, we contrast two frameworks which account for adaptive decision making, namely broad and general single-mechanism accounts vs. multi-strategy accounts. We propose and fully specify a single-mechanism model for decision making based on parallel constraint satisfaction processes (PCS-DM) and contrast it theoretically and empirically against a multi-strategy account. To achieve sufficiently sensitive tests, we rely on a multiple-measure methodology including choice, reaction time, and confidence data as well as eye-tracking. Results show that manipulating the environmental structure produces clear adaptive shifts in choice patterns - as both frameworks would predict. However, results on the process level (reaction time, confidence), in information acquisition (eye-tracking), and from cross-predicting choice consistently corroborate single-mechanisms accounts in general, and the proposed parallel constraint satisfaction model for decision making in particular. Copyright © 2014 Elsevier B.V. All rights reserved.
A parallel algorithm for multi-level logic synthesis using the transduction method. M.S. Thesis

NASA Technical Reports Server (NTRS)

Lim, Chieng-Fai

1991-01-01

The Transduction Method has been shown to be a powerful tool in the optimization of multilevel networks. Many tools such as the SYLON synthesis system (X90), (CM89), (LM90) have been developed based on this method. A parallel implementation is presented of SYLON-XTRANS (XM89) on an eight processor Encore Multimax shared memory multiprocessor. It minimizes multilevel networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, good speedups and high processor efficiencies are achievable without sacrificing the resulting circuit quality.
Multi-Modulator for Bandwidth-Efficient Communication

NASA Technical Reports Server (NTRS)

Gray, Andrew; Lee, Dennis; Lay, Norman; Cheetham, Craig; Fong, Wai; Yeh, Pen-Shu; King, Robin; Ghuman, Parminder; Hoy, Scott; Fisher, Dave

2009-01-01

A modulator circuit board has recently been developed to be used in conjunction with a vector modulator to generate any of a large number of modulations for bandwidth-efficient radio transmission of digital data signals at rates than can exceed 100 Mb/s. The modulations include quadrature phaseshift keying (QPSK), offset quadrature phase-shift keying (OQPSK), Gaussian minimum-shift keying (GMSK), and octonary phase-shift keying (8PSK) with square-root raised-cosine pulse shaping. The figure is a greatly simplified block diagram showing the relationship between the modulator board and the rest of the transmitter. The role of the modulator board is to encode the incoming data stream and to shape the resulting pulses, which are fed as inputs to the vector modulator. The combination of encoding and pulse shaping in a given application is chosen to maximize the bandwidth efficiency. The modulator board includes gallium arsenide serial-to-parallel converters at its input end. A complementary metal oxide/semiconductor (CMOS) field-programmable gate array (FPGA) performs the coding and modulation computations and utilizes parallel processing in doing so. The results of the parallel computation are combined and converted to pulse waveforms by use of gallium arsenide parallel-to-serial converters integrated with digital-to-analog converters. Without changing the hardware, one can configure the modulator to produce any of the designed combinations of coding and modulation by loading the appropriate bit configuration file into the FPGA.
Standardized Modular Power Interfaces for Future Space Explorations Missions

NASA Technical Reports Server (NTRS)

Oeftering, Richard

2015-01-01

Earlier studies show that future human explorations missions are composed of multi-vehicle assemblies with interconnected electric power systems. Some vehicles are often intended to serve as flexible multi-purpose or multi-mission platforms. This drives the need for power architectures that can be reconfigured to support this level of flexibility. Power system developmental costs can be reduced, program wide, by utilizing a common set of modular building blocks. Further, there are mission operational and logistics cost benefits of using a common set of modular spares. These benefits are the goals of the Advanced Exploration Systems (AES) Modular Power System (AMPS) project. A common set of modular blocks requires a substantial level of standardization in terms of the Electrical, Data System, and Mechanical interfaces. The AMPS project is developing a set of proposed interface standards that will provide useful guidance for modular hardware developers but not needlessly constrain technology options, or limit future growth in capability. In 2015 the AMPS project focused on standardizing the interfaces between the elements of spacecraft power distribution and energy storage. The development of the modular power standard starts with establishing mission assumptions and ground rules to define design application space. The standards are defined in terms of AMPS objectives including Commonality, Reliability-Availability, Flexibility-Configurability and Supportability-Reusability. The proposed standards are aimed at assembly and sub-assembly level building blocks. AMPS plans to adopt existing standards for spacecraft command and data, software, network interfaces, and electrical power interfaces where applicable. Other standards including structural encapsulation, heat transfer, and fluid transfer, are governed by launch and spacecraft environments and bound by practical limitations of weight and volume. Developing these mechanical interface standards is more difficult but an essential part of defining physical building blocks of modular power. This presentation describes the AMPS projects progress towards standardized modular power interfaces.
Multi-Bit Quantum Private Query

NASA Astrophysics Data System (ADS)

Shi, Wei-Xu; Liu, Xing-Tong; Wang, Jian; Tang, Chao-Jing

2015-09-01

Most of the existing Quantum Private Queries (QPQ) protocols provide only single-bit queries service, thus have to be repeated several times when more bits are retrieved. Wei et al.'s scheme for block queries requires a high-dimension quantum key distribution system to sustain, which is still restricted in the laboratory. Here, based on Markus Jakobi et al.'s single-bit QPQ protocol, we propose a multi-bit quantum private query protocol, in which the user can get access to several bits within one single query. We also extend the proposed protocol to block queries, using a binary matrix to guard database security. Analysis in this paper shows that our protocol has better communication complexity, implementability and can achieve a considerable level of security.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

PubMed

Furuta, Takuya; Sato, Tatsuhiko

2015-01-01

Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems

NASA Technical Reports Server (NTRS)

Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.

2000-01-01

The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
Parallel design of JPEG-LS encoder on graphics processing units

NASA Astrophysics Data System (ADS)

Duan, Hao; Fang, Yong; Huang, Bormin

2012-01-01

With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.
Cathepsin S-cleavable, multi-block HPMA copolymers for improved SPECT/CT imaging of pancreatic cancer.

PubMed

Fan, Wei; Shi, Wen; Zhang, Wenting; Jia, Yinnong; Zhou, Zhengyuan; Brusnahan, Susan K; Garrison, Jered C

2016-10-01

This work continues our efforts to improve the diagnostic and radiotherapeutic effectiveness of nanomedicine platforms by developing approaches to reduce the non-target accumulation of these agents. Herein, we developed multi-block HPMA copolymers with backbones that are susceptible to cleavage by cathepsin S, a protease that is abundantly expressed in tissues of the mononuclear phagocyte system (MPS). Specifically, a bis-thiol terminated HPMA telechelic copolymer containing 1,4,7,10-tetraazacyclododecane-1,4,7,10-tetraacetic acid (DOTA) was synthesized by reversible addition-fragmentation chain transfer (RAFT) polymerization. Three maleimide modified linkers with different sequences, including cathepsin S degradable oligopeptide, scramble oligopeptide and oligo ethylene glycol, were subsequently synthesized and used for the extension of the HPMA copolymers by thiol-maleimide click chemistry. All multi-block HPMA copolymers could be labeled by (177)Lu with high labeling efficiency and exhibited high serum stability. In vitro cleavage studies demonstrated highly selective and efficient cathepsin S mediated cleavage of the cathepsin S-susceptible multi-block HPMA copolymer. A modified multi-block HPMA copolymer series capable of Förster Resonance Energy Transfer (FRET) was utilized to investigate the rate of cleavage of the multi-block HPMA copolymers in monocyte-derived macrophages. Confocal imaging and flow cytometry studies revealed substantially higher rates of cleavage for the multi-block HPMA copolymers containing the cathepsin S-susceptible linker. The efficacy of the cathepsin S-cleavable multi-block HPMA copolymer was further examined using an in vivo model of pancreatic ductal adenocarcinoma. Based on the biodistribution and SPECT/CT studies, the copolymer extended with the cathepsin S susceptible linker exhibited significantly faster clearance and lower non-target retention without compromising tumor targeting. Overall, these results indicate that exploitation of the cathepsin S activity in MPS tissues can be utilized to substantially lower non-target accumulation, suggesting this is a promising approach for the development of diagnostic and radiotherapeutic nanomedicine platforms. Copyright © 2016 Elsevier Ltd. All rights reserved.
Nano-structured polymer composites and process for preparing same

DOEpatents

Hillmyer, Marc; Chen, Liang

2013-04-16

A process for preparing a polymer composite that includes reacting (a) a multi-functional monomer and (b) a block copolymer comprising (i) a first block and (ii) a second block that includes a functional group capable of reacting with the multi-functional monomer, to form a crosslinked, nano-structured, bi-continuous composite. The composite includes a continuous matrix phase and a second continuous phase comprising the first block of the block copolymer.
Autoplan: A self-processing network model for an extended blocks world planning environment

NASA Technical Reports Server (NTRS)

Dautrechy, C. Lynne; Reggia, James A.; Mcfadden, Frank

1990-01-01

Self-processing network models (neural/connectionist models, marker passing/message passing networks, etc.) are currently undergoing intense investigation for a variety of information processing applications. These models are potentially very powerful in that they support a large amount of explicit parallel processing, and they cleanly integrate high level and low level information processing. However they are currently limited by a lack of understanding of how to apply them effectively in many application areas. The formulation of self-processing network methods for dynamic, reactive planning is studied. The long-term goal is to formulate robust, computationally effective information processing methods for the distributed control of semiautonomous exploration systems, e.g., the Mars Rover. The current research effort is focusing on hierarchical plan generation, execution and revision through local operations in an extended blocks world environment. This scenario involves many challenging features that would be encountered in a real planning and control environment: multiple simultaneous goals, parallel as well as sequential action execution, action sequencing determined not only by goals and their interactions but also by limited resources (e.g., three tasks, two acting agents), need to interpret unanticipated events and react appropriately through replanning, etc.
Options for Parallelizing a Planning and Scheduling Algorithm

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.

2011-01-01

Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.
High Performance Analytics with the R3-Cache

NASA Astrophysics Data System (ADS)

Eavis, Todd; Sayeed, Ruhan

Contemporary data warehouses now represent some of the world’s largest databases. As these systems grow in size and complexity, however, it becomes increasingly difficult for brute force query processing approaches to meet the performance demands of end users. Certainly, improved indexing and more selective view materialization are helpful in this regard. Nevertheless, with warehouses moving into the multi-terabyte range, it is clear that the minimization of external memory accesses must be a primary performance objective. In this paper, we describe the R 3-cache, a natively multi-dimensional caching framework designed specifically to support sophisticated warehouse/OLAP environments. R 3-cache is based upon an in-memory version of the R-tree that has been extended to support buffer pages rather than disk blocks. A key strength of the R 3-cache is that it is able to utilize multi-dimensional fragments of previous query results so as to significantly minimize the frequency and scale of disk accesses. Moreover, the new caching model directly accommodates the standard relational storage model and provides mechanisms for pro-active updates that exploit the existence of query “hot spots”. The current prototype has been evaluated as a component of the Sidera DBMS, a “shared nothing” parallel OLAP server designed for multi-terabyte analytics. Experimental results demonstrate significant performance improvements relative to simpler alternatives.
Template based parallel checkpointing in a massively parallel computer system

DOEpatents

Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN

2009-01-13

A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
Multi-GPU parallel algorithm design and analysis for improved inversion of probability tomography with gravity gradiometry data

NASA Astrophysics Data System (ADS)

Hou, Zhenlong; Huang, Danian

2017-09-01

In this paper, we make a study on the inversion of probability tomography (IPT) with gravity gradiometry data at first. The space resolution of the results is improved by multi-tensor joint inversion, depth weighting matrix and the other methods. Aiming at solving the problems brought by the big data in the exploration, we present the parallel algorithm and the performance analysis combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on Graphics Processing Unit (GPU) accelerating. In the test of the synthetic model and real data from Vinton Dome, we get the improved results. It is also proved that the improved inversion algorithm is effective and feasible. The performance of parallel algorithm we designed is better than the other ones with CUDA. The maximum speedup could be more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger scale of data and the new analysis method is practical.

Numerical Upscaling of Solute Transport in Fractured Porous Media Based on Flow Aligned Blocks

NASA Astrophysics Data System (ADS)

Leube, P.; Nowak, W.; Sanchez-Vila, X.

2013-12-01

High-contrast or fractured-porous media (FPM) pose one of the largest unresolved challenges for simulating large hydrogeological systems. The high contrast in advective transport between fast conduits and low-permeability rock matrix, including complex mass transfer processes, leads to the typical complex characteristics of early bulk arrivals and long tailings. Adequate direct representation of FPM requires enormous numerical resolutions. For large scales, e.g. the catchment scale, and when allowing for uncertainty in the fracture network architecture or in matrix properties, computational costs quickly reach an intractable level. In such cases, multi-scale simulation techniques have become useful tools. They allow decreasing the complexity of models by aggregating and transferring their parameters to coarser scales and so drastically reduce the computational costs. However, these advantages come at a loss of detail and accuracy. In this work, we develop and test a new multi-scale or upscaled modeling approach based on block upscaling. The novelty is that individual blocks are defined by and aligned with the local flow coordinates. We choose a multi-rate mass transfer (MRMT) model to represent the remaining sub-block non-Fickian behavior within these blocks on the coarse scale. To make the scale transition simple and to save computational costs, we capture sub-block features by temporal moments (TM) of block-wise particle arrival times to be matched with the MRMT model. By predicting spatial mass distributions of injected tracers in a synthetic test scenario, our coarse-scale solution matches reasonably well with the corresponding fine-scale reference solution. For predicting higher TM-orders (such as arrival time and effective dispersion), the prediction accuracy steadily decreases. This is compensated to some extent by the MRMT model. If the MRMT model becomes too complex, it loses its effect. We also found that prediction accuracy is sensitive to the choice of the effective dispersion coefficients and on the block resolution. A key advantage of the flow-aligned blocks is that the small-scale velocity field is reproduced quite accurately on the block-scale through their flow alignment. Thus, the block-scale transverse dispersivities remain in the similar magnitude as local ones, and they do not have to represent macroscopic uncertainty. Also, the flow-aligned blocks minimize numerical dispersion when solving the large-scale transport problem.
Who multi-tasks and why? Multi-tasking ability, perceived multi-tasking ability, impulsivity, and sensation seeking.

PubMed

Sanbonmatsu, David M; Strayer, David L; Medeiros-Ward, Nathan; Watson, Jason M

2013-01-01

The present study examined the relationship between personality and individual differences in multi-tasking ability. Participants enrolled at the University of Utah completed measures of multi-tasking activity, perceived multi-tasking ability, impulsivity, and sensation seeking. In addition, they performed the Operation Span in order to assess their executive control and actual multi-tasking ability. The findings indicate that the persons who are most capable of multi-tasking effectively are not the persons who are most likely to engage in multiple tasks simultaneously. To the contrary, multi-tasking activity as measured by the Media Multitasking Inventory and self-reported cell phone usage while driving were negatively correlated with actual multi-tasking ability. Multi-tasking was positively correlated with participants' perceived ability to multi-task ability which was found to be significantly inflated. Participants with a strong approach orientation and a weak avoidance orientation--high levels of impulsivity and sensation seeking--reported greater multi-tasking behavior. Finally, the findings suggest that people often engage in multi-tasking because they are less able to block out distractions and focus on a singular task. Participants with less executive control--low scorers on the Operation Span task and persons high in impulsivity--tended to report higher levels of multi-tasking activity.
Topography Modeling in Atmospheric Flows Using the Immersed Boundary Method

NASA Technical Reports Server (NTRS)

Ackerman, A. S.; Senocak, I.; Mansour, N. N.; Stevens, D. E.

2004-01-01

Numerical simulation of flow over complex geometry needs accurate and efficient computational methods. Different techniques are available to handle complex geometry. The unstructured grid and multi-block body-fitted grid techniques have been widely adopted for complex geometry in engineering applications. In atmospheric applications, terrain fitted single grid techniques have found common use. Although these are very effective techniques, their implementation, coupling with the flow algorithm, and efficient parallelization of the complete method are more involved than a Cartesian grid method. The grid generation can be tedious and one needs to pay special attention in numerics to handle skewed cells for conservation purposes. Researchers have long sought for alternative methods to ease the effort involved in simulating flow over complex geometry.
Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shrestha, Sunil; Manzano Franco, Joseph B.; Marquez, Andres

In this paper, we have developed a novel methodology that takes into consideration multithreaded many-core designs to better utilize memory/processing resources and improve memory residence on tileable applications. It takes advantage of polyhedral analysis and transformation in the form of PLUTO, combined with a highly optimized finegrain tile runtime to exploit parallelism at all levels. The main contributions of this paper include the introduction of multi-hierarchical tiling techniques that increases intra tile parallelism; and a data-flow inspired runtime library that allows the expression of parallel tiles with an efficient synchronization registry. Our current implementation shows performance improvements on an Intelmore » Xeon Phi board up to 32.25% against instances produced by state-of-the-art compiler frameworks for selected stencil applications.« less
Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-byblocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented onmore » both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Choleskyby- blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.« less
An interactive multi-block grid generation system

NASA Technical Reports Server (NTRS)

Kao, T. J.; Su, T. Y.; Appleby, Ruth

1992-01-01

A grid generation procedure combining interactive and batch grid generation programs was put together to generate multi-block grids for complex aircraft configurations. The interactive section provides the tools for 3D geometry manipulation, surface grid extraction, boundary domain construction for 3D volume grid generation, and block-block relationships and boundary conditions for flow solvers. The procedure improves the flexibility and quality of grid generation to meet the design/analysis requirements.
Multilevel Parallelization of AutoDock 4.2.

PubMed

Norgan, Andrew P; Coffman, Paul K; Kocher, Jean-Pierre A; Katzmann, David J; Sosa, Carlos P

2011-04-28

Virtual (computational) screening is an increasingly important tool for drug discovery. AutoDock is a popular open-source application for performing molecular docking, the prediction of ligand-receptor interactions. AutoDock is a serial application, though several previous efforts have parallelized various aspects of the program. In this paper, we report on a multi-level parallelization of AutoDock 4.2 (mpAD4). Using MPI and OpenMP, AutoDock 4.2 was parallelized for use on MPI-enabled systems and to multithread the execution of individual docking jobs. In addition, code was implemented to reduce input/output (I/O) traffic by reusing grid maps at each node from docking to docking. Performance of mpAD4 was examined on two multiprocessor computers. Using MPI with OpenMP multithreading, mpAD4 scales with near linearity on the multiprocessor systems tested. In situations where I/O is limiting, reuse of grid maps reduces both system I/O and overall screening time. Multithreading of AutoDock's Lamarkian Genetic Algorithm with OpenMP increases the speed of execution of individual docking jobs, and when combined with MPI parallelization can significantly reduce the execution time of virtual screens. This work is significant in that mpAD4 speeds the execution of certain molecular docking workloads and allows the user to optimize the degree of system-level (MPI) and node-level (OpenMP) parallelization to best fit both workloads and computational resources.
Scalable non-negative matrix tri-factorization.

PubMed

Čopar, Andrej; Žitnik, Marinka; Zupan, Blaž

2017-01-01

Matrix factorization is a well established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization about data residing in one latent space. Matrix tri-factorization solves this by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining. We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100-times faster than its single-processor counterpart. A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.
Field Trials of the Multi-Source Approach for Resistivity and Induced Polarization Data Acquisition

NASA Astrophysics Data System (ADS)

LaBrecque, D. J.; Morelli, G.; Fischanger, F.; Lamoureux, P.; Brigham, R.

2013-12-01

Implementing systems of distributed receivers and transmitters for resistivity and induced polarization data is an almost inevitable result of the availability of wireless data communication modules and GPS modules offering precise timing and instrument locations. Such systems have a number of advantages; for example, they can be deployed around obstacles such as rivers, canyons, or mountains which would be difficult with traditional 'hard-wired' systems. However, deploying a system of identical, small, battery powered, transceivers, each capable of injecting a known current and measuring the induced potential has an additional and less obvious advantage in that multiple units can inject current simultaneously. The original purpose for using multiple simultaneous current sources (multi-source) was to increase signal levels. In traditional systems, to double the received signal you inject twice the current which requires you to apply twice the voltage and thus four times the power. Alternatively, one approach to increasing signal levels for large-scale surveys collected using small, battery powered transceivers is it to allow multiple units to transmit in parallel. In theory, using four 400 watt transmitters on separate, parallel dipoles yields roughly the same signal as a single 6400 watt transmitter. Furthermore, implementing the multi-source approach creates the opportunity to apply more complex current flow patterns than simple, parallel dipoles. For a perfect, noise-free system, multi-sources adds no new information to a data set that contains a comprehensive set of data collected using single sources. However, for realistic, noisy systems, it appears that multi-source data can substantially impact survey results. In preliminary model studies, the multi-source data produced such startling improvements in subsurface images that even the authors questioned their veracity. Between December of 2012 and July of 2013, we completed multi-source surveys at five sites with depths of exploration ranging from 150 to 450 m. The sites included shallow geothermal sites near Reno Nevada, Pomarance Italy, and Volterra Italy; a mineral exploration site near Timmins Quebec; and a landslide investigation near Vajont Dam in northern Italy. These sites provided a series of challenges in survey design and deployment including some extremely difficult terrain and a broad range of background resistivity and induced values. Despite these challenges, comparison of multi-source results to resistivity and induced polarization data collection with more traditional methods support the thesis that the multi-source approach is capable of providing substantial improvements in both depth of penetration and resolution over conventional approaches.
Comparative Implementation of High Performance Computing for Power System Dynamic Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jin, Shuangshuang; Huang, Zhenyu; Diao, Ruisheng

Dynamic simulation for transient stability assessment is one of the most important, but intensive, computations for power system planning and operation. Present commercial software is mainly designed for sequential computation to run a single simulation, which is very time consuming with a single processer. The application of High Performance Computing (HPC) to dynamic simulations is very promising in accelerating the computing process by parallelizing its kernel algorithms while maintaining the same level of computation accuracy. This paper describes the comparative implementation of four parallel dynamic simulation schemes in two state-of-the-art HPC environments: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP).more » These implementations serve to match the application with dedicated multi-processor computing hardware and maximize the utilization and benefits of HPC during the development process.« less
Aerodynamic simulation on massively parallel systems

NASA Technical Reports Server (NTRS)

Haeuser, Jochem; Simon, Horst D.

1992-01-01

This paper briefly addresses the computational requirements for the analysis of complete configurations of aircraft and spacecraft currently under design to be used for advanced transportation in commercial applications as well as in space flight. The discussion clearly shows that massively parallel systems are the only alternative which is both cost effective and on the other hand can provide the necessary TeraFlops, needed to satisfy the narrow design margins of modern vehicles. It is assumed that the solution of the governing physical equations, i.e., the Navier-Stokes equations which may be complemented by chemistry and turbulence models, is done on multiblock grids. This technique is situated between the fully structured approach of classical boundary fitted grids and the fully unstructured tetrahedra grids. A fully structured grid best represents the flow physics, while the unstructured grid gives best geometrical flexibility. The multiblock grid employed is structured within a block, but completely unstructured on the block level. While a completely unstructured grid is not straightforward to parallelize, the above mentioned multiblock grid is inherently parallel, in particular for multiple instruction multiple datastream (MIMD) machines. In this paper guidelines are provided for setting up or modifying an existing sequential code so that a direct parallelization on a massively parallel system is possible. Results are presented for three parallel systems, namely the Intel hypercube, the Ncube hypercube, and the FPS 500 system. Some preliminary results for an 8K CM2 machine will also be mentioned. The code run is the two dimensional grid generation module of Grid, which is a general two dimensional and three dimensional grid generation code for complex geometries. A system of nonlinear Poisson equations is solved. This code is also a good testcase for complex fluid dynamics codes, since the same datastructures are used. All systems provided good speedups, but message passing MIMD systems seem to be best suited for large miltiblock applications.
A novel parallel architecture for local histogram equalization

NASA Astrophysics Data System (ADS)

Ohannessian, Mesrob I.; Choueiter, Ghinwa F.; Diab, Hassan

2005-07-01

Local histogram equalization is an image enhancement algorithm that has found wide application in the pre-processing stage of areas such as computer vision, pattern recognition and medical imaging. The computationally intensive nature of the procedure, however, is a main limitation when real time interactive applications are in question. This work explores the possibility of performing parallel local histogram equalization, using an array of special purpose elementary processors, through an HDL implementation that targets FPGA or ASIC platforms. A novel parallelization scheme is presented and the corresponding architecture is derived. The algorithm is reduced to pixel-level operations. Processing elements are assigned image blocks, to maintain a reasonable performance-cost ratio. To further simplify both processor and memory organizations, a bit-serial access scheme is used. A brief performance assessment is provided to illustrate and quantify the merit of the approach.
Multicore Challenges and Benefits for High Performance Scientific Computing

DOE PAGES

Nielsen, Ida M. B.; Janssen, Curtis L.

2008-01-01

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexitymore » of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.« less
Parallel Multi-Step/Multi-Rate Integration of Two-Time Scale Dynamic Systems

NASA Technical Reports Server (NTRS)

Chang, Johnny T.; Ploen, Scott R.; Sohl, Garett. A,; Martin, Bryan J.

2004-01-01

Increasing demands on the fidelity of simulations for real-time and high-fidelity simulations are stressing the capacity of modern processors. New integration techniques are required that provide maximum efficiency for systems that are parallelizable. However many current techniques make assumptions that are at odds with non-cascadable systems. A new serial multi-step/multi-rate integration algorithm for dual-timescale continuous state systems is presented which applies to these systems, and is extended to a parallel multi-step/multi-rate algorithm. The superior performance of both algorithms is demonstrated through a representative example.
On codes with multi-level error-correction capabilities

NASA Technical Reports Server (NTRS)

Lin, Shu

1987-01-01

In conventional coding for error control, all the information symbols of a message are regarded equally significant, and hence codes are devised to provide equal protection for each information symbol against channel errors. However, in some occasions, some information symbols in a message are more significant than the other symbols. As a result, it is desired to devise codes with multilevel error-correcting capabilities. Another situation where codes with multi-level error-correcting capabilities are desired is in broadcast communication systems. An m-user broadcast channel has one input and m outputs. The single input and each output form a component channel. The component channels may have different noise levels, and hence the messages transmitted over the component channels require different levels of protection against errors. Block codes with multi-level error-correcting capabilities are also known as unequal error protection (UEP) codes. Structural properties of these codes are derived. Based on these structural properties, two classes of UEP codes are constructed.
An object-oriented approach for parallel self adaptive mesh refinement on block structured grids

NASA Technical Reports Server (NTRS)

Lemke, Max; Witsch, Kristian; Quinlan, Daniel

1993-01-01

Self-adaptive mesh refinement dynamically matches the computational demands of a solver for partial differential equations to the activity in the application's domain. In this paper we present two C++ class libraries, P++ and AMR++, which significantly simplify the development of sophisticated adaptive mesh refinement codes on (massively) parallel distributed memory architectures. The development is based on our previous research in this area. The C++ class libraries provide abstractions to separate the issues of developing parallel adaptive mesh refinement applications into those of parallelism, abstracted by P++, and adaptive mesh refinement, abstracted by AMR++. P++ is a parallel array class library to permit efficient development of architecture independent codes for structured grid applications, and AMR++ provides support for self-adaptive mesh refinement on block-structured grids of rectangular non-overlapping blocks. Using these libraries, the application programmers' work is greatly simplified to primarily specifying the serial single grid application and obtaining the parallel and self-adaptive mesh refinement code with minimal effort. Initial results for simple singular perturbation problems solved by self-adaptive multilevel techniques (FAC, AFAC), being implemented on the basis of prototypes of the P++/AMR++ environment, are presented. Singular perturbation problems frequently arise in large applications, e.g. in the area of computational fluid dynamics. They usually have solutions with layers which require adaptive mesh refinement and fast basic solvers in order to be resolved efficiently.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Biswas, Rupak

1999-01-01

The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Parallelization strategies for continuum-generalized method of moments on the multi-thread systems

NASA Astrophysics Data System (ADS)

Bustamam, A.; Handhika, T.; Ernastuti, Kerami, D.

2017-07-01

Continuum-Generalized Method of Moments (C-GMM) covers the Generalized Method of Moments (GMM) shortfall which is not as efficient as Maximum Likelihood estimator by using the continuum set of moment conditions in a GMM framework. However, this computation would take a very long time since optimizing regularization parameter. Unfortunately, these calculations are processed sequentially whereas in fact all modern computers are now supported by hierarchical memory systems and hyperthreading technology, which allowing for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on the multi-thread systems. First, parallel regions are detected for the original C-GMM algorithm. There are two parallel regions in the original C-GMM algorithm, that are contributed significantly to the reduction of computational time: the outer-loop and the inner-loop. Furthermore, this parallel algorithm will be implemented with standard shared-memory application programming interface, i.e. Open Multi-Processing (OpenMP). The experiment shows that the outer-loop parallelization is the best strategy for any number of observations.
Research in computer science

NASA Technical Reports Server (NTRS)

Ortega, J. M.

1986-01-01

Various graduate research activities in the field of computer science are reported. Among the topics discussed are: (1) failure probabilities in multi-version software; (2) Gaussian Elimination on parallel computers; (3) three dimensional Poisson solvers on parallel/vector computers; (4) automated task decomposition for multiple robot arms; (5) multi-color incomplete cholesky conjugate gradient methods on the Cyber 205; and (6) parallel implementation of iterative methods for solving linear equations.
Obsessive-compulsive tendencies are associated with a focused information processing strategy.

PubMed

Soref, Assaf; Dar, Reuven; Argov, Galit; Meiran, Nachshon

2008-12-01

The study examined the hypothesis that obsessive-compulsive (OC) tendencies are related to a reliance on focused and serial rather than a parallel, speed-oriented information processing style. Ten students with high OC tendencies and 10 students with low OC tendencies performed the flanker task, in which they were required to quickly classify a briefly presented target letter (S or H) that was flanked by compatible (e.g., SSSSS) or incompatible (e.g., HHSHH) noise letters. Participants received 4 blocks of 100 trials each, two with 50% compatible trials and two with 80% compatible trials and were informed of the probability of compatible trials before the beginning of each block. As predicted, high OC participants, as compared to low OC participants, had slower overall reaction time (RT) and lower tendency for parallel processing (defined as incompatible trials RT minus compatible trials RT). Low, more than high OC participants tended to adjust their focused/parallel processing including a shift towards parallel processing in blocks with 80% compatible trials and in trials following compatible trials. Implications of these results to the cognitive theory and therapy of OCD are discussed.

Parallel deterministic neutronics with AMR in 3D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Clouse, C.; Ferguson, J.; Hendrickson, C.

1997-12-31

AMTRAN, a three dimensional Sn neutronics code with adaptive mesh refinement (AMR) has been parallelized over spatial domains and energy groups and runs on the Meiko CS-2 with MPI message passing. Block refined AMR is used with linear finite element representations for the fluxes, which allows for a straight forward interpretation of fluxes at block interfaces with zoning differences. The load balancing algorithm assumes 8 spatial domains, which minimizes idle time among processors.
System-level analysis and design for RGB-NIR CMOS camera

NASA Astrophysics Data System (ADS)

Geelen, Bert; Spooren, Nick; Tack, Klaas; Lambrechts, Andy; Jayapala, Murali

2017-02-01

This paper presents system-level analysis of a sensor capable of simultaneously acquiring both standard absorption based RGB color channels (400-700nm, 75nm FWHM), as well as an additional NIR channel (central wavelength: 808 nm, FWHM: 30nm collimated light). Parallel acquisition of RGB and NIR info on the same CMOS image sensor is enabled by monolithic pixel-level integration of both a NIR pass thin film filter and NIR blocking filters for the RGB channels. This overcomes the need for a standard camera-level NIR blocking filter to remove the NIR leakage present in standard RGB absorption filters from 700-1000nm. Such a camera-level NIR blocking filter would inhibit the acquisition of the NIR channel on the same sensor. Thin film filters do not operate in isolation. Rather, their performance is influenced by the system context in which they operate. The spectral distribution of light arriving at the photo diode is shaped a.o. by the illumination spectral profile, optical component transmission characteristics and sensor quantum efficiency. For example, knowledge of a low quantum efficiency (QE) of the CMOS image sensor above 800nm may reduce the filter's blocking requirements and simplify the filter structure. Similarly, knowledge of the incoming light angularity as set by the objective lens' F/# and exit pupil location may be taken into account during the thin film's optimization. This paper demonstrates how knowledge of the application context can facilitate filter design and relax design trade-offs and presents experimental results.
Cloud computing-based TagSNP selection algorithm for human genome data.

PubMed

Hung, Che-Lun; Chen, Wen-Pei; Hua, Guan-Jie; Zheng, Huiru; Tsai, Suh-Jen Jane; Lin, Yaw-Ling

2015-01-05

Single nucleotide polymorphisms (SNPs) play a fundamental role in human genetic variation and are used in medical diagnostics, phylogeny construction, and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. Haplotypes are regions of linked genetic variants that are closely spaced on the genome and tend to be inherited together. Genetics research has revealed SNPs within certain haplotype blocks that introduce few distinct common haplotypes into most of the population. Haplotype block structures are used in association-based methods to map disease genes. In this paper, we propose an efficient algorithm for identifying haplotype blocks in the genome. In chromosomal haplotype data retrieved from the HapMap project website, the proposed algorithm identified longer haplotype blocks than an existing algorithm. To enhance its performance, we extended the proposed algorithm into a parallel algorithm that copies data in parallel via the Hadoop MapReduce framework. The proposed MapReduce-paralleled combinatorial algorithm performed well on real-world data obtained from the HapMap dataset; the improvement in computational efficiency was proportional to the number of processors used.
Cloud Computing-Based TagSNP Selection Algorithm for Human Genome Data

PubMed Central

Hung, Che-Lun; Chen, Wen-Pei; Hua, Guan-Jie; Zheng, Huiru; Tsai, Suh-Jen Jane; Lin, Yaw-Ling

2015-01-01

Single nucleotide polymorphisms (SNPs) play a fundamental role in human genetic variation and are used in medical diagnostics, phylogeny construction, and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. Haplotypes are regions of linked genetic variants that are closely spaced on the genome and tend to be inherited together. Genetics research has revealed SNPs within certain haplotype blocks that introduce few distinct common haplotypes into most of the population. Haplotype block structures are used in association-based methods to map disease genes. In this paper, we propose an efficient algorithm for identifying haplotype blocks in the genome. In chromosomal haplotype data retrieved from the HapMap project website, the proposed algorithm identified longer haplotype blocks than an existing algorithm. To enhance its performance, we extended the proposed algorithm into a parallel algorithm that copies data in parallel via the Hadoop MapReduce framework. The proposed MapReduce-paralleled combinatorial algorithm performed well on real-world data obtained from the HapMap dataset; the improvement in computational efficiency was proportional to the number of processors used. PMID:25569088
APPARATUS FOR PRODUCING IONS OF VAPORIZABLE MATERIALS

DOEpatents

Starr, C.

1957-11-19

This patent relates to electronic discharge devices used as ion sources, and in particular describes an ion source for application in a calutron. The source utilizes two cathodes disposed at opposite ends of a longitudinal opening in an arc block fed with vaporized material. A magnetic field is provided parallel to the length of the arc block opening. The electrons from the cathodes are directed through slits in collimating electrodes into the arc block parallel to the magnetic field and cause an arc discharge to occur between the cathodes, as the arc block and collimating electrodes are at a positive potential with respect to the cathode. The ions are withdrawn by suitable electrodes disposed opposite the arc block opening. When such an ion source is used in a calutron, an arc discharge of increased length may be utilized, thereby increasing the efficiency and economy of operation.
Micropore extrusion-induced alignment transition from perpendicular to parallel of cylindrical domains in block copolymers.

PubMed

Qu, Ting; Zhao, Yongbin; Li, Zongbo; Wang, Pingping; Cao, Shubo; Xu, Yawei; Li, Yayuan; Chen, Aihua

2016-02-14

The orientation transition from perpendicular to parallel alignment of PEO cylindrical domains of PEO-b-PMA(Az) films has been demonstrated by extruding the block copolymer (BCP) solutions through a micropore of a plastic gastight syringe. The parallelized orientation of PEO domains induced by this micropore extrusion can be recovered to perpendicular alignment via ultrasonication of the extruded BCP solutions and subsequent annealing. A plausible mechanism is proposed in this study. The BCP films can be used as templates to prepare nanowire arrays with controlled layers, which has enormous potential application in the field of integrated circuits.
The design and implementation of signal decomposition system of CL multi-wavelet transform based on DSP builder

NASA Astrophysics Data System (ADS)

Huang, Yan; Wang, Zhihui

2015-12-01

With the development of FPGA, DSP Builder is widely applied to design system-level algorithms. The algorithm of CL multi-wavelet is more advanced and effective than scalar wavelets in processing signal decomposition. Thus, a system of CL multi-wavelet based on DSP Builder is designed for the first time in this paper. The system mainly contains three parts: a pre-filtering subsystem, a one-level decomposition subsystem and a two-level decomposition subsystem. It can be converted into hardware language VHDL by the Signal Complier block that can be used in Quartus II. After analyzing the energy indicator, it shows that this system outperforms Daubenchies wavelet in signal decomposition. Furthermore, it has proved to be suitable for the implementation of signal fusion based on SoPC hardware, and it will become a solid foundation in this new field.
Rubus: A compiler for seamless and extensible parallelism.

PubMed

Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor

2017-01-01

Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
Rubus: A compiler for seamless and extensible parallelism

PubMed Central

Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor

2017-01-01

Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program. PMID:29211758
Petascale computation of multi-physics seismic simulations

NASA Astrophysics Data System (ADS)

Gabriel, Alice-Agnes; Madden, Elizabeth H.; Ulrich, Thomas; Wollherr, Stephanie; Duru, Kenneth C.

2017-04-01

Capturing the observed complexity of earthquake sources in concurrence with seismic wave propagation simulations is an inherently multi-scale, multi-physics problem. In this presentation, we present simulations of earthquake scenarios resolving high-detail dynamic rupture evolution and high frequency ground motion. The simulations combine a multitude of representations of model complexity; such as non-linear fault friction, thermal and fluid effects, heterogeneous fault stress and fault strength initial conditions, fault curvature and roughness, on- and off-fault non-elastic failure to capture dynamic rupture behavior at the source; and seismic wave attenuation, 3D subsurface structure and bathymetry impacting seismic wave propagation. Performing such scenarios at the necessary spatio-temporal resolution requires highly optimized and massively parallel simulation tools which can efficiently exploit HPC facilities. Our up to multi-PetaFLOP simulations are performed with SeisSol (www.seissol.org), an open-source software package based on an ADER-Discontinuous Galerkin (DG) scheme solving the seismic wave equations in velocity-stress formulation in elastic, viscoelastic, and viscoplastic media with high-order accuracy in time and space. Our flux-based implementation of frictional failure remains free of spurious oscillations. Tetrahedral unstructured meshes allow for complicated model geometry. SeisSol has been optimized on all software levels, including: assembler-level DG kernels which obtain 50% peak performance on some of the largest supercomputers worldwide; an overlapping MPI-OpenMP parallelization shadowing the multiphysics computations; usage of local time stepping; parallel input and output schemes and direct interfaces to community standard data formats. All these factors enable aim to minimise the time-to-solution. The results presented highlight the fact that modern numerical methods and hardware-aware optimization for modern supercomputers are essential to further our understanding of earthquake source physics and complement both physic-based ground motion research and empirical approaches in seismic hazard analysis. Lastly, we will conclude with an outlook on future exascale ADER-DG solvers for seismological applications.
An adaptive block-based fusion method with LUE-SSIM for multi-focus images

NASA Astrophysics Data System (ADS)

Zheng, Jianing; Guo, Yongcai; Huang, Yukun

2016-09-01

Because of the lenses' limited depth of field, digital cameras are incapable of acquiring an all-in-focus image of objects at varying distances in a scene. Multi-focus image fusion technique can effectively solve this problem. Aiming at the block-based multi-focus image fusion methods, the problem that blocking-artifacts often occurs. An Adaptive block-based fusion method based on lifting undistorted-edge structural similarity (LUE-SSIM) is put forward. In this method, image quality metrics LUE-SSIM is firstly proposed, which utilizes the characteristics of human visual system (HVS) and structural similarity (SSIM) to make the metrics consistent with the human visual perception. Particle swarm optimization(PSO) algorithm which selects LUE-SSIM as the object function is used for optimizing the block size to construct the fused image. Experimental results on LIVE image database shows that LUE-SSIM outperform SSIM on Gaussian defocus blur images quality assessment. Besides, multi-focus image fusion experiment is carried out to verify our proposed image fusion method in terms of visual and quantitative evaluation. The results show that the proposed method performs better than some other block-based methods, especially in reducing the blocking-artifact of the fused image. And our method can effectively preserve the undistorted-edge details in focus region of the source images.
Scalable geocomputation: evolving an environmental model building platform from single-core to supercomputers

NASA Astrophysics Data System (ADS)

Schmitz, Oliver; de Jong, Kor; Karssenberg, Derek

2017-04-01

There is an increasing demand to run environmental models on a big scale: simulations over large areas at high resolution. The heterogeneity of available computing hardware such as multi-core CPUs, GPUs or supercomputer potentially provides significant computing power to fulfil this demand. However, this requires detailed knowledge of the underlying hardware, parallel algorithm design and the implementation thereof in an efficient system programming language. Domain scientists such as hydrologists or ecologists often lack this specific software engineering knowledge, their emphasis is (and should be) on exploratory building and analysis of simulation models. As a result, models constructed by domain specialists mostly do not take full advantage of the available hardware. A promising solution is to separate the model building activity from software engineering by offering domain specialists a model building framework with pre-programmed building blocks that they combine to construct a model. The model building framework, consequently, needs to have built-in capabilities to make full usage of the available hardware. Developing such a framework providing understandable code for domain scientists and being runtime efficient at the same time poses several challenges on developers of such a framework. For example, optimisations can be performed on individual operations or the whole model, or tasks need to be generated for a well-balanced execution without explicitly knowing the complexity of the domain problem provided by the modeller. Ideally, a modelling framework supports the optimal use of available hardware whichsoever combination of model building blocks scientists use. We demonstrate our ongoing work on developing parallel algorithms for spatio-temporal modelling and demonstrate 1) PCRaster, an environmental software framework (http://www.pcraster.eu) providing spatio-temporal model building blocks and 2) parallelisation of about 50 of these building blocks using the new Fern library (https://github.com/geoneric/fern/), an independent generic raster processing library. Fern is a highly generic software library and its algorithms can be configured according to the configuration of a modelling framework. With manageable programming effort (e.g. matching data types between programming and domain language) we created a binding between Fern and PCRaster. The resulting PCRaster Python multicore module can be used to execute existing PCRaster models without having to make any changes to the model code. We show initial results on synthetic and geoscientific models indicating significant runtime improvements provided by parallel local and focal operations. We further outline challenges in improving remaining algorithms such as flow operations over digital elevation maps and further potential improvements like enhancing disk I/O.
ZettaBricks: A Language Compiler and Runtime System for Anyscale Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Amarasinghe, Saman

This grant supported the ZettaBricks and OpenTuner projects. ZettaBricks is a new implicitly parallel language and compiler where defining multiple implementations of multiple algorithms to solve a problem is the natural way of programming. ZettaBricks makes algorithmic choice a first class construct of the language. Choices are provided in a way that also allows our compiler to tune at a finer granularity. The ZettaBricks compiler autotunes programs by making both fine-grained as well as algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking. Additionally, ZettaBricks introduces novel techniques to autotune algorithms for differentmore » convergence criteria. When choosing between various direct and iterative methods, the ZettaBricks compiler is able to tune a program in such a way that delivers near-optimal efficiency for any desired level of accuracy. The compiler has the flexibility of utilizing different convergence criteria for the various components within a single algorithm, providing the user with accuracy choice alongside algorithmic choice. OpenTuner is a generalization of the experience gained in building an autotuner for ZettaBricks. OpenTuner is a new open source framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully-customizable configuration representations, an extensible technique representation to allow for domain-specific techniques, and an easy to use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneously; techniques that perform well will dynamically be allocated a larger proportion of tests.« less
SYSTEM FOR UNLOADING REACTORS

DOEpatents

Rand, A.C. Jr.

1961-05-01

An unloading device for individual vertical fuel channels in a nuclear reactor is shown. The channels are arranged in parallel rows and underneath each is a separate supporting block on which the fuel in the channel rests. The blocks are raounted in contiguous rows on an array of parallel pairs of tracks over the bottom of the reactor. Oblong hollows in the blocks form a continuous passageway through the middle of the row of blocks on each pair of tracks. At the end of each passageway is a horizontal grappling rod with a T- or L extension at the end next to the reactor of a length to permit it to pass through the oblong passageway in one position, but when rotated ninety degrees the head will strike one of the longer sides of the oblong hollow of one of the blocks. The grappling rod is actuated by a controllable reciprocating and rotating device which extends it beyond any individual block desired, rotates it and retracts it far enough to permit the fuel in the vertical channel above the block to fall into a handling tank below the reactor.
Extraction and Measurement of Multi-Level Parallelism in Productions Systems

DTIC Science & Technology

1990-12-14

the conflict set (agenda): (RULE.A (RULEB ( MATCHA ) (MATCHB) (ACTA)) (ACTB)) where the MATCH predicates are sets of rules in working memory and the act...ACTAfnMATCHB = 0 ) A ( MATCHA n ACTB = 0 )). The non-interference criteria above is conservative and may not detect all possible paral- lelism, but more
On Parallel Software Engineering Education Using Python

ERIC Educational Resources Information Center

Marowka, Ami

2018-01-01

Python is gaining popularity in academia as the preferred language to teach novices serial programming. The syntax of Python is clean, easy, and simple to understand. At the same time, it is a high-level programming language that supports multi programming paradigms such as imperative, functional, and object-oriented. Therefore, by default, it is…
Multi-Objective Parallel Test-Sheet Composition Using Enhanced Particle Swarm Optimization

ERIC Educational Resources Information Center

Ho, Tsu-Feng; Yin, Peng-Yeng; Hwang, Gwo-Jen; Shyu, Shyong Jian; Yean, Ya-Nan

2009-01-01

For large-scale tests, such as certification tests or entrance examinations, the composed test sheets must meet multiple assessment criteria. Furthermore, to fairly compare the knowledge levels of the persons who receive tests at different times owing to the insufficiency of available examination halls or the occurrence of certain unexpected…
Large bedrock slope failures in a British Columbia, Canada fjord: first documented submarine sackungen

NASA Astrophysics Data System (ADS)

Conway, Kim W.; Vaughn Barrie, J.

2018-01-01

Very large (>60×106 m3) sackungen or deep-seated gravitational slope deformations occur below sea level along a steep fjord wall in central Douglas Channel, British Columbia. The massive bedrock blocks were mobile between 13 and 11.5 thousand radiocarbon years BP (15,800 and 13,400 BP) immediately following deglaciation. Deformation of fjord sediments is apparent in sedimentary units overlying and adjacent to the blocks. Faults bound the edges of each block, cutting the glacial section but not the Holocene sediments. Retrogressive slides, small inset landslides as well as incipient and older slides are found on and around the large failure blocks. Lineations, fractures and faults parallel the coastline of Douglas Channel along the shoreline of the study area. Topographic data onshore indicate that faults and joints demarcate discrete rhomboid-shaped blocks which controlled the form, size and location of the sackungen. The described submarine sackungen share characteristic geomorphic features with many montane occurrences, such as uphill-facing scarps, foliated bedrock composition, largely vertical dislocation and a deglacial timing of development.
Large bedrock slope failures in a British Columbia, Canada fjord: first documented submarine sackungen

NASA Astrophysics Data System (ADS)

Conway, Kim W.; Vaughn Barrie, J.

2018-06-01

Very large (>60×106 m3) sackungen or deep-seated gravitational slope deformations occur below sea level along a steep fjord wall in central Douglas Channel, British Columbia. The massive bedrock blocks were mobile between 13 and 11.5 thousand radiocarbon years BP (15,800 and 13,400 BP) immediately following deglaciation. Deformation of fjord sediments is apparent in sedimentary units overlying and adjacent to the blocks. Faults bound the edges of each block, cutting the glacial section but not the Holocene sediments. Retrogressive slides, small inset landslides as well as incipient and older slides are found on and around the large failure blocks. Lineations, fractures and faults parallel the coastline of Douglas Channel along the shoreline of the study area. Topographic data onshore indicate that faults and joints demarcate discrete rhomboid-shaped blocks which controlled the form, size and location of the sackungen. The described submarine sackungen share characteristic geomorphic features with many montane occurrences, such as uphill-facing scarps, foliated bedrock composition, largely vertical dislocation and a deglacial timing of development.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoemmen, Mark

2010-11-01

Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less

Three-way parallel independent component analysis for imaging genetics using multi-objective optimization.

PubMed

Ulloa, Alvaro; Jingyu Liu; Vergara, Victor; Jiayu Chen; Calhoun, Vince; Pattichis, Marios

2014-01-01

In the biomedical field, current technology allows for the collection of multiple data modalities from the same subject. In consequence, there is an increasing interest for methods to analyze multi-modal data sets. Methods based on independent component analysis have proven to be effective in jointly analyzing multiple modalities, including brain imaging and genetic data. This paper describes a new algorithm, three-way parallel independent component analysis (3pICA), for jointly identifying genomic loci associated with brain function and structure. The proposed algorithm relies on the use of multi-objective optimization methods to identify correlations among the modalities and maximally independent sources within modality. We test the robustness of the proposed approach by varying the effect size, cross-modality correlation, noise level, and dimensionality of the data. Simulation results suggest that 3p-ICA is robust to data with SNR levels from 0 to 10 dB and effect-sizes from 0 to 3, while presenting its best performance with high cross-modality correlations, and more than one subject per 1,000 variables. In an experimental study with 112 human subjects, the method identified links between a genetic component (pointing to brain function and mental disorder associated genes, including PPP3CC, KCNQ5, and CYP7B1), a functional component related to signal decreases in the default mode network during the task, and a brain structure component indicating increases of gray matter in brain regions of the default mode region. Although such findings need further replication, the simulation and in-vivo results validate the three-way parallel ICA algorithm presented here as a useful tool in biomedical data decomposition applications.
Disappearance of Anisotropic Intermittency in Large-amplitude MHD Turbulence and Its Comparison with Small-amplitude MHD Turbulence

NASA Astrophysics Data System (ADS)

Yang, Liping; Zhang, Lei; He, Jiansen; Tu, Chuanyi; Li, Shengtai; Wang, Xin; Wang, Linghua

2018-03-01

Multi-order structure functions in the solar wind are reported to display a monofractal scaling when sampled parallel to the local magnetic field and a multifractal scaling when measured perpendicularly. Whether and to what extent will the scaling anisotropy be weakened by the enhancement of turbulence amplitude relative to the background magnetic strength? In this study, based on two runs of the magnetohydrodynamic (MHD) turbulence simulation with different relative levels of turbulence amplitude, we investigate and compare the scaling of multi-order magnetic structure functions and magnetic probability distribution functions (PDFs) as well as their dependence on the direction of the local field. The numerical results show that for the case of large-amplitude MHD turbulence, the multi-order structure functions display a multifractal scaling at all angles to the local magnetic field, with PDFs deviating significantly from the Gaussian distribution and a flatness larger than 3 at all angles. In contrast, for the case of small-amplitude MHD turbulence, the multi-order structure functions and PDFs have different features in the quasi-parallel and quasi-perpendicular directions: a monofractal scaling and Gaussian-like distribution in the former, and a conversion of a monofractal scaling and Gaussian-like distribution into a multifractal scaling and non-Gaussian tail distribution in the latter. These results hint that when intermittencies are abundant and intense, the multifractal scaling in the structure functions can appear even if it is in the quasi-parallel direction; otherwise, the monofractal scaling in the structure functions remains even if it is in the quasi-perpendicular direction.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Luszczek, Piotr R; Tomov, Stanimire Z; Dongarra, Jack J

We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given as the basis for solving linear systems' algorithms - the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. The tasks execution is scheduled over the computational components of a hybrid system of multi-core CPUs andmore » coprocessors using a light-weight runtime system. The use of lightweight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components.« less
Multi-LED parallel transmission for long distance underwater VLC system with one SPAD receiver

NASA Astrophysics Data System (ADS)

Wang, Chao; Yu, Hong-Yi; Zhu, Yi-Jun; Wang, Tao; Ji, Ya-Wei

2018-03-01

In this paper, a multiple light emitting diode (LED) chips parallel transmission (Multi-LED-PT) scheme for underwater visible light communication system with one photon-counting single photon avalanche diode (SPAD) receiver is proposed. As the lamp always consists of multi-LED chips, the data rate could be improved when we drive these multi-LED chips parallel by using the interleaver-division-multiplexing technique. For each chip, the on-off-keying modulation is used to reduce the influence of clipping. Then a serial successive interference cancellation detection algorithm based on ideal Poisson photon-counting channel by the SPAD is proposed. Finally, compared to the SPAD-based direct current-biased optical orthogonal frequency division multiplexing system, the proposed Multi-LED-PT system could improve the error-rate performance and anti-nonlinearity performance significantly under the effects of absorption, scattering and weak turbulence-induced channel fading together.
A Review of Lightweight Thread Approaches for High Performance Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Castello, Adrian; Pena, Antonio J.; Seo, Sangmin

High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads perfectly work with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massive parallel architectures and thus conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of commonlyfound patterns in current parallel codes. Moreover, wemore » study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns and that those LWT libraries perform better than OS threads-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.« less
Design of a 0.13-μm CMOS cascade expandable ΣΔ modulator for multi-standard RF telecom systems

NASA Astrophysics Data System (ADS)

Morgado, Alonso; del Río, Rocío; de la Rosa, José M.

2007-05-01

This paper reports a 130-nm CMOS programmable cascade ΣΔ modulator for multi-standard wireless terminals, capable of operating on three standards: GSM, Bluetooth and UMTS. The modulator is reconfigured at both architecture- and circuit- level in order to adapt its performance to the different standards specifications with optimized power consumption. The design of the building blocks is based upon a top-down CAD methodology that combines simulation and statistical optimization at different levels of the system hierarchy. Transistor-level simulations show correct operation for all standards, featuring 13-bit, 11.3-bit and 9-bit effective resolution within 200-kHz, 1-MHz and 4-MHz bandwidth, respectively.
Salivary Streptococcus mutans and Lactobacilli levels following probiotic cheese consumption in adults: A double blind randomized clinical trial*

PubMed Central

Mortazavi, Shiva; Akhlaghi, Najme

2012-01-01

BACKGROUND: The beneficial effects of Lactobacillus species have been reported but the role of these species including Lactobacillus casei (L. casei) on oral health is not well documented. The purpose of this study was to evaluate the effects of conventional or probiotic cheese containing L.casei on salivary Streptococcus mutans (SM) and Lactobacilli levels. METHODS: In this double-blind controlled trial (IRCT201009144745N1), 60 adults were randomly allocated in 2 parallel blocks. SM and Lactobacilli count assessment were performed three times. Subjects consumed either cheese containing L. casei (1×106Cfu/g) (probiotic block, n=29) or cheese without any probiotic (control block, n=31) twice daily for two weeks. Bacterial levels changes were compared using Wilcoxon and Mann-Whitney Tests. Logistic regression compared changes in number of subjects with lowest and highest SM or Lactobacilli levels. RESULTS: Statistically significant (p = 0.001) reduction of salivary SM was found in probiotic group. SM levels reduction was not significant between placebo and trial groups (p = 0.46, 62% in probiotic vs. 32% in placebo group). Lacto-bacilli count changes during trial were not statistically significant inter and intra blocks (p = 0.12). Probiotic intervention was significantly effective in high levels (> 105cfu/ml) of SM (Odds Ratio 11.6, 95% CI 1.56–86.17, p = 0.017). CONCLUSIONS: Probiotic cheese containing L. casei was not effective in salivary SM levels reduction comparing to conventional cheese. Adding L. casei to cheese could be useful in decreasing SM counts in adults 18-37 years old with highest level of SM. PMID:23248658
Seismic responses and controlling factors of Miocene deepwater gravity-flow deposits in Block A, Lower Congo Basin

NASA Astrophysics Data System (ADS)

Wang, Linlin; Wang, Zhenqi; Yu, Shui; Ngia, Ngong Roger

2016-08-01

The Miocene deepwater gravity-flow sedimentary system in Block A of the southwestern part of the Lower Congo Basin was identified and interpreted using high-resolution 3-D seismic, drilling and logging data to reveal development characteristics and main controlling factors. Five types of deepwater gravity-flow sedimentary units have been identified in the Miocene section of Block A, including mass transport, deepwater channel, levee, abandoned channel and sedimentary lobe deposits. Each type of sedimentary unit has distinct external features, internal structures and lateral characteristics in seismic profiles. Mass transport deposits (MTDs) in particular correspond to chaotic low-amplitude reflections in contact with mutants on both sides. The cross section of deepwater channel deposits in the seismic profile is in U- or V-shape. The channel deposits change in ascending order from low-amplitude, poor-continuity, chaotic filling reflections at the bottom, to high-amplitude, moderate to poor continuity, chaotic or sub-parallel reflections in the middle section and to moderate-weak amplitude, good continuity, parallel or sub-parallel reflections in the upper section. The sedimentary lobes are laterally lobate, which corresponds to high-amplitude, good-continuity, moundy reflection signatures in the seismic profile. Due to sediment flux, faults, and inherited terrain, few mass transport deposits occur in the northeastern part of the study area. The front of MTDs is mainly composed of channel-levee complex deposits, while abandoned-channel and lobe-deposits are usually developed in high-curvature channel sections and the channel terminals, respectively. The distribution of deepwater channel, levee, abandoned channel and sedimentary lobe deposits is predominantly controlled by relative sea level fluctuations and to a lesser extent by tectonism and inherited terrain.
Final report for the Tera Computer TTI CRADA

DOE Office of Scientific and Technical Information (OSTI.GOV)

Davidson, G.S.; Pavlakos, C.; Silva, C.

1997-01-01

Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTAmore » parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.« less
Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

NASA Astrophysics Data System (ADS)

Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

2017-10-01

In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.
Real-time SHVC software decoding with multi-threaded parallel processing

NASA Astrophysics Data System (ADS)

Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

2014-09-01

This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads are compared in terms of decoding speed and resource usage, including processor and memory.
Time-Dependent Simulations of Turbopump Flows

NASA Technical Reports Server (NTRS)

Kiris, Cetin; Kwak, Dochan; Chan, William; Williams, Robert

2002-01-01

Unsteady flow simulations for RLV (Reusable Launch Vehicles) 2nd Generation baseline turbopump for one and half impeller rotations have been completed by using a 34.3 Million grid points model. MLP (Multi-Level Parallelism) shared memory parallelism has been implemented in INS3D, and benchmarked. Code optimization for cash based platforms will be completed by the end of September 2001. Moving boundary capability is obtained by using DCF module. Scripting capability from CAD (computer aided design) geometry to solution has been developed. Data compression is applied to reduce data size in post processing. Fluid/Structure coupling has been initiated.
GPU-completeness: theory and implications

NASA Astrophysics Data System (ADS)

Lin, I.-Jong

2011-01-01

This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate his algorithmic space and imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HPdelivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation inform the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrate the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a welldefined subset of algorithms, GPU-Completeness is intended to connect the parallelism, algorithms and efficient architectures into a unified framework to show that multiple layers of parallel implementation are guided by the same underlying trade-off.
Implications of Multi-Core Architectures on the Development of Multiple Independent Levels of Security (MILS) Compliant Systems

DTIC Science & Technology

2012-10-01

REPORT 3. DATES COVERED (From - To) MAR 2010 – APR 2012 4 . TITLE AND SUBTITLE IMPLICATIONS OF MULT-CORE ARCHITECTURES ON THE DEVELOPMENT OF...Framework for Multicore Information Flow Analysis ...................................... 23 4 4.1 A Hypothetical Reference Architecture... 4 Figure 2: Pentium II Block Diagram
Parallel software support for computational structural mechanics

NASA Technical Reports Server (NTRS)

Jordan, Harry F.

1987-01-01

The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.
A programmable computational image sensor for high-speed vision

NASA Astrophysics Data System (ADS)

Yang, Jie; Shi, Cong; Long, Xitian; Wu, Nanjian

2013-08-01

In this paper we present a programmable computational image sensor for high-speed vision. This computational image sensor contains four main blocks: an image pixel array, a massively parallel processing element (PE) array, a row processor (RP) array and a RISC core. The pixel-parallel PE is responsible for transferring, storing and processing image raw data in a SIMD fashion with its own programming language. The RPs are one dimensional array of simplified RISC cores, it can carry out complex arithmetic and logic operations. The PE array and RP array can finish great amount of computation with few instruction cycles and therefore satisfy the low- and middle-level high-speed image processing requirement. The RISC core controls the whole system operation and finishes some high-level image processing algorithms. We utilize a simplified AHB bus as the system bus to connect our major components. Programming language and corresponding tool chain for this computational image sensor are also developed.
Multilevel-Dc-Bus Inverter For Providing Sinusoidal And Pwm Electrical Machine Voltages

DOEpatents

Su, Gui-Jia [Knoxville, TN

2005-11-29

A circuit for controlling an ac machine comprises a full bridge network of commutation switches which are connected to supply current for a corresponding voltage phase to the stator windings, a plurality of diodes, each in parallel connection to a respective one of the commutation switches, a plurality of dc source connections providing a multi-level dc bus for the full bridge network of commutation switches to produce sinusoidal voltages or PWM signals, and a controller connected for control of said dc source connections and said full bridge network of commutation switches to output substantially sinusoidal voltages to the stator windings. With the invention, the number of semiconductor switches is reduced to m+3 for a multi-level dc bus having m levels. A method of machine control is also disclosed.
Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

NASA Astrophysics Data System (ADS)

Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

2017-11-01

Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.
Feature Clustering for Accelerating Parallel Coordinate Descent

DOE Office of Scientific and Technical Information (OSTI.GOV)

Scherrer, Chad; Tewari, Ambuj; Halappanavar, Mahantesh

2012-12-06

We demonstrate an approach for accelerating calculation of the regularization path for L1 sparse logistic regression problems. We show the benefit of feature clustering as a preconditioning step for parallel block-greedy coordinate descent algorithms.
A comparison of parallel and diverging screw angles in the stability of locked plate constructs.

PubMed

Wähnert, D; Windolf, M; Brianza, S; Rothstock, S; Radtke, R; Brighenti, V; Schwieger, K

2011-09-01

We investigated the static and cyclical strength of parallel and angulated locking plate screws using rigid polyurethane foam (0.32 g/cm(3)) and bovine cancellous bone blocks. Custom-made stainless steel plates with two conically threaded screw holes with different angulations (parallel, 10° and 20° divergent) and 5 mm self-tapping locking screws underwent pull-out and cyclical pull and bending tests. The bovine cancellous blocks were only subjected to static pull-out testing. We also performed finite element analysis for the static pull-out test of the parallel and 20° configurations. In both the foam model and the bovine cancellous bone we found the significantly highest pull-out force for the parallel constructs. In the finite element analysis there was a 47% more damage in the 20° divergent constructs than in the parallel configuration. Under cyclical loading, the mean number of cycles to failure was significantly higher for the parallel group, followed by the 10° and 20° divergent configurations. In our laboratory setting we clearly showed the biomechanical disadvantage of a diverging locking screw angle under static and cyclical loading.

What Multilevel Parallel Programs do when you are not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Labarta, Jesus; Gimenez, Judit

2004-01-01

With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting-a large scientific application in order to employ a new programming paradigms is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP which was developed by Taft. The approach is similar to MPi/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As it is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
High-Throughput, Adaptive FFT Architecture for FPGA-Based Spaceborne Data Processors

NASA Technical Reports Server (NTRS)

NguyenKobayashi, Kayla; Zheng, Jason X.; He, Yutao; Shah, Biren N.

2011-01-01

Exponential growth in microelectronics technology such as field-programmable gate arrays (FPGAs) has enabled high-performance spaceborne instruments with increasing onboard data processing capabilities. As a commonly used digital signal processing (DSP) building block, fast Fourier transform (FFT) has been of great interest in onboard data processing applications, which needs to strike a reasonable balance between high-performance (throughput, block size, etc.) and low resource usage (power, silicon footprint, etc.). It is also desirable to be designed so that a single design can be reused and adapted into instruments with different requirements. The Multi-Pass Wide Kernel FFT (MPWK-FFT) architecture was developed, in which the high-throughput benefits of the parallel FFT structure and the low resource usage of Singleton s single butterfly method is exploited. The result is a wide-kernel, multipass, adaptive FFT architecture. The 32K-point MPWK-FFT architecture includes 32 radix-2 butterflies, 64 FIFOs to store the real inputs, 64 FIFOs to store the imaginary inputs, complex twiddle factor storage, and FIFO logic to route the outputs to the correct FIFO. The inputs are stored in sequential fashion into the FIFOs, and the outputs of each butterfly are sequentially written first into the even FIFO, then the odd FIFO. Because of the order of the outputs written into the FIFOs, the depth of the even FIFOs, which are 768 each, are 1.5 times larger than the odd FIFOs, which are 512 each. The total memory needed for data storage, assuming that each sample is 36 bits, is 2.95 Mbits. The twiddle factors are stored in internal ROM inside the FPGA for fast access time. The total memory size to store the twiddle factors is 589.9Kbits. This FFT structure combines the benefits of high throughput from the parallel FFT kernels and low resource usage from the multi-pass FFT kernels with desired adaptability. Space instrument missions that need onboard FFT capabilities such as the proposed DESDynl, SWOT (Surface Water Ocean Topography), and Europa sounding radar missions would greatly benefit from this technology with significant reductions in non-recurring cost and risk.
Present-day crustal deformation and strain transfer in northeastern Tibetan Plateau

NASA Astrophysics Data System (ADS)

Li, Yuhang; Liu, Mian; Wang, Qingliang; Cui, Duxin

2018-04-01

The three-dimensional present-day crustal deformation and strain partitioning in northeastern Tibetan Plateau are analyzed using available GPS and precise leveling data. We used the multi-scale wavelet method to analyze strain rates, and the elastic block model to estimate slip rates on the major faults and internal strain within each block. Our results show that shear strain is strongly localized along major strike-slip faults, as expected in the tectonic extrusion model. However, extrusion ends and transfers to crustal contraction near the eastern margin of the Tibetan Plateau. The strain transfer is abrupt along the Haiyuan Fault and diffusive along the East Kunlun Fault. Crustal contraction is spatially correlated with active uplifting. The present-day strain is concentrated along major fault zones; however, within many terranes bounded by these faults, intra-block strain is detectable. Terranes having high intra-block strain rates also show strong seismicity. On average the Ordos and Sichuan blocks show no intra-block strain, but localized strain on the southwestern corner of the Ordos block indicates tectonic encroachment.
Extending HPF for advanced data parallel applications

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

1994-01-01

The stated goal of High Performance Fortran (HPF) was to 'address the problems of writing data parallel programs where the distribution of data affects performance'. After examining the current version of the language we are led to the conclusion that HPF has not fully achieved this goal. While the basic distribution functions offered by the language - regular block, cyclic, and block cyclic distributions - can support regular numerical algorithms, advanced applications such as particle-in-cell codes or unstructured mesh solvers cannot be expressed adequately. We believe that this is a major weakness of HPF, significantly reducing its chances of becoming accepted in the numeric community. The paper discusses the data distribution and alignment issues in detail, points out some flaws in the basic language, and outlines possible future paths of development. Furthermore, we briefly deal with the issue of task parallelism and its integration with the data parallel paradigm of HPF.
Parallel Adaptive Mesh Refinement Library

NASA Technical Reports Server (NTRS)

Mac-Neice, Peter; Olson, Kevin

2005-01-01

Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.
Parallelization of implicit finite difference schemes in computational fluid dynamics

NASA Technical Reports Server (NTRS)

Decker, Naomi H.; Naik, Vijay K.; Nicoules, Michel

1990-01-01

Implicit finite difference schemes are often the preferred numerical schemes in computational fluid dynamics, requiring less stringent stability bounds than the explicit schemes. Each iteration in an implicit scheme involves global data dependencies in the form of second and higher order recurrences. Efficient parallel implementations of such iterative methods are considerably more difficult and non-intuitive. The parallelization of the implicit schemes that are used for solving the Euler and the thin layer Navier-Stokes equations and that require inversions of large linear systems in the form of block tri-diagonal and/or block penta-diagonal matrices is discussed. Three-dimensional cases are emphasized and schemes that minimize the total execution time are presented. Partitioning and scheduling schemes for alleviating the effects of the global data dependencies are described. An analysis of the communication and the computation aspects of these methods is presented. The effect of the boundary conditions on the parallel schemes is also discussed.
Parallel PAB3D: Experiences with a Prototype in MPI

NASA Technical Reports Server (NTRS)

Guerinoni, Fabio; Abdol-Hamid, Khaled S.; Pao, S. Paul

1998-01-01

PAB3D is a three-dimensional Navier Stokes solver that has gained acceptance in the research and industrial communities. It takes as computational domain, a set disjoint blocks covering the physical domain. This is the first report on the implementation of PAB3D using the Message Passing Interface (MPI), a standard for parallel processing. We discuss briefly the characteristics of tile code and define a prototype for testing. The principal data structure used for communication is derived from preprocessing "patching". We describe a simple interface (COMMSYS) for MPI communication, and some general techniques likely to be encountered when working on problems of this nature. Last, we identify levels of improvement from the current version and outline future work.
Incomplete Sparse Approximate Inverses for Parallel Preconditioning

DOE PAGES

Anzt, Hartwig; Huckle, Thomas K.; Bräckle, Jürgen; ...

2017-10-28

In this study, we propose a new preconditioning method that can be seen as a generalization of block-Jacobi methods, or as a simplification of the sparse approximate inverse (SAI) preconditioners. The “Incomplete Sparse Approximate Inverses” (ISAI) is in particular efficient in the solution of sparse triangular linear systems of equations. Those arise, for example, in the context of incomplete factorization preconditioning. ISAI preconditioners can be generated via an algorithm providing fine-grained parallelism, which makes them attractive for hardware with a high concurrency level. Finally, in a study covering a large number of matrices, we identify the ISAI preconditioner as anmore » attractive alternative to exact triangular solves in the context of incomplete factorization preconditioning.« less
Parallel Grand Canonical Monte Carlo (ParaGrandMC) Simulation Code

NASA Technical Reports Server (NTRS)

Yamakov, Vesselin I.

2016-01-01

This report provides an overview of the Parallel Grand Canonical Monte Carlo (ParaGrandMC) simulation code. This is a highly scalable parallel FORTRAN code for simulating the thermodynamic evolution of metal alloy systems at the atomic level, and predicting the thermodynamic state, phase diagram, chemical composition and mechanical properties. The code is designed to simulate multi-component alloy systems, predict solid-state phase transformations such as austenite-martensite transformations, precipitate formation, recrystallization, capillary effects at interfaces, surface absorption, etc., which can aid the design of novel metallic alloys. While the software is mainly tailored for modeling metal alloys, it can also be used for other types of solid-state systems, and to some degree for liquid or gaseous systems, including multiphase systems forming solid-liquid-gas interfaces.
Method and apparatus for determining two-phase flow in rock fracture

DOEpatents

Persoff, Peter; Pruess, Karsten; Myer, Larry

1994-01-01

An improved method and apparatus as disclosed for measuring the permeability of multiple phases through a rock fracture. The improvement in the method comprises delivering the respective phases through manifolds to uniformly deliver and collect the respective phases to and from opposite edges of the rock fracture in a distributed manner across the edge of the fracture. The improved apparatus comprises first and second manifolds comprising bores extending within porous blocks parallel to the rock fracture for distributing and collecting the wetting phase to and from surfaces of the porous blocks, which respectively face the opposite edges of the rock fracture. The improved apparatus further comprises other manifolds in the form of plenums located adjacent the respective porous blocks for uniform delivery of the non-wetting phase to parallel grooves disposed on the respective surfaces of the porous blocks facing the opposite edges of the rock fracture and generally perpendicular to the rock fracture.
TIMEDELN: A programme for the detection and parametrization of overlapping resonances using the time-delay method

NASA Astrophysics Data System (ADS)

Little, Duncan A.; Tennyson, Jonathan; Plummer, Martin; Noble, Clifford J.; Sunderland, Andrew G.

2017-06-01

TIMEDELN implements the time-delay method of determining resonance parameters from the characteristic Lorentzian form displayed by the largest eigenvalues of the time-delay matrix. TIMEDELN constructs the time-delay matrix from input K-matrices and analyses its eigenvalues. This new version implements multi-resonance fitting and may be run serially or as a high performance parallel code with three levels of parallelism. TIMEDELN takes K-matrices from a scattering calculation, either read from a file or calculated on a dynamically adjusted grid, and calculates the time-delay matrix. This is then diagonalized, with the largest eigenvalue representing the longest time-delay experienced by the scattering particle. A resonance shows up as a characteristic Lorentzian form in the time-delay: the programme searches the time-delay eigenvalues for maxima and traces resonances when they pass through different eigenvalues, separating overlapping resonances. It also performs the fitting of the calculated data to the Lorentzian form and outputs resonance positions and widths. Any remaining overlapping resonances can be fitted jointly. The branching ratios of decay into the open channels can also be found. The programme may be run serially or in parallel with three levels of parallelism. The parallel code modules are abstracted from the main physics code and can be used independently.
Parallelized multi-graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy.

PubMed

Tankam, Patrice; Santhanam, Anand P; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P

2014-07-01

Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing.
Decoupling Principle Analysis and Development of a Parallel Three-Dimensional Force Sensor

PubMed Central

Zhao, Yanzhi; Jiao, Leihao; Weng, Dacheng; Zhang, Dan; Zheng, Rencheng

2016-01-01

In the development of the multi-dimensional force sensor, dimension coupling is the ubiquitous factor restricting the improvement of the measurement accuracy. To effectively reduce the influence of dimension coupling on the parallel multi-dimensional force sensor, a novel parallel three-dimensional force sensor is proposed using a mechanical decoupling principle, and the influence of the friction on dimension coupling is effectively reduced by making the friction rolling instead of sliding friction. In this paper, the mathematical model is established by combining with the structure model of the parallel three-dimensional force sensor, and the modeling and analysis of mechanical decoupling are carried out. The coupling degree (ε) of the designed sensor is defined and calculated, and the calculation results show that the mechanical decoupling parallel structure of the sensor possesses good decoupling performance. A prototype of the parallel three-dimensional force sensor was developed, and FEM analysis was carried out. The load calibration and data acquisition experiment system are built, and then calibration experiments were done. According to the calibration experiments, the measurement accuracy is less than 2.86% and the coupling accuracy is less than 3.02%. The experimental results show that the sensor system possesses high measuring accuracy, which provides a basis for the applied research of the parallel multi-dimensional force sensor. PMID:27649194
Regularized Generalized Canonical Correlation Analysis

ERIC Educational Resources Information Center

Tenenhaus, Arthur; Tenenhaus, Michel

2011-01-01

Regularized generalized canonical correlation analysis (RGCCA) is a generalization of regularized canonical correlation analysis to three or more sets of variables. It constitutes a general framework for many multi-block data analysis methods. It combines the power of multi-block data analysis methods (maximization of well identified criteria) and…
A Hierarchical Model for Simultaneous Detection and Estimation in Multi-subject fMRI Studies

PubMed Central

Degras, David; Lindquist, Martin A.

2014-01-01

In this paper we introduce a new hierarchical model for the simultaneous detection of brain activation and estimation of the shape of the hemodynamic response in multi-subject fMRI studies. The proposed approach circumvents a major stumbling block in standard multi-subject fMRI data analysis, in that it both allows the shape of the hemodynamic response function to vary across region and subjects, while still providing a straightforward way to estimate population-level activation. An e cient estimation algorithm is presented, as is an inferential framework that not only allows for tests of activation, but also for tests for deviations from some canonical shape. The model is validated through simulations and application to a multi-subject fMRI study of thermal pain. PMID:24793829
Intra-class correlation estimates for assessment of vitamin A intake in children.

PubMed

Agarwal, Girdhar G; Awasthi, Shally; Walter, Stephen D

2005-03-01

In many community-based surveys, multi-level sampling is inherent in the design. In the design of these studies, especially to calculate the appropriate sample size, investigators need good estimates of intra-class correlation coefficient (ICC), along with the cluster size, to adjust for variation inflation due to clustering at each level. The present study used data on the assessment of clinical vitamin A deficiency and intake of vitamin A-rich food in children in a district in India. For the survey, 16 households were sampled from 200 villages nested within eight randomly-selected blocks of the district. ICCs and components of variances were estimated from a three-level hierarchical random effects analysis of variance model. Estimates of ICCs and variance components were obtained at village and block levels. Between-cluster variation was evident at each level of clustering. In these estimates, ICCs were inversely related to cluster size, but the design effect could be substantial for large clusters. At the block level, most ICC estimates were below 0.07. At the village level, many ICC estimates ranged from 0.014 to 0.45. These estimates may provide useful information for the design of epidemiological studies in which the sampled (or allocated) units range in size from households to large administrative zones.
Seismic analysis of parallel structures coupled by lead extrusion dampers

NASA Astrophysics Data System (ADS)

Patel, C. C.

2017-06-01

In this paper, the response behaviors of two parallel structures coupled by Lead Extrusion Dampers (LED) under various earthquake ground motion excitations are investigated. The equation of motion for the two parallel, multi-degree-of-freedom (MDOF) structures connected by LEDs is formulated. To explore the viability of LED to control the responses, namely displacement, acceleration and shear force of parallel coupled structures, the numerical study is done in two parts: (1) two parallel MDOF structures connected with LEDs having same damper damping in all the dampers and (2) two parallel MDOF structures connected with LEDs having different damper damping. A parametric study is conducted to investigate the optimum damping of the dampers. Moreover, to limit the cost of the dampers, the study is conducted with only 50% of total dampers at optimal locations, instead of placing the dampers at all the floor level. Results show that LEDs connecting the parallel structures of different fundamental frequencies, the earthquake-induced responses of either structure can be effectively reduced. Further, it is not necessary to connect the two structures at all floors; however, lesser damper at appropriate locations can significantly reduce the earthquake response of the coupled system, thus reducing the cost of the dampers significantly.
Genetic Parallel Programming: design and implementation.

PubMed

Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong

2006-01-01

This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.
Multi-Kilowatt Power Module for High-Power Hall Thrusters

NASA Technical Reports Server (NTRS)

Pinero, Luis R.; Bowers, Glen E.

2005-01-01

Future NASA missions will require high-performance electric propulsion systems. Hall thrusters are being developed at NASA Glenn for high-power, high-specific impulse operation. These thrusters operate at power levels up to 50 kW of power and discharge voltages in excess of 600 V. A parallel effort is being conducted to develop power electronics for these thrusters that push the technology beyond the 5kW state-of-the-art power level. A 10 kW power module was designed to produce an output of 500 V and 20 A from a nominal 100 V input. Resistive load tests revealed efficiencies in excess of 96 percent. Load current share and phase synchronization circuits were designed and tested that will allow connecting multiple modules in parallel to process higher power.
Multiple-image encryption based on double random phase encoding and compressive sensing by using a measurement array preprocessed with orthogonal-basis matrices

NASA Astrophysics Data System (ADS)

Zhang, Luozhi; Zhou, Yuanyuan; Huo, Dongming; Li, Jinxi; Zhou, Xin

2018-09-01

A method is presented for multiple-image encryption by using the combination of orthogonal encoding and compressive sensing based on double random phase encoding. As an original thought in optical encryption, it is demonstrated theoretically and carried out by using the orthogonal-basis matrices to build a modified measurement array, being projected onto the images. In this method, all the images can be compressed in parallel into a stochastic signal and be diffused to be a stationary white noise. Meanwhile, each single-image can be separately reestablished by adopting a proper decryption key combination through the block-reconstruction rather than the entire-rebuilt, for its costs of data and decryption time are greatly decreased, which may be promising both in multi-user multiplexing and huge-image encryption/decryption. Besides, the security of this method is characterized by using the bit-length of key, and the parallelism is investigated as well. The simulations and discussions are also made on the effects of decryption as well as the correlation coefficient by using a series of sampling rates, occlusion attacks, keys with various error rates, etc.

Block iterative restoration of astronomical images with the massively parallel processor

NASA Technical Reports Server (NTRS)

Heap, Sara R.; Lindler, Don J.

1987-01-01

A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.
Subcostal Transverse Abdominis Plane Block for Acute Pain Management: A Review.

PubMed

Soliz, Jose M; Lipski, Ian; Hancher-Hodges, Shannon; Speer, Barbra Bryce; Popat, Keyuri

2017-10-01

The subcostal transverse abdominis plane (SCTAP) block is the deposition of local anesthetic in the transverse abdominis plane inferior and parallel to the costal margin. There is a growing consensus that the SCTAP block provides better analgesia for upper abdominal incisions than the traditional transverse abdominis plane block. In addition, when used as part of a four-quadrant transverse abdominis plane block, the SCTAP block may provide adequate analgesia for major abdominal surgery. The purpose of this review is to discuss the SCTAP block, including its indications, technique, local anesthetic solutions, and outcomes.
Experiences with Cray multi-tasking

NASA Technical Reports Server (NTRS)

Miya, E. N.

1985-01-01

The issues involved in modifying an existing code for multitasking is explored. They include Cray extensions to FORTRAN, an examination of the application code under study, designing workable modifications, specific code modifications to the VAX and Cray versions, performance, and efficiency results. The finished product is a faster, fully synchronous, parallel version of the original program. A production program is partitioned by hand to run on two CPUs. Loop splitting multitasks three key subroutines. Simply dividing subroutine data and control structure down the middle of a subroutine is not safe. Simple division produces results that are inconsistent with uniprocessor runs. The safest way to partition the code is to transfer one block of loops at a time and check the results of each on a test case. Other issues include debugging and performance. Task startup and maintenance (e.g., synchronization) are potentially expensive.
Rapid and annealing-free self-assembly of DNA building blocks for 3D hydrogel chaperoned by cationic comb-type copolymers.

PubMed

Zhang, Zheng; Wu, Yuyang; Yu, Feng; Niu, Chaoqun; Du, Zhi; Chen, Yong; Du, Jie

2017-10-01

The construction and self-assembly of DNA building blocks are the foundation of bottom-up development of three-dimensional DNA nanostructures or hydrogels. However, most self-assembly from DNA components is impeded by the mishybridized intermediates or the thermodynamic instability. To enable rapid production of complicated DNA objects with high yields no need for annealing process, herein different DNA building blocks (Y-shaped, L- and L'-shaped units) were assembled in presence of a cationic comb-type copolymer, poly (L-lysine)-graft-dextran (PLL-g-Dex), under physiological conditions. The results demonstrated that PLL-g-Dex not only significantly promoted the self-assembly of DNA blocks with high efficiency, but also stabilized the assembled multi-level structures especially for promoting the complicated 3D DNA hydrogel formation. This study develops a novel strategy for rapid and high-yield production of DNA hydrogel even derived from instable building blocks at relatively low DNA concentrations, which would endow DNA nanotechnology for more practical applications.
A highly efficient multi-core algorithm for clustering extremely large datasets

PubMed Central

2010-01-01

Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
Physics-Based Preconditioning of a Compressible Flow Solver for Large-Scale Simulations of Additive Manufacturing Processes

NASA Astrophysics Data System (ADS)

Weston, Brian; Nourgaliev, Robert; Delplanque, Jean-Pierre

2017-11-01

We present a new block-based Schur complement preconditioner for simulating all-speed compressible flow with phase change. The conservation equations are discretized with a reconstructed Discontinuous Galerkin method and integrated in time with fully implicit time discretization schemes. The resulting set of non-linear equations is converged using a robust Newton-Krylov framework. Due to the stiffness of the underlying physics associated with stiff acoustic waves and viscous material strength effects, we solve for the primitive-variables (pressure, velocity, and temperature). To enable convergence of the highly ill-conditioned linearized systems, we develop a physics-based preconditioner, utilizing approximate block factorization techniques to reduce the fully-coupled 3×3 system to a pair of reduced 2×2 systems. We demonstrate that our preconditioned Newton-Krylov framework converges on very stiff multi-physics problems, corresponding to large CFL and Fourier numbers, with excellent algorithmic and parallel scalability. Results are shown for the classic lid-driven cavity flow problem as well as for 3D laser-induced phase change. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Design and implementation of highly parallel pipelined VLSI systems

NASA Astrophysics Data System (ADS)

Delange, Alphonsus Anthonius Jozef

A methodology and its realization as a prototype CAD (Computer Aided Design) system for the design and analysis of complex multiprocessor systems is presented. The design is an iterative process in which the behavioral specifications of the system components are refined into structural descriptions consisting of interconnections and lower level components etc. A model for the representation and analysis of multiprocessor systems at several levels of abstraction and an implementation of a CAD system based on this model are described. A high level design language, an object oriented development kit for tool design, a design data management system, and design and analysis tools such as a high level simulator and graphics design interface which are integrated into the prototype system and graphics interface are described. Procedures for the synthesis of semiregular processor arrays, and to compute the switching of input/output signals, memory management and control of processor array, and sequencing and segmentation of input/output data streams due to partitioning and clustering of the processor array during the subsequent synthesis steps, are described. The architecture and control of a parallel system is designed and each component mapped to a module or module generator in a symbolic layout library, compacted for design rules of VLSI (Very Large Scale Integration) technology. An example of the design of a processor that is a useful building block for highly parallel pipelined systems in the signal/image processing domains is given.
A multi-component parallel-plate flow chamber system for studying the effect of exercise-induced wall shear stress on endothelial cells.

PubMed

Wang, Yan-Xia; Xiang, Cheng; Liu, Bo; Zhu, Yong; Luan, Yong; Liu, Shu-Tian; Qin, Kai-Rong

2016-12-28

In vivo studies have demonstrated that reasonable exercise training can improve endothelial function. To confirm the key role of wall shear stress induced by exercise on endothelial cells, and to understand how wall shear stress affects the structure and the function of endothelial cells, it is crucial to design and fabricate an in vitro multi-component parallel-plate flow chamber system which can closely replicate exercise-induced wall shear stress waveforms in artery. The in vivo wall shear stress waveforms from the common carotid artery of a healthy volunteer in resting and immediately after 30 min acute aerobic cycling exercise were first calculated by measuring the inner diameter and the center-line blood flow velocity with a color Doppler ultrasound. According to the above in vivo wall shear stress waveforms, we designed and fabricated a parallel-plate flow chamber system with appropriate components based on a lumped parameter hemodynamics model. To validate the feasibility of this system, human umbilical vein endothelial cells (HUVECs) line were cultured within the parallel-plate flow chamber under abovementioned two types of wall shear stress waveforms and the intracellular actin microfilaments and nitric oxide (NO) production level were evaluated using fluorescence microscope. Our results show that the trends of resting and exercise-induced wall shear stress waveforms, especially the maximal, minimal and mean wall shear stress as well as oscillatory shear index, generated by the parallel-plate flow chamber system are similar to those acquired from the common carotid artery. In addition, the cellular experiments demonstrate that the actin microfilaments and the production of NO within cells exposed to the two different wall shear stress waveforms exhibit different dynamic behaviors; there are larger numbers of actin microfilaments and higher level NO in cells exposed in exercise-induced wall shear stress condition than resting wall shear stress condition. The parallel-plate flow chamber system can well reproduce wall shear stress waveforms acquired from the common carotid artery in resting and immediately after exercise states. Furthermore, it can be used for studying the endothelial cells responses under resting and exercise-induced wall shear stress environments in vitro.
Orientation of airborne laser scanning point clouds with multi-view, multi-scale image blocks.

PubMed

Rönnholm, Petri; Hyyppä, Hannu; Hyyppä, Juha; Haggrén, Henrik

2009-01-01

Comprehensive 3D modeling of our environment requires integration of terrestrial and airborne data, which is collected, preferably, using laser scanning and photogrammetric methods. However, integration of these multi-source data requires accurate relative orientations. In this article, two methods for solving relative orientation problems are presented. The first method includes registration by minimizing the distances between of an airborne laser point cloud and a 3D model. The 3D model was derived from photogrammetric measurements and terrestrial laser scanning points. The first method was used as a reference and for validation. Having completed registration in the object space, the relative orientation between images and laser point cloud is known. The second method utilizes an interactive orientation method between a multi-scale image block and a laser point cloud. The multi-scale image block includes both aerial and terrestrial images. Experiments with the multi-scale image block revealed that the accuracy of a relative orientation increased when more images were included in the block. The orientations of the first and second methods were compared. The comparison showed that correct rotations were the most difficult to detect accurately by using the interactive method. Because the interactive method forces laser scanning data to fit with the images, inaccurate rotations cause corresponding shifts to image positions. However, in a test case, in which the orientation differences included only shifts, the interactive method could solve the relative orientation of an aerial image and airborne laser scanning data repeatedly within a couple of centimeters.
Orientation of Airborne Laser Scanning Point Clouds with Multi-View, Multi-Scale Image Blocks

PubMed Central

Rönnholm, Petri; Hyyppä, Hannu; Hyyppä, Juha; Haggrén, Henrik

2009-01-01

Comprehensive 3D modeling of our environment requires integration of terrestrial and airborne data, which is collected, preferably, using laser scanning and photogrammetric methods. However, integration of these multi-source data requires accurate relative orientations. In this article, two methods for solving relative orientation problems are presented. The first method includes registration by minimizing the distances between of an airborne laser point cloud and a 3D model. The 3D model was derived from photogrammetric measurements and terrestrial laser scanning points. The first method was used as a reference and for validation. Having completed registration in the object space, the relative orientation between images and laser point cloud is known. The second method utilizes an interactive orientation method between a multi-scale image block and a laser point cloud. The multi-scale image block includes both aerial and terrestrial images. Experiments with the multi-scale image block revealed that the accuracy of a relative orientation increased when more images were included in the block. The orientations of the first and second methods were compared. The comparison showed that correct rotations were the most difficult to detect accurately by using the interactive method. Because the interactive method forces laser scanning data to fit with the images, inaccurate rotations cause corresponding shifts to image positions. However, in a test case, in which the orientation differences included only shifts, the interactive method could solve the relative orientation of an aerial image and airborne laser scanning data repeatedly within a couple of centimeters. PMID:22454569
Playback system designed for X-Band SAR

NASA Astrophysics Data System (ADS)

Yuquan, Liu; Changyong, Dou

2014-03-01

SAR(Synthetic Aperture Radar) has extensive application because it is daylight and weather independent. In particular, X-Band SAR strip map, designed by Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, provides high ground resolution images, at the same time it has a large spatial coverage and a short acquisition time, so it is promising in multi-applications. When sudden disaster comes, the emergency situation acquires radar signal data and image as soon as possible, in order to take action to reduce loss and save lives in the first time. This paper summarizes a type of X-Band SAR playback processing system designed for disaster response and scientific needs. It describes SAR data workflow includes the payload data transmission and reception process. Playback processing system completes signal analysis on the original data, providing SAR level 0 products and quick image. Gigabit network promises radar signal transmission efficiency from recorder to calculation unit. Multi-thread parallel computing and ping pong operation can ensure computation speed. Through gigabit network, multi-thread parallel computing and ping pong operation, high speed data transmission and processing meet the SAR radar data playback real time requirement.
Current correlations for the transport of interacting electrons through parallel quantum dots in a photon cavity

NASA Astrophysics Data System (ADS)

Gudmundsson, Vidar; Abdullah, Nzar Rauf; Sitek, Anna; Goan, Hsi-Sheng; Tang, Chi-Shung; Manolescu, Andrei

2018-06-01

We calculate the current correlations for the steady-state electron transport through multi-level parallel quantum dots embedded in a short quantum wire, that is placed in a non-perfect photon cavity. We account for the electron-electron Coulomb interaction, and the para- and diamagnetic electron-photon interactions with a stepwise scheme of configuration interactions and truncation of the many-body Fock spaces. In the spectral density of the temporal current-current correlations we identify all the transitions, radiative and non-radiative, active in the system in order to maintain the steady state. We observe strong signs of two types of Rabi oscillations.
Atmospheric Blocking and Atlantic Multi-Decadal Ocean Variability

NASA Technical Reports Server (NTRS)

Hakkinen, Sirpa; Rhines, Peter B.; Worthen, Denise L.

2011-01-01

Atmospheric blocking over the northern North Atlantic involves isolation of large regions of air from the westerly circulation for 5-14 days or more. From a recent 20th century atmospheric reanalysis (1,2) winters with more frequent blocking persist over several decades and correspond to a warm North Atlantic Ocean, in-phase with Atlantic multi-decadal ocean variability (AMV). Ocean circulation is forced by wind-stress curl and related air/sea heat exchange, and we find that their space-time structure is associated with dominant blocking patterns: weaker ocean gyres and weaker heat exchange contribute to the warm phase of AMV. Increased blocking activity extending from Greenland to British Isles is evident when winter blocking days of the cold years (1900-1929) are subtracted from those of the warm years (1939-1968).
a Spatiotemporal Aggregation Query Method Using Multi-Thread Parallel Technique Based on Regional Division

NASA Astrophysics Data System (ADS)

Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.

2015-07-01

Existing spatiotemporal database supports spatiotemporal aggregation query over massive moving objects datasets. Due to the large amounts of data and single-thread processing method, the query speed cannot meet the application requirements. On the other hand, the query efficiency is more sensitive to spatial variation then temporal variation. In this paper, we proposed a spatiotemporal aggregation query method using multi-thread parallel technique based on regional divison and implemented it on the server. Concretely, we divided the spatiotemporal domain into several spatiotemporal cubes, computed spatiotemporal aggregation on all cubes using the technique of multi-thread parallel processing, and then integrated the query results. By testing and analyzing on the real datasets, this method has improved the query speed significantly.
Optimizing transformations of stencil operations for parallel cache-based architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bassetti, F.; Davis, K.

This paper describes a new technique for optimizing serial and parallel stencil- and stencil-like operations for cache-based architectures. This technique takes advantage of the semantic knowledge implicity in stencil-like computations. The technique is implemented as a source-to-source program transformation; because of its specificity it could not be expected of a conventional compiler. Empirical results demonstrate a uniform factor of two speedup. The experiments clearly show the benefits of this technique to be a consequence, as intended, of the reduction in cache misses. The test codes are based on a 5-point stencil obtained by the discretization of the Poisson equation andmore » applied to a two-dimensional uniform grid using the Jacobi method as an iterative solver. Results are presented for a 1-D tiling for a single processor, and in parallel using 1-D data partition. For the parallel case both blocking and non-blocking communication are tested. The same scheme of experiments has bee n performed for the 2-D tiling case. However, for the parallel case the 2-D partitioning is not discussed here, so the parallel case handled for 2-D is 2-D tiling with 1-D data partitioning.« less
A parallel adaptive mesh refinement algorithm

NASA Technical Reports Server (NTRS)

Quirk, James J.; Hanebutte, Ulf R.

1993-01-01

Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
Application of a Scalable, Parallel, Unstructured-Grid-Based Navier-Stokes Solver

NASA Technical Reports Server (NTRS)

Parikh, Paresh

2001-01-01

A parallel version of an unstructured-grid based Navier-Stokes solver, USM3Dns, previously developed for efficient operation on a variety of parallel computers, has been enhanced to incorporate upgrades made to the serial version. The resultant parallel code has been extensively tested on a variety of problems of aerospace interest and on two sets of parallel computers to understand and document its characteristics. An innovative grid renumbering construct and use of non-blocking communication are shown to produce superlinear computing performance. Preliminary results from parallelization of a recently introduced "porous surface" boundary condition are also presented.
Implicit gas-kinetic unified algorithm based on multi-block docking grid for multi-body reentry flows covering all flow regimes

NASA Astrophysics Data System (ADS)

Peng, Ao-Ping; Li, Zhi-Hui; Wu, Jun-Lin; Jiang, Xin-Yu

2016-12-01

Based on the previous researches of the Gas-Kinetic Unified Algorithm (GKUA) for flows from highly rarefied free-molecule transition to continuum, a new implicit scheme of cell-centered finite volume method is presented for directly solving the unified Boltzmann model equation covering various flow regimes. In view of the difficulty in generating the single-block grid system with high quality for complex irregular bodies, a multi-block docking grid generation method is designed on the basis of data transmission between blocks, and the data structure is constructed for processing arbitrary connection relations between blocks with high efficiency and reliability. As a result, the gas-kinetic unified algorithm with the implicit scheme and multi-block docking grid has been firstly established and used to solve the reentry flow problems around the multi-bodies covering all flow regimes with the whole range of Knudsen numbers from 10 to 3.7E-6. The implicit and explicit schemes are applied to computing and analyzing the supersonic flows in near-continuum and continuum regimes around a circular cylinder with careful comparison each other. It is shown that the present algorithm and modelling possess much higher computational efficiency and faster converging properties. The flow problems including two and three side-by-side cylinders are simulated from highly rarefied to near-continuum flow regimes, and the present computed results are found in good agreement with the related DSMC simulation and theoretical analysis solutions, which verify the good accuracy and reliability of the present method. It is observed that the spacing of the multi-body is smaller, the cylindrical throat obstruction is greater with the flow field of single-body asymmetrical more obviously and the normal force coefficient bigger. While in the near-continuum transitional flow regime of near-space flying surroundings, the spacing of the multi-body increases to six times of the diameter of the single-body, the interference effects of the multi-bodies tend to be negligible. The computing practice has confirmed that it is feasible for the present method to compute the aerodynamics and reveal flow mechanism around complex multi-body vehicles covering all flow regimes from the gas-kinetic point of view of solving the unified Boltzmann model velocity distribution function equation.
Parallel and fault-tolerant algorithms for hypercube multiprocessors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aykanat, C.

1988-01-01

Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vectormore » hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.« less
Level-Set Simulation of Viscous Free Surface Flow Around a Commercial Hull Form

DTIC Science & Technology

2005-04-15

Abstract The viscous free surface flow around a 3600 TEU KRISO Container Ship is computed using the finite volume based multi-block RANS code, WAVIS...developed at KRISO . The free surface is captured with the Level-set method and the realizable k-ε model is employed for turbulence closure. The...computations are done for a 3600 TEU container ship of Korea Research Institute of Ships & Ocean Engineering, KORDI (hereafter, KRISO ) selected as

Parallel transformation of K-SVD solar image denoising algorithm

NASA Astrophysics Data System (ADS)

Liang, Youwen; Tian, Yu; Li, Mei

2017-02-01

The images obtained by observing the sun through a large telescope always suffered with noise due to the low SNR. K-SVD denoising algorithm can effectively remove Gauss white noise. Training dictionaries for sparse representations is a time consuming task, due to the large size of the data involved and to the complexity of the training algorithms. In this paper, an OpenMP parallel programming language is proposed to transform the serial algorithm to the parallel version. Data parallelism model is used to transform the algorithm. Not one atom but multiple atoms updated simultaneously is the biggest change. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. Speedup of the program is 13.563 in condition of using 16 cores. This parallel version can fully utilize the multi-core CPU hardware resources, greatly reduce running time and easily to transplant in multi-core platform.
A Scalable, Parallel Approach for Multi-Point, High-Fidelity Aerostructural Optimization of Aircraft Configurations

NASA Astrophysics Data System (ADS)

Kenway, Gaetan K. W.

This thesis presents new tools and techniques developed to address the challenging problem of high-fidelity aerostructural optimization with respect to large numbers of design variables. A new mesh-movement scheme is developed that is both computationally efficient and sufficiently robust to accommodate large geometric design changes and aerostructural deformations. A fully coupled Newton-Krylov method is presented that accelerates the convergence of aerostructural systems and provides a 20% performance improvement over the traditional nonlinear block Gauss-Seidel approach and can handle more exible structures. A coupled adjoint method is used that efficiently computes derivatives for a gradient-based optimization algorithm. The implementation uses only machine accurate derivative techniques and is verified to yield fully consistent derivatives by comparing against the complex step method. The fully-coupled large-scale coupled adjoint solution method is shown to have 30% better performance than the segregated approach. The parallel scalability of the coupled adjoint technique is demonstrated on an Euler Computational Fluid Dynamics (CFD) model with more than 80 million state variables coupled to a detailed structural finite-element model of the wing with more than 1 million degrees of freedom. Multi-point high-fidelity aerostructural optimizations of a long-range wide-body, transonic transport aircraft configuration are performed using the developed techniques. The aerostructural analysis employs Euler CFD with a 2 million cell mesh and a structural finite element model with 300 000 DOF. Two design optimization problems are solved: one where takeoff gross weight is minimized, and another where fuel burn is minimized. Each optimization uses a multi-point formulation with 5 cruise conditions and 2 maneuver conditions. The optimization problems have 476 design variables are optimal results are obtained within 36 hours of wall time using 435 processors. The TOGW minimization results in a 4.2% reduction in TOGW with a 6.6% fuel burn reduction, while the fuel burn optimization resulted in a 11.2% fuel burn reduction with no change to the takeoff gross weight.
Synthetic velocity gradient map of the San Francisco Bay region, California, supports use of average block velocities to estimate fault slip rate where effective locking depth is small relative to inter-fault distance

NASA Astrophysics Data System (ADS)

Graymer, R. W.; Simpson, R. W.

2014-12-01

Graymer and Simpson (2013, AGU Fall Meeting) showed that in a simple 2D multi-fault system (vertical, parallel, strike-slip faults bounding blocks without strong material property contrasts) slip rate on block-bounding faults can be reasonably estimated by the difference between the mean velocity of adjacent blocks if the ratio of the effective locking depth to the distance between the faults is 1/3 or less ("effective" locking depth is a synthetic parameter taking into account actual locking depth, fault creep, and material properties of the fault zone). To check the validity of that observation for a more complex 3D fault system and a realistic distribution of observation stations, we developed a synthetic suite of GPS velocities from a dislocation model, with station location and fault parameters based on the San Francisco Bay region. Initial results show that if the effective locking depth is set at the base of the seismogenic zone (about 12-15 km), about 1/2 the interfault distance, the resulting synthetic velocity observations, when clustered, do a poor job of returning the input fault slip rates. However, if the apparent locking depth is set at 1/2 the distance to the base of the seismogenic zone, or about 1/4 the interfault distance, the synthetic velocity field does a good job of returning the input slip rates except where the fault is in a strong restraining orientation relative to block motion or where block velocity is not well defined (for example west of the northern San Andreas Fault where there are no observations to the west in the ocean). The question remains as to where in the real world a low effective locking depth could usefully model fault behavior. Further tests are planned to define the conditions where average cluster-defined block velocities can be used to reliably estimate slip rates on block-bounding faults. These rates are an important ingredient in earthquake hazard estimation, and another tool to provide them should be useful.
Macro-actor execution on multilevel data-driven architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gaudiot, J.L.; Najjar, W.

1988-12-31

The data-flow model of computation brings to multiprocessors high programmability at the expense of increased overhead. Applying the model at a higher level leads to better performance but also introduces loss of parallelism. We demonstrate here syntax directed program decomposition methods for the creation of large macro-actors in numerical algorithms. In order to alleviate some of the problems introduced by the lower resolution interpretation, we describe a multi-level of resolution and analyze the requirements for its actual hardware and software integration.
Optimization of computations for adjoint field and Jacobian needed in 3D CSEM inversion

NASA Astrophysics Data System (ADS)

Dehiya, Rahul; Singh, Arun; Gupta, Pravin K.; Israil, M.

2017-01-01

We present the features and results of a newly developed code, based on Gauss-Newton optimization technique, for solving three-dimensional Controlled-Source Electromagnetic inverse problem. In this code a special emphasis has been put on representing the operations by block matrices for conjugate gradient iteration. We show how in the computation of Jacobian, the matrix formed by differentiation of system matrix can be made independent of frequency to optimize the operations at conjugate gradient step. The coarse level parallel computing, using OpenMP framework, is used primarily due to its simplicity in implementation and accessibility of shared memory multi-core computing machine to almost anyone. We demonstrate how the coarseness of modeling grid in comparison to source (comp`utational receivers) spacing can be exploited for efficient computing, without compromising the quality of the inverted model, by reducing the number of adjoint calls. It is also demonstrated that the adjoint field can even be computed on a grid coarser than the modeling grid without affecting the inversion outcome. These observations were reconfirmed using an experiment design where the deviation of source from straight tow line is considered. Finally, a real field data inversion experiment is presented to demonstrate robustness of the code.
Research and application of multi-hydrogen acidizing technology of low-permeability reservoirs for increasing water injection

NASA Astrophysics Data System (ADS)

Ning, Mengmeng; Che, Hang; Kong, Weizhong; Wang, Peng; Liu, Bingxiao; Xu, Zhengdong; Wang, Xiaochao; Long, Changjun; Zhang, Bin; Wu, Youmei

2017-12-01

The physical characteristics of Xiliu 10 Block reservoir is poor, it has strong reservoir inhomogeneity between layers and high kaolinite content of the reservoir, the scaling trend of fluid is serious, causing high block injection well pressure and difficulty in achieving injection requirements. In the past acidizing process, the reaction speed with mineral is fast, the effective distance is shorter and It is also easier to lead to secondary sedimentation in conventional mud acid system. On this point, we raised multi-hydrogen acid technology, multi-hydrogen acid release hydrogen ions by multistage ionization which could react with pore blockage, fillings and skeletal effects with less secondary pollution. Multi-hydrogen acid system has advantages as moderate speed, deep penetration, clay low corrosion rate, wet water and restrains precipitation, etc. It can reach the goal of plug removal in deep stratum. The field application result shows that multi-hydrogen acid plug removal method has good effects on application in low permeability reservoir in Block Xiliu 10.
Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

NASA Technical Reports Server (NTRS)

OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)

1998-01-01

This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Application of a hybrid MPI/OpenMP approach for parallel groundwater model calibration using multi-core computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tang, Guoping; D'Azevedo, Ed F; Zhang, Fan

2010-01-01

Calibration of groundwater models involves hundreds to thousands of forward solutions, each of which may solve many transient coupled nonlinear partial differential equations, resulting in a computationally intensive problem. We describe a hybrid MPI/OpenMP approach to exploit two levels of parallelisms in software and hardware to reduce calibration time on multi-core computers. HydroGeoChem 5.0 (HGC5) is parallelized using OpenMP for direct solutions for a reactive transport model application, and a field-scale coupled flow and transport model application. In the reactive transport model, a single parallelizable loop is identified to account for over 97% of the total computational time using GPROF.more » Addition of a few lines of OpenMP compiler directives to the loop yields a speedup of about 10 on a 16-core compute node. For the field-scale model, parallelizable loops in 14 of 174 HGC5 subroutines that require 99% of the execution time are identified. As these loops are parallelized incrementally, the scalability is found to be limited by a loop where Cray PAT detects over 90% cache missing rates. With this loop rewritten, similar speedup as the first application is achieved. The OpenMP-parallelized code can be run efficiently on multiple workstations in a network or multiple compute nodes on a cluster as slaves using parallel PEST to speedup model calibration. To run calibration on clusters as a single task, the Levenberg Marquardt algorithm is added to HGC5 with the Jacobian calculation and lambda search parallelized using MPI. With this hybrid approach, 100 200 compute cores are used to reduce the calibration time from weeks to a few hours for these two applications. This approach is applicable to most of the existing groundwater model codes for many applications.« less
Portable parallel stochastic optimization for the design of aeropropulsion components

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Rhodes, G. S.

1994-01-01

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Haoqiang; anMey, Dieter; Hatay, Ferhat F.

2003-01-01

With the advent of parallel hardware and software technologies users are faced with the challenge to choose a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. In this study we will compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.
Multi-shape memory polymers achieved by the spatio-assembly of 3D printable thermoplastic building blocks.

PubMed

Li, Hongze; Gao, Xiang; Luo, Yingwu

2016-04-07

Multi-shape memory polymers were prepared by the macroscale spatio-assembly of building blocks in this work. The building blocks were methyl acrylate-co-styrene (MA-co-St) copolymers, which have the St-block-(St-random-MA)-block-St tri-block chain sequence. This design ensures that their transition temperatures can be adjusted over a wide range by varying the composition of the middle block. The two St blocks at the chain ends can generate a crosslink network in the final device to achieve strong bonding force between building blocks and the shape memory capacity. Due to their thermoplastic properties, 3D printing was employed for the spatio-assembly to build devices. This method is capable of introducing many transition phases into one device and preparing complicated shapes via 3D printing. The device can perform a complex action via a series of shape changes. Besides, this method can avoid the difficult programing of a series of temporary shapes. The control of intermediate temporary shapes was realized via programing the shapes and locations of building blocks in the final device.
Parallel Algorithms for Image Analysis.

DTIC Science & Technology

1982-06-01

8217 _ _ _ _ _ _ _ 4. TITLE (aid Subtitle) S. TYPE OF REPORT & PERIOD COVERED PARALLEL ALGORITHMS FOR IMAGE ANALYSIS TECHNICAL 6. PERFORMING O4G. REPORT NUMBER TR-1180...Continue on reverse side it neceesary aid Identlfy by block number) Image processing; image analysis ; parallel processing; cellular computers. 20... IMAGE ANALYSIS TECHNICAL 6. PERFORMING ONG. REPORT NUMBER TR-1180 - 7. AUTHOR(&) S. CONTRACT OR GRANT NUMBER(s) Azriel Rosenfeld AFOSR-77-3271 9
ABINIT: Plane-Wave-Based Density-Functional Theory on High Performance Computers

NASA Astrophysics Data System (ADS)

Torrent, Marc

2014-03-01

For several years, a continuous effort has been produced to adapt electronic structure codes based on Density-Functional Theory to the future computing architectures. Among these codes, ABINIT is based on a plane-wave description of the wave functions which allows to treat systems of any kind. Porting such a code on petascale architectures pose difficulties related to the many-body nature of the DFT equations. To improve the performances of ABINIT - especially for what concerns standard LDA/GGA ground-state and response-function calculations - several strategies have been followed: A full multi-level parallelisation MPI scheme has been implemented, exploiting all possible levels and distributing both computation and memory. It allows to increase the number of distributed processes and could not be achieved without a strong restructuring of the code. The core algorithm used to solve the eigen problem (``Locally Optimal Blocked Congugate Gradient''), a Blocked-Davidson-like algorithm, is based on a distribution of processes combining plane-waves and bands. In addition to the distributed memory parallelization, a full hybrid scheme has been implemented, using standard shared-memory directives (openMP/openACC) or porting some comsuming code sections to Graphics Processing Units (GPU). As no simple performance model exists, the complexity of use has been increased; the code efficiency strongly depends on the distribution of processes among the numerous levels. ABINIT is able to predict the performances of several process distributions and automatically choose the most favourable one. On the other hand, a big effort has been carried out to analyse the performances of the code on petascale architectures, showing which sections of codes have to be improved; they all are related to Matrix Algebra (diagonalisation, orthogonalisation). The different strategies employed to improve the code scalability will be described. They are based on an exploration of new diagonalization algorithm, as well as the use of external optimized librairies. Part of this work has been supported by the european Prace project (PaRtnership for Advanced Computing in Europe) in the framework of its workpackage 8.
Data preprocessing for determining outer/inner parallelization in the nested loop problem using OpenMP

NASA Astrophysics Data System (ADS)

Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.

2017-07-01

Multi-thread programming using OpenMP on the shared-memory architecture with hyperthreading technology allows the resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, its speedup depends on the ability of the processor to execute threads in limited quantities, especially the sequential algorithm which contains a nested loop. The number of the outer loop iterations is greater than the maximum number of threads that can be executed by a processor. The thread distribution technique that had been found previously only be applied by the high-level programmer. This paper generates a parallelization procedure for low-level programmer in dealing with 2-level nested loop problems with the maximum number of threads that can be executed by a processor is smaller than the number of the outer loop iterations. Data preprocessing which is related to the number of the outer loop and the inner loop iterations, the computational time required to execute each iteration and the maximum number of threads that can be executed by a processor are used as a strategy to determine which parallel region that will produce optimal speedup.
Investigation of upwind, multigrid, multiblock numerical schemes for three dimensional flows. Volume 1: Runge-Kutta methods for a thin layer Navier-Stokes solver

NASA Technical Reports Server (NTRS)

Cannizzaro, Frank E.; Ash, Robert L.

1992-01-01

A state-of-the-art computer code has been developed that incorporates a modified Runge-Kutta time integration scheme, upwind numerical techniques, multigrid acceleration, and multi-block capabilities (RUMM). A three-dimensional thin-layer formulation of the Navier-Stokes equations is employed. For turbulent flow cases, the Baldwin-Lomax algebraic turbulence model is used. Two different upwind techniques are available: van Leer's flux-vector splitting and Roe's flux-difference splitting. Full approximation multi-grid plus implicit residual and corrector smoothing were implemented to enhance the rate of convergence. Multi-block capabilities were developed to provide geometric flexibility. This feature allows the developed computer code to accommodate any grid topology or grid configuration with multiple topologies. The results shown in this dissertation were chosen to validate the computer code and display its geometric flexibility, which is provided by the multi-block structure.
The Parallel System for Integrating Impact Models and Sectors (pSIMS)

NASA Technical Reports Server (NTRS)

Elliott, Joshua; Kelly, David; Chryssanthacopoulos, James; Glotter, Michael; Jhunjhnuwala, Kanika; Best, Neil; Wilde, Michael; Foster, Ian

2014-01-01

We present a framework for massively parallel climate impact simulations: the parallel System for Integrating Impact Models and Sectors (pSIMS). This framework comprises a) tools for ingesting and converting large amounts of data to a versatile datatype based on a common geospatial grid; b) tools for translating this datatype into custom formats for site-based models; c) a scalable parallel framework for performing large ensemble simulations, using any one of a number of different impacts models, on clusters, supercomputers, distributed grids, or clouds; d) tools and data standards for reformatting outputs to common datatypes for analysis and visualization; and e) methodologies for aggregating these datatypes to arbitrary spatial scales such as administrative and environmental demarcations. By automating many time-consuming and error-prone aspects of large-scale climate impacts studies, pSIMS accelerates computational research, encourages model intercomparison, and enhances reproducibility of simulation results. We present the pSIMS design and use example assessments to demonstrate its multi-model, multi-scale, and multi-sector versatility.
Communication-avoiding symmetric-indefinite factorization

DOE PAGES

Ballard, Grey Malone; Becker, Dulcenia; Demmel, James; ...

2014-11-13

We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen's triangular tridiagonalization. It factors a dense symmetric matrix A as the product A=PLTL TP T where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. Asmore » a result, the current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.« less
Communication-avoiding symmetric-indefinite factorization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ballard, Grey Malone; Becker, Dulcenia; Demmel, James

We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen's triangular tridiagonalization. It factors a dense symmetric matrix A as the product A=PLTL TP T where P is a permutation matrix, L is lower triangular, and T is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. Asmore » a result, the current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.« less
Together We STRIDE: A quasi-experimental trial testing the effectiveness of a multi-level obesity intervention for Hispanic children in rural communities.

PubMed

Ko, Linda K; Rillamas-Sun, Eileen; Bishop, Sonia; Cisneros, Oralia; Holte, Sarah; Thompson, Beti

2018-04-01

Hispanic children are disproportionally overweight and obese compared to their non-Hispanic white counterparts in the US. Community-wide, multi-level interventions have been successful to promote healthier nutrition, increased physical activity (PA), and weight loss. Using community-based participatory approach (CBPR) that engages community members in rural Hispanic communities is a promising way to promote behavior change, and ultimately weight loss among Hispanic children. Led by a community-academic partnership, the Together We STRIDE (Strategizing Together Relevant Interventions for Diet and Exercise) aims to test the effectiveness of a community-wide, multi-level intervention to promote healthier diets, increased PA, and weight loss among Hispanic children. The Together We STRIDE is a parallel quasi-experimental trial with a goal of recruiting 900 children aged 8-12 years nested within two communities (one intervention and one comparison). Children will be recruited from their respective elementary schools. Components of the 2-year multi-level intervention include comic books (individual-level), multi-generational nutrition and PA classes (family-level), teacher-led PA breaks and media literacy education (school-level), family nights, a farmer's market and a community PA event (known as ciclovia) at the community-level. Children from the comparison community will receive two newsletters. Height and weight measures will be collected from children in both communities at three time points (baseline, 6-months, and 18-months). The Together We STRIDE study aims to promote healthier diet and increased PA to produce healthy weight among Hispanic children. The use of CBPR approach and the engagement of the community will springboard strategies for intervention' sustainability. Clinical Trials Registration Number: NCT02982759 Retrospectively registered. Copyright © 2018 Elsevier Inc. All rights reserved.
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

Multi-thread parallel algorithm for reconstructing 3D large-scale porous structures

NASA Astrophysics Data System (ADS)

Ju, Yang; Huang, Yaohui; Zheng, Jiangtao; Qian, Xu; Xie, Heping; Zhao, Xi

2017-04-01

Geomaterials inherently contain many discontinuous, multi-scale, geometrically irregular pores, forming a complex porous structure that governs their mechanical and transport properties. The development of an efficient reconstruction method for representing porous structures can significantly contribute toward providing a better understanding of the governing effects of porous structures on the properties of porous materials. In order to improve the efficiency of reconstructing large-scale porous structures, a multi-thread parallel scheme was incorporated into the simulated annealing reconstruction method. In the method, four correlation functions, which include the two-point probability function, the linear-path functions for the pore phase and the solid phase, and the fractal system function for the solid phase, were employed for better reproduction of the complex well-connected porous structures. In addition, a random sphere packing method and a self-developed pre-conditioning method were incorporated to cast the initial reconstructed model and select independent interchanging pairs for parallel multi-thread calculation, respectively. The accuracy of the proposed algorithm was evaluated by examining the similarity between the reconstructed structure and a prototype in terms of their geometrical, topological, and mechanical properties. Comparisons of the reconstruction efficiency of porous models with various scales indicated that the parallel multi-thread scheme significantly shortened the execution time for reconstruction of a large-scale well-connected porous model compared to a sequential single-thread procedure.
Multi-line triggering and interdigitated electrode structure for photoconductive semiconductor switches

DOEpatents

Mar, Alan [Albuquerque, NM; Zutavern, Fred J [Albuquerque, NM; Loubriel, Guillermo [Albuquerque, NM

2007-02-06

An improved photoconductive semiconductor switch comprises multiple-line optical triggering of multiple, high-current parallel filaments between the switch electrodes. The switch can also have a multi-gap, interdigitated electrode for the generation of additional parallel filaments. Multi-line triggering can increase the switch lifetime at high currents by increasing the number of current filaments and reducing the current density at the contact electrodes in a controlled manner. Furthermore, the improved switch can mitigate the degradation of switching conditions with increased number of firings of the switch.
Meros Preconditioner Package

DOE Office of Scientific and Technical Information (OSTI.GOV)

2004-04-01

Meros uses the compositional, aggregation, and overload operator capabilities of TSF to provide an object-oriented package providing segregated/block preconditioners for linear systems related to fully-coupled Navier-Stokes problems. This class of preconditioners exploits the special properties of these problems to segregate the equations and use multi-level preconditioners (through ML) on the matrix sub-blocks. Several preconditioners are provided, including the Fp and BFB preconditioners of Kay & Loghin and Silvester, Elman, Kay & Wathen. The overall performance and scalability of these preconditioners approaches that of multigrid for certain types of problems. Meros also provides more traditional pressure projection methods including SIMPLE andmore » SIMPLEC.« less
Harmony search algorithm: application to the redundancy optimization problem

NASA Astrophysics Data System (ADS)

Nahas, Nabil; Thien-My, Dao

2010-09-01

The redundancy optimization problem is a well known NP-hard problem which involves the selection of elements and redundancy levels to maximize system performance, given different system-level constraints. This article presents an efficient algorithm based on the harmony search algorithm (HSA) to solve this optimization problem. The HSA is a new nature-inspired algorithm which mimics the improvization process of music players. Two kinds of problems are considered in testing the proposed algorithm, with the first limited to the binary series-parallel system, where the problem consists of a selection of elements and redundancy levels used to maximize the system reliability given various system-level constraints; the second problem for its part concerns the multi-state series-parallel systems with performance levels ranging from perfect operation to complete failure, and in which identical redundant elements are included in order to achieve a desirable level of availability. Numerical results for test problems from previous research are reported and compared. The results of HSA showed that this algorithm could provide very good solutions when compared to those obtained through other approaches.
Bin-Hash Indexing: A Parallel Method for Fast Query Processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng

2008-06-27

This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for the emerging commodity of parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPU). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that can not bemore » resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records are able to be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be simultaneously evaluated at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.« less
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness

NASA Astrophysics Data System (ADS)

Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.

2014-09-01

Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Compressed multi-block local binary pattern for object tracking

NASA Astrophysics Data System (ADS)

Li, Tianwen; Gao, Yun; Zhao, Lei; Zhou, Hao

2018-04-01

Both robustness and real-time are very important for the application of object tracking under a real environment. The focused trackers based on deep learning are difficult to satisfy with the real-time of tracking. Compressive sensing provided a technical support for real-time tracking. In this paper, an object can be tracked via a multi-block local binary pattern feature. The feature vector was extracted based on the multi-block local binary pattern feature, which was compressed via a sparse random Gaussian matrix as the measurement matrix. The experiments showed that the proposed tracker ran in real-time and outperformed the existed compressive trackers based on Haar-like feature on many challenging video sequences in terms of accuracy and robustness.
An efficient predictor-corrector-based dynamic mesh method for multi-block structured grid with extremely large deformation and its applications

NASA Astrophysics Data System (ADS)

Guo, Tongqing; Chen, Hao; Lu, Zhiliang

2018-05-01

Aiming at extremely large deformation, a novel predictor-corrector-based dynamic mesh method for multi-block structured grid is proposed. In this work, the dynamic mesh generation is completed in three steps. At first, some typical dynamic positions are selected and high-quality multi-block grids with the same topology are generated at those positions. Then, Lagrange interpolation method is adopted to predict the dynamic mesh at any dynamic position. Finally, a rapid elastic deforming technique is used to correct the small deviation between the interpolated geometric configuration and the actual instantaneous one. Compared with the traditional methods, the results demonstrate that the present method shows stronger deformation ability and higher dynamic mesh quality.
Family, Community and Clinic Collaboration to Treat Overweight and Obese Children: Stanford GOALS -- a Randomized Controlled Trial of a Three-Year, Multi-Component, Multi-Level, Multi-Setting Intervention

PubMed Central

Robinson, Thomas N.; Matheson, Donna; Desai, Manisha; Wilson, Darrell M.; Weintraub, Dana L.; Haskell, William L.; McClain, Arianna; McClure, Samuel; Banda, Jorge; Sanders, Lee M.; Haydel, K. Farish; Killen, Joel D.

2013-01-01

Objective To test the effects of a three-year, community-based, multi-component, multi-level, multi-setting (MMM) approach for treating overweight and obese children. Design Two-arm, parallel group, randomized controlled trial with measures at baseline, 12, 24, and 36 months after randomization. Participants Seven through eleven year old, overweight and obese children (BMI ≥ 85th percentile) and their parents/caregivers recruited from community locations in low-income, primarily Latino neighborhoods in Northern California. Interventions Families are randomized to the MMM intervention versus a community health education active-placebo comparison intervention. Interventions last for three years for each participant. The MMM intervention includes a community-based after school team sports program designed specifically for overweight and obese children, a home-based family intervention to reduce screen time, alter the home food/eating environment, and promote self-regulatory skills for eating and activity behavior change, and a primary care behavioral counseling intervention linked to the community and home interventions. The active-placebo comparison intervention includes semi-annual health education home visits, monthly health education newsletters for children and for parents/guardians, and a series of community-based health education events for families. Main Outcome Measure Body mass index trajectory over the three-year study. Secondary outcome measures include waist circumference, triceps skinfold thickness, accelerometer-measured physical activity, 24-hour dietary recalls, screen time and other sedentary behaviors, blood pressure, fasting lipids, glucose, insulin, hemoglobin A1c, C-reactive protein, alanine aminotransferase, and psychosocial measures. Conclusions The Stanford GOALS trial is testing the efficacy of a novel community-based multi-component, multi-level, multi-setting treatment for childhood overweight and obesity in low-income, Latino families. PMID:24028942
Family, community and clinic collaboration to treat overweight and obese children: Stanford GOALS-A randomized controlled trial of a three-year, multi-component, multi-level, multi-setting intervention.

PubMed

Robinson, Thomas N; Matheson, Donna; Desai, Manisha; Wilson, Darrell M; Weintraub, Dana L; Haskell, William L; McClain, Arianna; McClure, Samuel; Banda, Jorge A; Sanders, Lee M; Haydel, K Farish; Killen, Joel D

2013-11-01

To test the effects of a three-year, community-based, multi-component, multi-level, multi-setting (MMM) approach for treating overweight and obese children. Two-arm, parallel group, randomized controlled trial with measures at baseline, 12, 24, and 36 months after randomization. Seven through eleven year old, overweight and obese children (BMI ≥ 85th percentile) and their parents/caregivers recruited from community locations in low-income, primarily Latino neighborhoods in Northern California. Families are randomized to the MMM intervention versus a community health education active-placebo comparison intervention. Interventions last for three years for each participant. The MMM intervention includes a community-based after school team sports program designed specifically for overweight and obese children, a home-based family intervention to reduce screen time, alter the home food/eating environment, and promote self-regulatory skills for eating and activity behavior change, and a primary care behavioral counseling intervention linked to the community and home interventions. The active-placebo comparison intervention includes semi-annual health education home visits, monthly health education newsletters for children and for parents/guardians, and a series of community-based health education events for families. Body mass index trajectory over the three-year study. Secondary outcome measures include waist circumference, triceps skinfold thickness, accelerometer-measured physical activity, 24-hour dietary recalls, screen time and other sedentary behaviors, blood pressure, fasting lipids, glucose, insulin, hemoglobin A1c, C-reactive protein, alanine aminotransferase, and psychosocial measures. The Stanford GOALS trial is testing the efficacy of a novel community-based multi-component, multi-level, multi-setting treatment for childhood overweight and obesity in low-income, Latino families. © 2013 Elsevier Inc. All rights reserved.
A new digital pulse power supply in heavy ion research facility in Lanzhou

NASA Astrophysics Data System (ADS)

Wang, Rongkun; Chen, Youxin; Huang, Yuzhen; Gao, Daqing; Zhou, Zhongzu; Yan, Huaihai; Zhao, Jiang; Shi, Chunfeng; Wu, Fengjun; Yan, Hongbin; Xia, Jiawen; Yuan, Youjin

2013-11-01

To meet the increasing requirements of the Heavy Ion Research Facility in Lanzhou-Cooler Storage Ring (HIRFL-CSR), a new digital pulse power supply, which employs multi-level converter, was designed. This power supply was applied with a multi H-bridge converters series-parallel connection topology. A new control model named digital power supply regulator system (DPSRS) was proposed, and a pulse power supply prototype based on DPSRS has been built and tested. The experimental results indicate that tracking error and ripple current meet the requirements of this design. The achievement of prototype provides a perfect model for HIRFL-CSR power supply system.
Scalability and Portability of Two Parallel Implementations of ADI

NASA Technical Reports Server (NTRS)

Phung, Thanh; VanderWijngaart, Rob F.

1994-01-01

Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.
Dual-Tree Complex Wavelet Transform and Image Block Residual-Based Multi-Focus Image Fusion in Visual Sensor Networks

PubMed Central

Yang, Yong; Tong, Song; Huang, Shuying; Lin, Pan

2014-01-01

This paper presents a novel framework for the fusion of multi-focus images explicitly designed for visual sensor network (VSN) environments. Multi-scale based fusion methods can often obtain fused images with good visual effect. However, because of the defects of the fusion rules, it is almost impossible to completely avoid the loss of useful information in the thus obtained fused images. The proposed fusion scheme can be divided into two processes: initial fusion and final fusion. The initial fusion is based on a dual-tree complex wavelet transform (DTCWT). The Sum-Modified-Laplacian (SML)-based visual contrast and SML are employed to fuse the low- and high-frequency coefficients, respectively, and an initial composited image is obtained. In the final fusion process, the image block residuals technique and consistency verification are used to detect the focusing areas and then a decision map is obtained. The map is used to guide how to achieve the final fused image. The performance of the proposed method was extensively tested on a number of multi-focus images, including no-referenced images, referenced images, and images with different noise levels. The experimental results clearly indicate that the proposed method outperformed various state-of-the-art fusion methods, in terms of both subjective and objective evaluations, and is more suitable for VSNs. PMID:25587878
Dual-tree complex wavelet transform and image block residual-based multi-focus image fusion in visual sensor networks.

PubMed

Yang, Yong; Tong, Song; Huang, Shuying; Lin, Pan

2014-11-26

This paper presents a novel framework for the fusion of multi-focus images explicitly designed for visual sensor network (VSN) environments. Multi-scale based fusion methods can often obtain fused images with good visual effect. However, because of the defects of the fusion rules, it is almost impossible to completely avoid the loss of useful information in the thus obtained fused images. The proposed fusion scheme can be divided into two processes: initial fusion and final fusion. The initial fusion is based on a dual-tree complex wavelet transform (DTCWT). The Sum-Modified-Laplacian (SML)-based visual contrast and SML are employed to fuse the low- and high-frequency coefficients, respectively, and an initial composited image is obtained. In the final fusion process, the image block residuals technique and consistency verification are used to detect the focusing areas and then a decision map is obtained. The map is used to guide how to achieve the final fused image. The performance of the proposed method was extensively tested on a number of multi-focus images, including no-referenced images, referenced images, and images with different noise levels. The experimental results clearly indicate that the proposed method outperformed various state-of-the-art fusion methods, in terms of both subjective and objective evaluations, and is more suitable for VSNs.
Compact 0-complete trees: A new method for searching large files

DOE Office of Scientific and Technical Information (OSTI.GOV)

Orlandic, R.; Pfaltz, J.L.

1988-01-26

In this report, a novel approach to ordered retrieval in very large files is developed. The method employs a B-tree like search algorithm that is independent of key type or key length because all keys in index blocks are encoded by a 1 byte surrogate. The replacement of actual key sequences by the 1 byte surrogate ensures a maximal possible fan out and greatly reduces the storage overhead of maintaining access indices. Initially, retrieval in binary trie structure is developed. With the aid of a fairly complex recurrence relation, the rather scraggly binary trie is transformed into compact multi-way searchmore » tree. Then the recurrence relation itself is replaced by an unusually simple search algorithm. Then implementation details and empirical performance results are presented. Reduction of index size by 50%--75% opens up the possibility of replicating system-wide indices for parallel access in distributed databases. 23 figs.« less
Task Assignment Heuristics for Parallel and Distributed CFD Applications

NASA Technical Reports Server (NTRS)

Lopez-Benitez, Noe; Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

This paper proposes a task graph (TG) model to represent a single discrete step of multi-block overset grid computational fluid dynamics (CFD) applications. The TG model is then used to not only balance the computational workload across the overset grids but also to reduce inter-grid communication costs. We have developed a set of task assignment heuristics based on the constraints inherent in this class of CFD problems. Two basic assignments, the smallest task first (STF) and the largest task first (LTF), are first presented. They are then systematically costs. To predict the performance of the proposed task assignment heuristics, extensive performance evaluations are conducted on a synthetic TG with tasks defined in terms of the number of grid points in predetermined overlapping grids. A TG derived from a realistic problem with eight million grid points is also used as a test case.
Neuromorphic Kalman filter implementation in IBM’s TrueNorth

NASA Astrophysics Data System (ADS)

Carney, R.; Bouchard, K.; Calafiura, P.; Clark, D.; Donofrio, D.; Garcia-Sciveres, M.; Livezey, J.

2017-10-01

Following the advent of a post-Moore’s law field of computation, novel architectures continue to emerge. With composite, multi-million connection neuromorphic chips like IBM’s TrueNorth, neural engineering has now become a feasible technology in this novel computing paradigm. High Energy Physics experiments are continuously exploring new methods of computation and data handling, including neuromorphic, to support the growing challenges of the field and be prepared for future commodity computing trends. This work details the first instance of a Kalman filter implementation in IBM’s neuromorphic architecture, TrueNorth, for both parallel and serial spike trains. The implementation is tested on multiple simulated systems and its performance is evaluated with respect to an equivalent non-spiking Kalman filter. The limits of the implementation are explored whilst varying the size of weight and threshold registers, the number of spikes used to encode a state, size of neuron block for spatial encoding, and neuron potential reset schemes.
Adaptive multi-resolution 3D Hartree-Fock-Bogoliubov solver for nuclear structure

NASA Astrophysics Data System (ADS)

Pei, J. C.; Fann, G. I.; Harrison, R. J.; Nazarewicz, W.; Shi, Yue; Thornton, S.

2014-08-01

Background: Complex many-body systems, such as triaxial and reflection-asymmetric nuclei, weakly bound halo states, cluster configurations, nuclear fragments produced in heavy-ion fusion reactions, cold Fermi gases, and pasta phases in neutron star crust, are all characterized by large sizes and complex topologies in which many geometrical symmetries characteristic of ground-state configurations are broken. A tool of choice to study such complex forms of matter is an adaptive multi-resolution wavelet analysis. This method has generated much excitement since it provides a common framework linking many diversified methodologies across different fields, including signal processing, data compression, harmonic analysis and operator theory, fractals, and quantum field theory. Purpose: To describe complex superfluid many-fermion systems, we introduce an adaptive pseudospectral method for solving self-consistent equations of nuclear density functional theory in three dimensions, without symmetry restrictions. Methods: The numerical method is based on the multi-resolution and computational harmonic analysis techniques with a multi-wavelet basis. The application of state-of-the-art parallel programming techniques include sophisticated object-oriented templates which parse the high-level code into distributed parallel tasks with a multi-thread task queue scheduler for each multi-core node. The internode communications are asynchronous. The algorithm is variational and is capable of solving coupled complex-geometric systems of equations adaptively, with functional and boundary constraints, in a finite spatial domain of very large size, limited by existing parallel computer memory. For smooth functions, user-defined finite precision is guaranteed. Results: The new adaptive multi-resolution Hartree-Fock-Bogoliubov (HFB) solver madness-hfb is benchmarked against a two-dimensional coordinate-space solver hfb-ax that is based on the B-spline technique and a three-dimensional solver hfodd that is based on the harmonic-oscillator basis expansion. Several examples are considered, including the self-consistent HFB problem for spin-polarized trapped cold fermions and the Skyrme-Hartree-Fock (+BCS) problem for triaxial deformed nuclei. Conclusions: The new madness-hfb framework has many attractive features when applied to nuclear and atomic problems involving many-particle superfluid systems. Of particular interest are weakly bound nuclear configurations close to particle drip lines, strongly elongated and dinuclear configurations such as those present in fission and heavy-ion fusion, and exotic pasta phases that appear in neutron star crust.
Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms

NASA Astrophysics Data System (ADS)

Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.

2011-12-01

Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 andl6 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to it's size, places huge burdens on the memory infrastructure of today's processors.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

Atmospheric blocking as a traffic jam in the jet stream

NASA Astrophysics Data System (ADS)

Nakamura, N.; Huang, S. Y.

2017-12-01

It is demonstrated using the ERA-Interim product that synoptic to intraseasonal variabilities of extratropical circulation in the boreal storm track regions are strongly affected by the zonal convergence of the column-integrated eastward flux of local wave activity (LWA). In particular, from the multi-year daily samples of LWA fluxes, we find that the wintertime zonal LWA flux in the jet exit regions tends to maximize for an intermediate value of column-averaged LWA. This is because an increasing LWA decelerates the zonal flow, eventually weakening the eastward advection of LWA. From theory we argue that large wave events on the decreasing side of the flux curve with increasing LWA cannot be maintained as a stable steady state. Consistent with this argument, observed states corresponding to that side of flux curve often exhibit local wave breaking and blocking events. A close parallelism exists for the traffic flow problem, in which the traffic flux (traffic density times traffic speed) is often observed to maximize for an intermediate value of traffic density. This is because the traffic speed is controlled not only by the imposed speed limit but also by the traffic density — an increasingly heavy traffic slows down the flow naturally and eventually decreases the flux. Once the flux starts to decrease with an increasing traffic density, a traffic jam kicks in suddenly (Lighthill and Whitham 1955, Richards 1956). The above idea is demonstrated by a simple conceptual model based on the equivalent barotropic PV contour design (Nakamura and Huang 2017, JAS), which predicts a threshold of blocking onset. The idea also suggests that the LWA that gives the `flux capacity,' i.e., the maximum LWA flux at a given location, is a useful predictor of local wave breaking/block formation.
A high-throughput, multi-channel photon-counting detector with picosecond timing

NASA Astrophysics Data System (ADS)

Lapington, J. S.; Fraser, G. W.; Miller, G. M.; Ashton, T. J. R.; Jarron, P.; Despeisse, M.; Powolny, F.; Howorth, J.; Milnes, J.

2009-06-01

High-throughput photon counting with high time resolution is a niche application area where vacuum tubes can still outperform solid-state devices. Applications in the life sciences utilizing time-resolved spectroscopies, particularly in the growing field of proteomics, will benefit greatly from performance enhancements in event timing and detector throughput. The HiContent project is a collaboration between the University of Leicester Space Research Centre, the Microelectronics Group at CERN, Photek Ltd., and end-users at the Gray Cancer Institute and the University of Manchester. The goal is to develop a detector system specifically designed for optical proteomics, capable of high content (multi-parametric) analysis at high throughput. The HiContent detector system is being developed to exploit this niche market. It combines multi-channel, high time resolution photon counting in a single miniaturized detector system with integrated electronics. The combination of enabling technologies; small pore microchannel plate devices with very high time resolution, and high-speed multi-channel ASIC electronics developed for the LHC at CERN, provides the necessary building blocks for a high-throughput detector system with up to 1024 parallel counting channels and 20 ps time resolution. We describe the detector and electronic design, discuss the current status of the HiContent project and present the results from a 64-channel prototype system. In the absence of an operational detector, we present measurements of the electronics performance using a pulse generator to simulate detector events. Event timing results from the NINO high-speed front-end ASIC captured using a fast digital oscilloscope are compared with data taken with the proposed electronic configuration which uses the multi-channel HPTDC timing ASIC.
Parallel LC circuit model for multi-band absorption and preliminary design of radiative cooling.

PubMed

Feng, Rui; Qiu, Jun; Liu, Linhua; Ding, Weiqiang; Chen, Lixue

2014-12-15

We perform a comprehensive analysis of multi-band absorption by exciting magnetic polaritons in the infrared region. According to the independent properties of the magnetic polaritons, we propose a parallel inductance and capacitance(PLC) circuit model to explain and predict the multi-band resonant absorption peaks, which is fully validated by using the multi-sized structure with identical dielectric spacing layer and the multilayer structure with the same strip width. More importantly, we present the application of the PLC circuit model to preliminarily design a radiative cooling structure realized by merging several close peaks together. This omnidirectional and polarization insensitive structure is a good candidate for radiative cooling application.
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization

PubMed Central

Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan

2017-01-01

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization.

PubMed

Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan

2017-08-04

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Beyond Low Rank + Sparse: Multi-scale Low Rank Matrix Decomposition

PubMed Central

Ong, Frank; Lustig, Michael

2016-01-01

We present a natural generalization of the recent low rank + sparse matrix decomposition and consider the decomposition of matrices into components of multiple scales. Such decomposition is well motivated in practice as data matrices often exhibit local correlations in multiple scales. Concretely, we propose a multi-scale low rank modeling that represents a data matrix as a sum of block-wise low rank matrices with increasing scales of block sizes. We then consider the inverse problem of decomposing the data matrix into its multi-scale low rank components and approach the problem via a convex formulation. Theoretically, we show that under various incoherence conditions, the convex program recovers the multi-scale low rank components either exactly or approximately. Practically, we provide guidance on selecting the regularization parameters and incorporate cycle spinning to reduce blocking artifacts. Experimentally, we show that the multi-scale low rank decomposition provides a more intuitive decomposition than conventional low rank methods and demonstrate its effectiveness in four applications, including illumination normalization for face images, motion separation for surveillance videos, multi-scale modeling of the dynamic contrast enhanced magnetic resonance imaging and collaborative filtering exploiting age information. PMID:28450978
Parallelism measurement for base plate of standard artifact with multiple tactile approaches

NASA Astrophysics Data System (ADS)

Ye, Xiuling; Zhao, Yan; Wang, Yiwen; Wang, Zhong; Fu, Luhua; Liu, Changjie

2018-01-01

Nowadays, as workpieces become more precise and more specialized which results in more sophisticated structures and higher accuracy for the artifacts, higher requirements have been put forward for measuring accuracy and measuring methods. As an important method to obtain the size of workpieces, coordinate measuring machine (CMM) has been widely used in many industries. In order to achieve the calibration of a self-developed CMM, it is found that the parallelism of the base plate used for fixing the standard artifact is an important factor which affects the measurement accuracy in the process of studying self-made high-precision standard artifact. And aimed to measure the parallelism of the base plate, by using the existing high-precision CMM, gauge blocks, dial gauge and marble platform with the tactile approach, three methods for parallelism measurement of workpieces are employed, and comparisons are made within the measurement results. The results of experiments show that the final accuracy of all the three methods is able to reach micron level and meets the measurement requirements. Simultaneously, these three approaches are suitable for different measurement conditions which provide a basis for rapid and high-precision measurement under different equipment conditions.
A block-based algorithm for the solution of compressible flows in rotor-stator combinations

NASA Technical Reports Server (NTRS)

Akay, H. U.; Ecer, A.; Beskok, A.

1990-01-01

A block-based solution algorithm is developed for the solution of compressible flows in rotor-stator combinations. The method allows concurrent solution of multiple solution blocks in parallel machines. It also allows a time averaged interaction at the stator-rotor interfaces. Numerical results are presented to illustrate the performance of the algorithm. The effect of the interaction between the stator and rotor is evaluated.
Crystal MD: The massively parallel molecular dynamics software for metal with BCC structure

NASA Astrophysics Data System (ADS)

Hu, Changjun; Bai, He; He, Xinfu; Zhang, Boyao; Nie, Ningming; Wang, Xianmeng; Ren, Yingwen

2017-02-01

Material irradiation effect is one of the most important keys to use nuclear power. However, the lack of high-throughput irradiation facility and knowledge of evolution process, lead to little understanding of the addressed issues. With the help of high-performance computing, we could make a further understanding of micro-level-material. In this paper, a new data structure is proposed for the massively parallel simulation of the evolution of metal materials under irradiation environment. Based on the proposed data structure, we developed the new molecular dynamics software named Crystal MD. The simulation with Crystal MD achieved over 90% parallel efficiency in test cases, and it takes more than 25% less memory on multi-core clusters than LAMMPS and IMD, which are two popular molecular dynamics simulation software. Using Crystal MD, a two trillion particles simulation has been performed on Tianhe-2 cluster.
Multi-Channel Seismic Images of the Mariana Forearc: EW0202 Initial Results

NASA Astrophysics Data System (ADS)

Oakley, A. J.; Goodliffe, A. M.; Taylor, B.; Moore, G. F.; Fryer, P.

2002-12-01

During the Spring of 2002, the Mariana Subduction Factory was surveyed using multi-channel seismics (MCS) as the first major phase of a US-Japanese collaborative NSF-MARGINS funded project. The resulting geophysical transects extend from the Pacific Plate to the West Mariana remnant arc. For details of this survey, including the results from the back-arc, refer to Taylor et al. (this session). The incoming Pacific Plate and its accompanying seamounts are deformed by plate flexure, resulting in extension of the upper crust as it enters the subduction zone. The resultant trench parallel faults dominate the bathymetry and MCS data. Beneath the forearc, in the southern transects near Saipan, the subducting slab is imaged to a distance of 50-60 km arcward. In addition to ubiquitous trench parallel normal faulting, a N-S transect of the forearc clearly shows normal faults perpendicular to the trench resulting from N-S extension. On the east side of the Mariana Ridge, thick sediment packages extend into the forearc. Directly east of Saipan and Tinian, a large, deeply scouring slide mass is imaged. Several serpentine mud volcanoes (Big Blue, Turquoise and Celestial) were imaged on the Mariana Forearc. Deep horizontal reflectors (likely original forearc crust) are imaged under the flanks of some of these seamounts. A possible "throat" reflector is resolved on multiple profiles at the summit of Big Blue, the northern-most seamount in the study area. The flanks of Turquoise seamount terminate in toe thrusts that represent uplift and rotation of surrounding sediments as the volcano grows outward. These thrusts form a basal ridge around the seamount similar to that previously noted encircling Conical Seamount. Furthermore, MCS data has revealed that some forearc highs previously thought to be fault blocks are in actuality mud volcanoes.
An embedded multi-core parallel model for real-time stereo imaging

NASA Astrophysics Data System (ADS)

He, Wenjing; Hu, Jian; Niu, Jingyu; Li, Chuanrong; Liu, Guangyu

2018-04-01

The real-time processing based on embedded system will enhance the application capability of stereo imaging for LiDAR and hyperspectral sensor. The task partitioning and scheduling strategies for embedded multiprocessor system starts relatively late, compared with that for PC computer. In this paper, aimed at embedded multi-core processing platform, a parallel model for stereo imaging is studied and verified. After analyzing the computing amount, throughout capacity and buffering requirements, a two-stage pipeline parallel model based on message transmission is established. This model can be applied to fast stereo imaging for airborne sensors with various characteristics. To demonstrate the feasibility and effectiveness of the parallel model, a parallel software was designed using test flight data, based on the 8-core DSP processor TMS320C6678. The results indicate that the design performed well in workload distribution and had a speed-up ratio up to 6.4.
Parallel Hough Transform-Based Straight Line Detection and Its FPGA Implementation in Embedded Vision

PubMed Central

Lu, Xiaofeng; Song, Li; Shen, Sumin; He, Kang; Yu, Songyu; Ling, Nam

2013-01-01

Hough Transform has been widely used for straight line detection in low-definition and still images, but it suffers from execution time and resource requirements. Field Programmable Gate Arrays (FPGA) provide a competitive alternative for hardware acceleration to reap tremendous computing performance. In this paper, we propose a novel parallel Hough Transform (PHT) and FPGA architecture-associated framework for real-time straight line detection in high-definition videos. A resource-optimized Canny edge detection method with enhanced non-maximum suppression conditions is presented to suppress most possible false edges and obtain more accurate candidate edge pixels for subsequent accelerated computation. Then, a novel PHT algorithm exploiting spatial angle-level parallelism is proposed to upgrade computational accuracy by improving the minimum computational step. Moreover, the FPGA based multi-level pipelined PHT architecture optimized by spatial parallelism ensures real-time computation for 1,024 × 768 resolution videos without any off-chip memory consumption. This framework is evaluated on ALTERA DE2-115 FPGA evaluation platform at a maximum frequency of 200 MHz, and it can calculate straight line parameters in 15.59 ms on the average for one frame. Qualitative and quantitative evaluation results have validated the system performance regarding data throughput, memory bandwidth, resource, speed and robustness. PMID:23867746
Parallel Hough Transform-based straight line detection and its FPGA implementation in embedded vision.

PubMed

Lu, Xiaofeng; Song, Li; Shen, Sumin; He, Kang; Yu, Songyu; Ling, Nam

2013-07-17

Hough Transform has been widely used for straight line detection in low-definition and still images, but it suffers from execution time and resource requirements. Field Programmable Gate Arrays (FPGA) provide a competitive alternative for hardware acceleration to reap tremendous computing performance. In this paper, we propose a novel parallel Hough Transform (PHT) and FPGA architecture-associated framework for real-time straight line detection in high-definition videos. A resource-optimized Canny edge detection method with enhanced non-maximum suppression conditions is presented to suppress most possible false edges and obtain more accurate candidate edge pixels for subsequent accelerated computation. Then, a novel PHT algorithm exploiting spatial angle-level parallelism is proposed to upgrade computational accuracy by improving the minimum computational step. Moreover, the FPGA based multi-level pipelined PHT architecture optimized by spatial parallelism ensures real-time computation for 1,024 × 768 resolution videos without any off-chip memory consumption. This framework is evaluated on ALTERA DE2-115 FPGA evaluation platform at a maximum frequency of 200 MHz, and it can calculate straight line parameters in 15.59 ms on the average for one frame. Qualitative and quantitative evaluation results have validated the system performance regarding data throughput, memory bandwidth, resource, speed and robustness.
Hybrid Parallel Contour Trees, Version 1.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sewell, Christopher; Fasel, Patricia; Carr, Hamish

A common operation in scientific visualization is to compute and render a contour of a data set. Given a function of the form f : R^d -> R, a level set is defined as an inverse image f^-1(h) for an isovalue h, and a contour is a single connected component of a level set. The Reeb graph can then be defined to be the result of contracting each contour to a single point, and is well defined for Euclidean spaces or for general manifolds. For simple domains, the graph is guaranteed to be a tree, and is called the contourmore » tree. Analysis can then be performed on the contour tree in order to identify isovalues of particular interest, based on various metrics, and render the corresponding contours, without having to know such isovalues a priori. This code is intended to be the first data-parallel algorithm for computing contour trees. Our implementation will use the portable data-parallel primitives provided by Nvidia’s Thrust library, allowing us to compile our same code for both GPUs and multi-core CPUs. Native OpenMP and purely serial versions of the code will likely also be included. It will also be extended to provide a hybrid data-parallel / distributed algorithm, allowing scaling beyond a single GPU or CPU.« less
A General Sparse Tensor Framework for Electronic Structure Theory

DOE PAGES

Manzer, Samuel; Epifanovsky, Evgeny; Krylov, Anna I.; ...

2017-01-24

Linear-scaling algorithms must be developed in order to extend the domain of applicability of electronic structure theory to molecules of any desired size. But, the increasing complexity of modern linear-scaling methods makes code development and maintenance a significant challenge. A major contributor to this difficulty is the lack of robust software abstractions for handling block-sparse tensor operations. We therefore report the development of a highly efficient symbolic block-sparse tensor library in order to provide access to high-level software constructs to treat such problems. Our implementation supports arbitrary multi-dimensional sparsity in all input and output tensors. We then avoid cumbersome machine-generatedmore » code by implementing all functionality as a high-level symbolic C++ language library and demonstrate that our implementation attains very high performance for linear-scaling sparse tensor contractions.« less
Multi-Dimensional, Mesoscopic Monte Carlo Simulations of Inhomogeneous Reaction-Drift-Diffusion Systems on Graphics-Processing Units

PubMed Central

Vigelius, Matthias; Meyer, Bernd

2012-01-01

For many biological applications, a macroscopic (deterministic) treatment of reaction-drift-diffusion systems is insufficient. Instead, one has to properly handle the stochastic nature of the problem and generate true sample paths of the underlying probability distribution. Unfortunately, stochastic algorithms are computationally expensive and, in most cases, the large number of participating particles renders the relevant parameter regimes inaccessible. In an attempt to address this problem we present a genuine stochastic, multi-dimensional algorithm that solves the inhomogeneous, non-linear, drift-diffusion problem on a mesoscopic level. Our method improves on existing implementations in being multi-dimensional and handling inhomogeneous drift and diffusion. The algorithm is well suited for an implementation on data-parallel hardware architectures such as general-purpose graphics processing units (GPUs). We integrate the method into an operator-splitting approach that decouples chemical reactions from the spatial evolution. We demonstrate the validity and applicability of our algorithm with a comprehensive suite of standard test problems that also serve to quantify the numerical accuracy of the method. We provide a freely available, fully functional GPU implementation. Integration into Inchman, a user-friendly web service, that allows researchers to perform parallel simulations of reaction-drift-diffusion systems on GPU clusters is underway. PMID:22506001
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Using a source-to-source transformation to introduce multi-threading into the AliRoot framework for a parallel event reconstruction

NASA Astrophysics Data System (ADS)

Lohn, Stefan B.; Dong, Xin; Carminati, Federico

2012-12-01

Chip-Multiprocessors are going to support massive parallelism by many additional physical and logical cores. Improving performance can no longer be obtained by increasing clock-frequency because the technical limits are almost reached. Instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, bandwidth of the memory bus or links between cores and sockets are not going to be improved as fast. Hence, parallelism can only result into performance gains if the memory usage is optimized and the communication between threads is minimized. Besides concurrent programming has become a domain for experts. Implementing multi-threading is error prone and labor-intensive. A full reimplementation of the whole AliRoot source-code is unaffordable. This paper describes the effort to evaluate the adaption of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation to address the problems as described before and to provide a straight-forward way of parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread-safety will be introduced to bring the original sequential and thread unaware source-code into the position of utilizing multi-threading. Afterwards further investigations have to be performed to point out candidates of classes that are useful to share amongst threads. Then in a second step, the transformation has to change the code to share these classes and finally to verify if there are anymore invalid interferences between threads.
Multi-Objective and Multidisciplinary Design Optimisation (MDO) of UAV Systems using Hierarchical Asynchronous Parallel Evolutionary Algorithms

DTIC Science & Technology

2007-09-17

been proposed; these include a combination of variable fidelity models, parallelisation strategies and hybridisation techniques (Coello, Veldhuizen et...Coello et al (Coello, Veldhuizen et al. 2002). 4.4.2 HIERARCHICAL POPULATION TOPOLOGY A hierarchical population topology, when integrated into...to hybrid parallel Multi-Objective Evolutionary Algorithms (pMOEA) (Cantu-Paz 2000; Veldhuizen , Zydallis et al. 2003); it uses a master slave
Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing

NASA Astrophysics Data System (ADS)

Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E.

2013-05-01

NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing datasets on our own nodes and in the Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer. We will also present a concept/prototype for staging NASA's A-Train Atmospheric datasets (Levels 2 & 3) in the Amazon Cloud so that any number of compute jobs can be executed "near" the multi-sensor data. Given such a system, multi-sensor climate studies over 10-20 years of data could be performed in an efficient way, with the researcher paying only his own Cloud compute bill.; Figure 1 -- Architecture.

Large-Scale, Parallel, Multi-Sensor Atmospheric Data Fusion Using Cloud Computing

NASA Astrophysics Data System (ADS)

Wilson, B. D.; Manipon, G.; Hua, H.; Fetzer, E. J.

2013-12-01

NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the 'A-Train' platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (MERRA), stratify the comparisons using a classification of the 'cloud scenes' from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Figure 1 shows the architecture of the full computational system, with SciReduce at the core. Multi-year datasets are automatically 'sharded' by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the achieved 'clock time' speedups in fusing datasets on our own compute nodes and in the public Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer. We will also present a concept/prototype for staging NASA's A-Train Atmospheric datasets (Levels 2 & 3) in the Amazon Cloud so that any number of compute jobs can be executed 'near' the multi-sensor data. Given such a system, multi-sensor climate studies over 10-20 years of data could be performed in an efficient way, with the researcher paying only his own Cloud compute bill. SciReduce Architecture
Optimal Golomb Ruler Sequences Generation for Optical WDM Systems: A Novel Parallel Hybrid Multi-objective Bat Algorithm

NASA Astrophysics Data System (ADS)

Bansal, Shonak; Singh, Arun Kumar; Gupta, Neena

2017-02-01

In real-life, multi-objective engineering design problems are very tough and time consuming optimization problems due to their high degree of nonlinearities, complexities and inhomogeneity. Nature-inspired based multi-objective optimization algorithms are now becoming popular for solving multi-objective engineering design problems. This paper proposes original multi-objective Bat algorithm (MOBA) and its extended form, namely, novel parallel hybrid multi-objective Bat algorithm (PHMOBA) to generate shortest length Golomb ruler called optimal Golomb ruler (OGR) sequences at a reasonable computation time. The OGRs found their application in optical wavelength division multiplexing (WDM) systems as channel-allocation algorithm to reduce the four-wave mixing (FWM) crosstalk. The performances of both the proposed algorithms to generate OGRs as optical WDM channel-allocation is compared with other existing classical computing and nature-inspired algorithms, including extended quadratic congruence (EQC), search algorithm (SA), genetic algorithms (GAs), biogeography based optimization (BBO) and big bang-big crunch (BB-BC) optimization algorithms. Simulations conclude that the proposed parallel hybrid multi-objective Bat algorithm works efficiently as compared to original multi-objective Bat algorithm and other existing algorithms to generate OGRs for optical WDM systems. The algorithm PHMOBA to generate OGRs, has higher convergence and success rate than original MOBA. The efficiency improvement of proposed PHMOBA to generate OGRs up to 20-marks, in terms of ruler length and total optical channel bandwidth (TBW) is 100 %, whereas for original MOBA is 85 %. Finally the implications for further research are also discussed.
Neural simulations on multi-core architectures.

PubMed

Eichner, Hubert; Klug, Tobias; Borst, Alexander

2009-01-01

Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing.
Neural Simulations on Multi-Core Architectures

PubMed Central

Eichner, Hubert; Klug, Tobias; Borst, Alexander

2009-01-01

Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing. PMID:19636393
Multi-Probe SPM using Interference Patterns for a Parallel Nano Imaging

NASA Astrophysics Data System (ADS)

Koyama, Hirotaka; Oohira, Fumikazu; Hosogi, Maho; Hashiguchi, Gen

This paper proposes a new composition of the multi-probe using optical interference patterns for a parallel nano imaging in a large area scanning. We achieved large-scale integration with 50,000 probes fabricated with MEMS technology, and measured the optical interference patterns with CCD, which was difficult in a conventional single scanning probe. In this research, the multi-probes are made of Si3N4 by MEMS process, and, the multi-probes are joined with a Pyrex glass by an anodic bonding. We designed, fabricated, and evaluated the characteristics of the probe. In addition, we changed the probe shape to decrease the warpage of the Si3N4 probe. We used the supercritical drying to avoid stiction of the Si3N4 probe with the glass surface and fabricated 4 types of the probe shapes without stiction. We took some interference patterns by CCD and measured the position of them. We calculate the probe height using the interference displacement and compared the result with the theoretical deflection curve. As a result, these interference patterns matched the theoretical deflection curve. We found that this multi-probe chip using interference patterns is effective in measurement for a parallel nano imaging.
High Performance Compression of Science Data

NASA Technical Reports Server (NTRS)

Storer, James A.; Carpentieri, Bruno; Cohn, Martin

1994-01-01

Two papers make up the body of this report. One presents a single-pass adaptive vector quantization algorithm that learns a codebook of variable size and shape entries; the authors present experiments on a set of test images showing that with no training or prior knowledge of the data, for a given fidelity, the compression achieved typically equals or exceeds that of the JPEG standard. The second paper addresses motion compensation, one of the most effective techniques used in interframe data compression. A parallel block-matching algorithm for estimating interframe displacement of blocks with minimum error is presented. The algorithm is designed for a simple parallel architecture to process video in real time.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms

PubMed Central

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.

2011-01-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
Scalable, High-performance 3D Imaging Software Platform: System Architecture and Application to Virtual Colonoscopy

PubMed Central

Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli; Brett, Bevin

2013-01-01

One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. In this work, we have developed a software platform that is designed to support high-performance 3D medical image processing for a wide range of applications using increasingly available and affordable commodity computing systems: multi-core, clusters, and cloud computing systems. To achieve scalable, high-performance computing, our platform (1) employs size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D image processing algorithms; (2) supports task scheduling for efficient load distribution and balancing; and (3) consists of a layered parallel software libraries that allow a wide range of medical applications to share the same functionalities. We evaluated the performance of our platform by applying it to an electronic cleansing system in virtual colonoscopy, with initial experimental results showing a 10 times performance improvement on an 8-core workstation over the original sequential implementation of the system. PMID:23366803
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.

PubMed

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W

2012-06-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
Effective System for Automatic Bundle Block Adjustment and Ortho Image Generation from Multi Sensor Satellite Imagery

NASA Astrophysics Data System (ADS)

Akilan, A.; Nagasubramanian, V.; Chaudhry, A.; Reddy, D. Rajesh; Sudheer Reddy, D.; Usha Devi, R.; Tirupati, T.; Radhadevi, P. V.; Varadan, G.

2014-11-01

Block Adjustment is a technique for large area mapping for images obtained from different remote sensingsatellites.The challenge in this process is to handle huge number of satellite imageries from different sources with different resolution and accuracies at the system level. This paper explains a system with various tools and techniques to effectively handle the end-to-end chain in large area mapping and production with good level of automation and the provisions for intuitive analysis of final results in 3D and 2D environment. In addition, the interface for using open source ortho and DEM references viz., ETM, SRTM etc. and displaying ESRI shapes for the image foot-prints are explained. Rigorous theory, mathematical modelling, workflow automation and sophisticated software engineering tools are included to ensure high photogrammetric accuracy and productivity. Major building blocks like Georeferencing, Geo-capturing and Geo-Modelling tools included in the block adjustment solution are explained in this paper. To provide optimal bundle block adjustment solution with high precision results, the system has been optimized in many stages to exploit the full utilization of hardware resources. The robustness of the system is ensured by handling failure in automatic procedure and saving the process state in every stage for subsequent restoration from the point of interruption. The results obtained from various stages of the system are presented in the paper.
Improving operating room productivity via parallel anesthesia processing.

PubMed

Brown, Michael J; Subramanian, Arun; Curry, Timothy B; Kor, Daryl J; Moran, Steven L; Rohleder, Thomas R

2014-01-01

Parallel processing of regional anesthesia may improve operating room (OR) efficiency in patients undergoes upper extremity surgical procedures. The purpose of this paper is to evaluate whether performing regional anesthesia outside the OR in parallel increases total cases per day, improve efficiency and productivity. Data from all adult patients who underwent regional anesthesia as their primary anesthetic for upper extremity surgery over a one-year period were used to develop a simulation model. The model evaluated pure operating modes of regional anesthesia performed within and outside the OR in a parallel manner. The scenarios were used to evaluate how many surgeries could be completed in a standard work day (555 minutes) and assuming a standard three cases per day, what was the predicted end-of-day time overtime. Modeling results show that parallel processing of regional anesthesia increases the average cases per day for all surgeons included in the study. The average increase was 0.42 surgeries per day. Where it was assumed that three cases per day would be performed by all surgeons, the days going to overtime was reduced by 43 percent with parallel block. The overtime with parallel anesthesia was also projected to be 40 minutes less per day per surgeon. Key limitations include the assumption that all cases used regional anesthesia in the comparisons. Many days may have both regional and general anesthesia. Also, as a case study, single-center research may limit generalizability. Perioperative care providers should consider parallel administration of regional anesthesia where there is a desire to increase daily upper extremity surgical case capacity. Where there are sufficient resources to do parallel anesthesia processing, efficiency and productivity can be significantly improved. Simulation modeling can be an effective tool to show practice change effects at a system-wide level.
A high performance parallel algorithm for 1-D FFT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Agarwal, R.C.; Gustavson, F.G.; Zubair, M.

1994-12-31

In this paper the authors propose a parallel high performance FFT algorithm based on a multi-dimensional formulation. They use this to solve a commonly encountered FFT based kernel on a distributed memory parallel machine, the IBM scalable parallel system, SP1. The kernel requires a forward FFT computation of an input sequence, multiplication of the transformed data by a coefficient array, and finally an inverse FFT computation of the resultant data. They show that the multi-dimensional formulation helps in reducing the communication costs and also improves the single node performance by effectively utilizing the memory system of the node. They implementedmore » this kernel on the IBM SP1 and observed a performance of 1.25 GFLOPS on a 64-node machine.« less
A WENO-Limited, ADER-DT, Finite-Volume Scheme for Efficient, Robust, and Communication-Avoiding Multi-Dimensional Transport

DOE Office of Scientific and Technical Information (OSTI.GOV)

Norman, Matthew R

2014-01-01

The novel ADER-DT time discretization is applied to two-dimensional transport in a quadrature-free, WENO- and FCT-limited, Finite-Volume context. Emphasis is placed on (1) the serial and parallel computational properties of ADER-DT and this framework and (2) the flexibility of ADER-DT and this framework in efficiently balancing accuracy with other constraints important to transport applications. This study demonstrates a range of choices for the user when approaching their specific application while maintaining good parallel properties. In this method, genuine multi-dimensionality, single-step and single-stage time stepping, strict positivity, and a flexible range of limiting are all achieved with only one parallel synchronizationmore » and data exchange per time step. In terms of parallel data transfers per simulated time interval, this improves upon multi-stage time stepping and post-hoc filtering techniques such as hyperdiffusion. This method is evaluated with standard transport test cases over a range of limiting options to demonstrate quantitatively and qualitatively what a user should expect when employing this method in their application.« less
FANS-3D Users Guide (ESTEP Project ER 201031)

DTIC Science & Technology

2016-08-01

governing laminar and turbulent flows in body-fitted curvilinear grids. The code employs multi-block overset ( chimera ) grids, including fully matched...governing incompressible flow in body-fitted grids. The code allows for multi-block overset ( chimera ) grids, which can be fully matched, arbitrarily...interested reader may consult the Chimera Overset Structured Mesh-Interpolation Code (COSMIC) Users’ Manual (Chen, 2009). The input file used for
HTR8/SVneo cells display trophoblast progenitor cell-like characteristics indicative of self-renewal, repopulation activity, and expression of "stemness-" associated transcription factors.

PubMed

Weber, Maja; Knoefler, Ilka; Schleussner, Ekkehard; Markert, Udo R; Fitzgerald, Justine S

2013-01-01

JEG3 is a choriocarcinoma--and HTR8/SVneo a transformed extravillous trophoblast--cell line often used to model the physiologically invasive extravillous trophoblast. Past studies suggest that these cell lines possess some stem or progenitor cell characteristics. Aim was to study whether these cells fulfill minimum criteria used to identify stem-like (progenitor) cells. In summary, we found that the expression profile of HTR8/SVneo (CDX2+, NOTCH1+, SOX2+, NANOG+, and OCT-) is distinct from JEG3 (CDX2+ and NOTCH1+) as seen only in human-serum blocked immunocytochemistry. This correlates with HTR8/SVneo's self-renewal capacities, as made visible via spheroid formation and multi-passagability in hanging drops protocols paralleling those used to maintain embryoid bodies. JEG3 displayed only low propensity to form and reform spheroids. HTR8/SVneo spheroids migrated to cover and seemingly repopulate human chorionic villi during confrontation cultures with placental explants in hanging drops. We conclude that HTR8/SVneo spheroid cells possess progenitor cell traits that are probably attained through corruption of "stemness-" associated transcription factor networks. Furthermore, trophoblastic cells are highly prone to unspecific binding, which is resistant to conventional blocking methods, but which can be alleviated through blockage with human serum.
Simulation requirements for the Large Deployable Reflector (LDR)

NASA Technical Reports Server (NTRS)

Soosaar, K.

1984-01-01

Simulation tools for the large deployable reflector (LDR) are discussed. These tools are often the transfer function variety equations. However, transfer functions are inadequate to represent time-varying systems for multiple control systems with overlapping bandwidths characterized by multi-input, multi-output features. Frequency domain approaches are the useful design tools, but a full-up simulation is needed. Because of the need for a dedicated computer for high frequency multi degree of freedom components encountered, non-real time smulation is preferred. Large numerical analysis software programs are useful only to receive inputs and provide output to the next block, and should be kept out of the direct loop of simulation. The following blocks make up the simulation. The thermal model block is a classical heat transfer program. It is a non-steady state program. The quasistatic block deals with problems associated with rigid body control of reflector segments. The steady state block assembles data into equations of motion and dynamics. A differential raytrace is obtained to establish a change in wave aberrations. The observation scene is described. The focal plane module converts the photon intensity impinging on it into electron streams or into permanent film records.
Hierarchically Ordered Nanopatterns for Spatial Control of Biomolecules

PubMed Central

2015-01-01

The development and study of a benchtop, high-throughput, and inexpensive fabrication strategy to obtain hierarchical patterns of biomolecules with sub-50 nm resolution is presented. A diblock copolymer of polystyrene-b-poly(ethylene oxide), PS-b-PEO, is synthesized with biotin capping the PEO block and 4-bromostyrene copolymerized within the polystyrene block at 5 wt %. These two handles allow thin films of the block copolymer to be postfunctionalized with biotinylated biomolecules of interest and to obtain micropatterns of nanoscale-ordered films via photolithography. The design of this single polymer further allows access to two distinct superficial nanopatterns (lines and dots), where the PEO cylinders are oriented parallel or perpendicular to the substrate. Moreover, we present a strategy to obtain hierarchical mixed morphologies: a thin-film coating of cylinders both parallel and perpendicular to the substrate can be obtained by tuning the solvent annealing and irradiation conditions. PMID:25363506
Hierarchically Ordered Nanopatterns for Spatial Control of Biomolecules

DOE PAGES

Tran, Helen; Ronaldson, Kacey; Bailey, Nevette A.; ...

2014-11-04

We present the development and study of a benchtop, high-throughput, and inexpensive fabrication strategy to obtain hierarchical patterns of biomolecules with sub-50 nm resolution. A diblock copolymer of polystyrene-b-poly(ethylene oxide), PS-b-PEO, is synthesized with biotin capping the PEO block and 4-bromostyrene copolymerized within the polystyrene block at 5 wt %. These two handles allow thin films of the block copolymer to be postfunctionalized with biotinylated biomolecules of interest and to obtain micropatterns of nanoscale-ordered films via photolithography. The design of this single polymer further allows access to two distinct superficial nanopatterns (lines and dots), where the PEO cylinders are orientedmore » parallel or perpendicular to the substrate. Moreover, we present a strategy to obtain hierarchical mixed morphologies: a thin-film coating of cylinders both parallel and perpendicular to the substrate can be obtained by tuning the solvent annealing and irradiation conditions.« less
Parallel Work of CO2 Ejectors Installed in a Multi-Ejector Module of Refrigeration System

NASA Astrophysics Data System (ADS)

Bodys, Jakub; Palacz, Michal; Haida, Michal; Smolka, Jacek; Nowak, Andrzej J.; Banasiak, Krzysztof; Hafner, Armin

2016-09-01

A performance analysis on of fixed ejectors installed in a multi-ejector module in a CO2 refrigeration system is presented in this study. The serial and the parallel work of four fixed-geometry units that compose the multi-ejector pack was carried out. The executed numerical simulations were performed with the use of validated Homogeneous Equilibrium Model (HEM). The computational tool ejectorPL for typical transcritical parameters at the motive nozzle were used in all the tests. A wide range of the operating conditions for supermarket applications in three different European climate zones were taken into consideration. The obtained results present the high and stable performance of all the ejectors in the multi-ejector pack.
Parallel and Preemptable Dynamically Dimensioned Search Algorithms for Single and Multi-objective Optimization in Water Resources

NASA Astrophysics Data System (ADS)

Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.

2015-12-01

We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption functions to terminate simulation model runs early, prior to completely simulating the model calibration period for example, when intermediate results indicate the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model preemption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations as these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich as well as MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration and in many cases linear or near linear speedups are observed.

Algorithms for the automatic generation of 2-D structured multi-block grids

NASA Technical Reports Server (NTRS)

Schoenfeld, Thilo; Weinerfelt, Per; Jenssen, Carl B.

1995-01-01

Two different approaches to the fully automatic generation of structured multi-block grids in two dimensions are presented. The work aims to simplify the user interactivity necessary for the definition of a multiple block grid topology. The first approach is based on an advancing front method commonly used for the generation of unstructured grids. The original algorithm has been modified toward the generation of large quadrilateral elements. The second method is based on the divide-and-conquer paradigm with the global domain recursively partitioned into sub-domains. For either method each of the resulting blocks is then meshed using transfinite interpolation and elliptic smoothing. The applicability of these methods to practical problems is demonstrated for typical geometries of fluid dynamics.
Experiments with a Parallel Multi-Objective Evolutionary Algorithm for Scheduling

NASA Technical Reports Server (NTRS)

Brown, Matthew; Johnston, Mark D.

2013-01-01

Evolutionary multi-objective algorithms have great potential for scheduling in those situations where tradeoffs among competing objectives represent a key requirement. One challenge, however, is runtime performance, as a consequence of evolving not just a single schedule, but an entire population, while attempting to sample the Pareto frontier as accurately and uniformly as possible. The growing availability of multi-core processors in end user workstations, and even laptops, has raised the question of the extent to which such hardware can be used to speed up evolutionary algorithms. In this paper we report on early experiments in parallelizing a Generalized Differential Evolution (GDE) algorithm for scheduling long-range activities on NASA's Deep Space Network. Initial results show that significant speedups can be achieved, but that performance does not necessarily improve as more cores are utilized. We describe our preliminary results and some initial suggestions from parallelizing the GDE algorithm. Directions for future work are outlined.
Performance of multi-hop parallel free-space optical communication over gamma-gamma fading channel with pointing errors.

PubMed

Gao, Zhengguang; Liu, Hongzhan; Ma, Xiaoping; Lu, Wei

2016-11-10

Multi-hop parallel relaying is considered in a free-space optical (FSO) communication system deploying binary phase-shift keying (BPSK) modulation under the combined effects of a gamma-gamma (GG) distribution and misalignment fading. Based on the best path selection criterion, the cumulative distribution function (CDF) of this cooperative random variable is derived. Then the performance of this optical mesh network is analyzed in detail. A Monte Carlo simulation is also conducted to demonstrate the effectiveness of the results for the average bit error rate (ABER) and outage probability. The numerical result proves that it needs a smaller average transmitted optical power to achieve the same ABER and outage probability when using the multi-hop parallel network in FSO links. Furthermore, the system use of more number of hops and cooperative paths can improve the quality of the communication.
Multi-channel temperature measurement system for automotive battery stack

NASA Astrophysics Data System (ADS)

Lewczuk, Radoslaw; Wojtkowski, Wojciech

2017-08-01

A multi-channel temperature measurement system for monitoring of automotive battery stack is presented in the paper. The presented system is a complete battery temperature measuring system for hybrid / electric vehicles that incorporates multi-channel temperature measurements with digital temperature sensors communicating through 1-Wire buses, individual 1-Wire bus for each sensor for parallel computing (parallel measurements instead of sequential), FPGA device which collects data from sensors and translates it for CAN bus frames. CAN bus is incorporated for communication with car Battery Management System and uses additional CAN bus controller which communicates with FPGA device through SPI bus. The described system can parallel measure up to 12 temperatures but can be easily extended in the future in case of additional needs. The structure of the system as well as particular devices are described in the paper. Selected results of experimental investigations which show proper operation of the system are presented as well.
A Parallel, Multi-Scale Watershed-Hydrologic-Inundation Model with Adaptively Switching Mesh for Capturing Flooding and Lake Dynamics

NASA Astrophysics Data System (ADS)

Ji, X.; Shen, C.

2017-12-01

Flood inundation presents substantial societal hazards and also changes biogeochemistry for systems like the Amazon. It is often expensive to simulate high-resolution flood inundation and propagation in a long-term watershed-scale model. Due to the Courant-Friedrichs-Lewy (CFL) restriction, high resolution and large local flow velocity both demand prohibitively small time steps even for parallel codes. Here we develop a parallel surface-subsurface process-based model enhanced by multi-resolution meshes that are adaptively switched on or off. The high-resolution overland flow meshes are enabled only when the flood wave invades to floodplains. This model applies semi-implicit, semi-Lagrangian (SISL) scheme in solving dynamic wave equations, and with the assistant of the multi-mesh method, it also adaptively chooses the dynamic wave equation only in the area of deep inundation. Therefore, the model achieves a balance between accuracy and computational cost.
Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU

NASA Astrophysics Data System (ADS)

Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang

2017-10-01

Imaging spectrometer can gain two-dimensional space image and one-dimensional spectrum at the same time, which shows high utility in color and spectral measurements, the true color image synthesis, military reconnaissance and so on. In order to realize the fast reconstructed processing of the Fourier transform imaging spectrometer data, the paper designed the optimization reconstructed algorithm with OpenMP parallel calculating technology, which was further used for the optimization process for the HyperSpectral Imager of `HJ-1' Chinese satellite. The results show that the method based on multi-core parallel computing technology can control the multi-core CPU hardware resources competently and significantly enhance the calculation of the spectrum reconstruction processing efficiency. If the technology is applied to more cores workstation in parallel computing, it will be possible to complete Fourier transform imaging spectrometer real-time data processing with a single computer.
Node Resource Manager: A Distributed Computing Software Framework Used for Solving Geophysical Problems

NASA Astrophysics Data System (ADS)

Lawry, B. J.; Encarnacao, A.; Hipp, J. R.; Chang, M.; Young, C. J.

2011-12-01

With the rapid growth of multi-core computing hardware, it is now possible for scientific researchers to run complex, computationally intensive software on affordable, in-house commodity hardware. Multi-core CPUs (Central Processing Unit) and GPUs (Graphics Processing Unit) are now commonplace in desktops and servers. Developers today have access to extremely powerful hardware that enables the execution of software that could previously only be run on expensive, massively-parallel systems. It is no longer cost-prohibitive for an institution to build a parallel computing cluster consisting of commodity multi-core servers. In recent years, our research team has developed a distributed, multi-core computing system and used it to construct global 3D earth models using seismic tomography. Traditionally, computational limitations forced certain assumptions and shortcuts in the calculation of tomographic models; however, with the recent rapid growth in computational hardware including faster CPU's, increased RAM, and the development of multi-core computers, we are now able to perform seismic tomography, 3D ray tracing and seismic event location using distributed parallel algorithms running on commodity hardware, thereby eliminating the need for many of these shortcuts. We describe Node Resource Manager (NRM), a system we developed that leverages the capabilities of a parallel computing cluster. NRM is a software-based parallel computing management framework that works in tandem with the Java Parallel Processing Framework (JPPF, http://www.jppf.org/), a third party library that provides a flexible and innovative way to take advantage of modern multi-core hardware. NRM enables multiple applications to use and share a common set of networked computers, regardless of their hardware platform or operating system. Using NRM, algorithms can be parallelized to run on multiple processing cores of a distributed computing cluster of servers and desktops, which results in a dramatic speedup in execution time. NRM is sufficiently generic to support applications in any domain, as long as the application is parallelizable (i.e., can be subdivided into multiple individual processing tasks). At present, NRM has been effective in decreasing the overall runtime of several algorithms: 1) the generation of a global 3D model of the compressional velocity distribution in the Earth using tomographic inversion, 2) the calculation of the model resolution matrix, model covariance matrix, and travel time uncertainty for the aforementioned velocity model, and 3) the correlation of waveforms with archival data on a massive scale for seismic event detection. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Default Parallels Plesk Panel Page

Science.gov Websites

services that small businesses want and need. Our software includes key building blocks of cloud service virtualized servers Service Provider Products ParallelsÂ® Automation Hosting, SaaS, and cloud computing , the leading hosting automation software. You see this page because there is no Web site at this
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
Creation of parallel algorithms for the solution of problems of gas dynamics on multi-core computers and GPU

NASA Astrophysics Data System (ADS)

Rybakin, B.; Bogatencov, P.; Secrieru, G.; Iliuha, N.

2013-10-01

The paper deals with a parallel algorithm for calculations on multiprocessor computers and GPU accelerators. The calculations of shock waves interaction with low-density bubble results and the problem of the gas flow with the forces of gravity are presented. This algorithm combines a possibility to capture a high resolution of shock waves, the second-order accuracy for TVD schemes, and a possibility to observe a low-level diffusion of the advection scheme. Many complex problems of continuum mechanics are numerically solved on structured or unstructured grids. To improve the accuracy of the calculations is necessary to choose a sufficiently small grid (with a small cell size). This leads to the drawback of a substantial increase of computation time. Therefore, for the calculations of complex problems it is reasonable to use the method of Adaptive Mesh Refinement. That is, the grid refinement is performed only in the areas of interest of the structure, where, e.g., the shock waves are generated, or a complex geometry or other such features exist. Thus, the computing time is greatly reduced. In addition, the execution of the application on the resulting sequence of nested, decreasing nets can be parallelized. Proposed algorithm is based on the AMR method. Utilization of AMR method can significantly improve the resolution of the difference grid in areas of high interest, and from other side to accelerate the processes of the multi-dimensional problems calculating. Parallel algorithms of the analyzed difference models realized for the purpose of calculations on graphic processors using the CUDA technology [1].
A dual-processor multi-frequency implementation of the FINDS algorithm

NASA Technical Reports Server (NTRS)

Godiwala, Pankaj M.; Caglayan, Alper K.

1987-01-01

This report presents a parallel processing implementation of the FINDS (Fault Inferring Nonlinear Detection System) algorithm on a dual processor configured target flight computer. First, a filter initialization scheme is presented which allows the no-fail filter (NFF) states to be initialized using the first iteration of the flight data. A modified failure isolation strategy, compatible with the new failure detection strategy reported earlier, is discussed and the performance of the new FDI algorithm is analyzed using flight recorded data from the NASA ATOPS B-737 aircraft in a Microwave Landing System (MLS) environment. The results show that low level MLS, IMU, and IAS sensor failures are detected and isolated instantaneously, while accelerometer and rate gyro failures continue to take comparatively longer to detect and isolate. The parallel implementation is accomplished by partitioning the FINDS algorithm into two parts: one based on the translational dynamics and the other based on the rotational kinematics. Finally, a multi-rate implementation of the algorithm is presented yielding significantly low execution times with acceptable estimation and FDI performance.
''A Parallel Adaptive Simulation Tool for Two Phase Steady State Reacting Flows in Industrial Boilers and Furnaces''

DOE Office of Scientific and Technical Information (OSTI.GOV)

Michael J. Bockelie

2002-01-04

This DOE SBIR Phase II final report summarizes research that has been performed to develop a parallel adaptive tool for modeling steady, two phase turbulent reacting flow. The target applications for the new tool are full scale, fossil-fuel fired boilers and furnaces such as those used in the electric utility industry, chemical process industry and mineral/metal process industry. The type of analyses to be performed on these systems are engineering calculations to evaluate the impact on overall furnace performance due to operational, process or equipment changes. To develop a Computational Fluid Dynamics (CFD) model of an industrial scale furnace requiresmore » a carefully designed grid that will capture all of the large and small scale features of the flowfield. Industrial systems are quite large, usually measured in tens of feet, but contain numerous burners, air injection ports, flames and localized behavior with dimensions that are measured in inches or fractions of inches. To create an accurate computational model of such systems requires capturing length scales within the flow field that span several orders of magnitude. In addition, to create an industrially useful model, the grid can not contain too many grid points - the model must be able to execute on an inexpensive desktop PC in a matter of days. An adaptive mesh provides a convenient means to create a grid that can capture both fine flow field detail within a very large domain with a ''reasonable'' number of grid points. However, the use of an adaptive mesh requires the development of a new flow solver. To create the new simulation tool, we have combined existing reacting CFD modeling software with new software based on emerging block structured Adaptive Mesh Refinement (AMR) technologies developed at Lawrence Berkeley National Laboratory (LBNL). Specifically, we combined: -physical models, modeling expertise, and software from existing combustion simulation codes used by Reaction Engineering International; -mesh adaption, data management, and parallelization software and technology being developed by users of the BoxLib library at LBNL; and -solution methods for problems formulated on block structured grids that were being developed in collaboration with technical staff members at the University of Utah Center for High Performance Computing (CHPC) and at LBNL. The combustion modeling software used by Reaction Engineering International represents an investment of over fifty man-years of development, conducted over a period of twenty years. Thus, it was impractical to achieve our objective by starting from scratch. The research program resulted in an adaptive grid, reacting CFD flow solver that can be used only on limited problems. In current form the code is appropriate for use on academic problems with simplified geometries. The new solver is not sufficiently robust or sufficiently general to be used in a ''production mode'' for industrial applications. The principle difficulty lies with the multi-level solver technology. The use of multi-level solvers on adaptive grids with embedded boundaries is not yet a mature field and there are many issues that remain to be resolved. From the lessons learned in this SBIR program, we have started work on a new flow solver with an AMR capability. The new code is based on a conventional cell-by-cell mesh refinement strategy used in unstructured grid solvers that employ hexahedral cells. The new solver employs several of the concepts and solution strategies developed within this research program. The formulation of the composite grid problem for the new solver has been designed to avoid the embedded boundary complications encountered in this SBIR project. This follow-on effort will result in a reacting flow CFD solver with localized mesh capability that can be used to perform engineering calculations on industrial problems in a production mode.« less
Individual relocation decisions after tornadoes: a multi-level analysis.

PubMed

Cong, Zhen; Nejat, Ali; Liang, Daan; Pei, Yaolin; Javid, Roxana J

2018-04-01

This study examines how multi-level factors affected individuals' relocation decisions after EF4 and EF5 (Enhanced Fujita Tornado Intensity Scale) tornadoes struck the United States in 2013. A telephone survey was conducted with 536 respondents, including oversampled older adults, one year after these two disaster events. Respondents' addresses were used to associate individual information with block group-level variables recorded by the American Community Survey. Logistic regression revealed that residential damage and homeownership are important predictors of relocation. There was also significant interaction between these two variables, indicating less difference between homeowners and renters at higher damage levels. Homeownership diminished the likelihood of relocation among younger respondents. Random effects logistic regression found that the percentage of homeownership and of higher income households in the community buffered the effect of damage on relocation; the percentage of older adults reduced the likelihood of this group relocating. The findings are assessed from the standpoint of age difference, policy implications, and social capital and vulnerability. © 2018 The Author(s). Disasters © Overseas Development Institute, 2018.
Multi-scale analysis of the effect of nano-filler particle diameter on the physical properties of CAD/CAM composite resin blocks.

PubMed

Yamaguchi, Satoshi; Inoue, Sayuri; Sakai, Takahiko; Abe, Tomohiro; Kitagawa, Haruaki; Imazato, Satoshi

2017-05-01

The objective of this study was to assess the effect of silica nano-filler particle diameters in a computer-aided design/manufacturing (CAD/CAM) composite resin (CR) block on physical properties at the multi-scale in silico. CAD/CAM CR blocks were modeled, consisting of silica nano-filler particles (20, 40, 60, 80, and 100 nm) and matrix (Bis-GMA/TEGDMA), with filler volume contents of 55.161%. Calculation of Young's moduli and Poisson's ratios for the block at macro-scale were analyzed by homogenization. Macro-scale CAD/CAM CR blocks (3 × 3 × 3 mm) were modeled and compressive strengths were defined when the fracture loads exceeded 6075 N. MPS values of the nano-scale models were compared by localization analysis. As the filler size decreased, Young's moduli and compressive strength increased, while Poisson's ratios and MPS decreased. All parameters were significantly correlated with the diameters of the filler particles (Pearson's correlation test, r = -0.949, 0.943, -0.951, 0.976, p < 0.05). The in silico multi-scale model established in this study demonstrates that the Young's moduli, Poisson's ratios, and compressive strengths of CAD/CAM CR blocks can be enhanced by loading silica nanofiller particles of smaller diameter. CAD/CAM CR blocks by using smaller silica nano-filler particles have a potential to increase fracture resistance.
Hybrid MPI+OpenMP Programming of an Overset CFD Solver and Performance Investigations

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Jin, Haoqiang H.; Biegel, Bryan (Technical Monitor)

2002-01-01

This report describes a two level parallelization of a Computational Fluid Dynamic (CFD) solver with multi-zone overset structured grids. The approach is based on a hybrid MPI+OpenMP programming model suitable for shared memory and clusters of shared memory machines. The performance investigations of the hybrid application on an SGI Origin2000 (O2K) machine is reported using medium and large scale test problems.
Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

PubMed

Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas

2015-01-01

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
Large-Scale, Multi-Sensor Atmospheric Data Fusion Using Hybrid Cloud Computing

NASA Astrophysics Data System (ADS)

Wilson, Brian; Manipon, Gerald; Hua, Hook; Fetzer, Eric

2014-05-01

NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over decades. Moving to multi-sensor, long-duration analyses of important climate variables presents serious challenges for large-scale data mining and fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over 10 years of data. To efficiently assemble such datasets, we are utilizing Elastic Computing in the Cloud and parallel map-reduce-based algorithms. However, these problems are Data Intensive computing so the data transfer times and storage costs (for caching) are key issues. SciReduce is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in a hybrid Cloud (private eucalyptus & public Amazon). Unlike Hadoop, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Multi-year datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of files) can be processed in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP URLs or other subsetting services, thereby minimizing the size of the cached input and intermediate datasets. We are using SciReduce to automate the production of multiple versions of a ten-year A-Train water vapor climatology under a NASA MEASURES grant. We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing datasets on our own nodes and in the Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer. We will also present a concept and prototype for staging NASA's A-Train Atmospheric datasets (Levels 2 & 3) in the Amazon Cloud so that any number of compute jobs can be executed "near" the multi-sensor data. Given such a system, multi-sensor climate studies over 10-20 years of data could be perform
Enabling Object Storage via shims for Grid Middleware

NASA Astrophysics Data System (ADS)

Cadellin Skipsey, Samuel; De Witt, Shaun; Dewhurst, Alastair; Britton, David; Roy, Gareth; Crooks, David

2015-12-01

The Object Store model has quickly become the basis of most commercially successful mass storage infrastructure, backing so-called ”Cloud” storage such as Amazon S3, but also underlying the implementation of most parallel distributed storage systems. Many of the assumptions in Object Store design are similar, but not identical, to concepts in the design of Grid Storage Elements, although the requirement for ”POSIX-like” filesystem structures on top of SEs makes the disjunction seem larger. As modern Object Stores provide many features that most Grid SEs do not (block level striping, parallel access, automatic file repair, etc.), it is of interest to see how easily we can provide interfaces to typical Object Stores via plugins and shims for Grid tools, and how well experiments can adapt their data models to them. We present evaluation of, and first-deployment experiences with, (for example) Xrootd-Ceph interfaces for direct object-store access, as part of an initiative within GridPP[1] hosted at RAL. Additionally, we discuss the tradeoffs and experience of developing plugins for the currently-popular Ceph parallel distributed filesystem for the GFAL2 access layer, at Glasgow.
Atmospheric Blocking and Atlantic Multi-Decadal Ocean Variability

NASA Technical Reports Server (NTRS)

Haekkinen, Sirpa; Rhines, Peter B.; Worthlen, Denise L.

2011-01-01

Based on the 20th century atmospheric reanalysis, winters with more frequent blocking, in a band of blocked latitudes from Greenland to Western Europe, are found to persist over several decades and correspond to a warm North Atlantic Ocean, in-phase with Atlantic multi-decadal ocean variability. Atmospheric blocking over the northern North Atlantic, which involves isolation of large regions of air from the westerly circulation for 5 days or more, influences fundamentally the ocean circulation and upper ocean properties by impacting wind patterns. Winters with clusters of more frequent blocking between Greenland and western Europe correspond to a warmer, more saline subpolar ocean. The correspondence between blocked westerly winds and warm ocean holds in recent decadal episodes (especially, 1996-2010). It also describes much longer-timescale Atlantic multidecadal ocean variability (AMV), including the extreme, pre-greenhouse-gas, northern warming of the 1930s-1960s. The space-time structure of the wind forcing associated with a blocked regime leads to weaker ocean gyres and weaker heat-exchange, both of which contribute to the warm phase of AMV.
Power deposition on misaligned castellated tungsten blocks in the Magnum-PSI and Pilot-PSI linear devices

NASA Astrophysics Data System (ADS)

Morgan, T. W.; van den Berg, M. A.; De Temmerman, G.; Bardin, S.; Aussems, D. U. B.; Pitts, R. A.

2017-12-01

For the final design of the ITER divertor it is important to determine whether shaping of each tungsten monoblock to eliminate leading edges is required or not. In order to aid this decision, two experiments were performed in DIFFER’s linear plasma devices to study heat loads on misaligned water cooled blocks at glancing incidence. First, a series of tungsten blocks were exposed to a high parallel heat flux (26 MW \

Hybrid Parallelization of Adaptive MHD-Kinetic Module in Multi-Scale Fluid-Kinetic Simulation Suite

DOE PAGES

Borovikov, Sergey; Heerikhuisen, Jacob; Pogorelov, Nikolai

2013-04-01

The Multi-Scale Fluid-Kinetic Simulation Suite has a computational tool set for solving partially ionized flows. In this paper we focus on recent developments of the kinetic module which solves the Boltzmann equation using the Monte-Carlo method. The module has been recently redesigned to utilize intra-node hybrid parallelization. We describe in detail the redesign process, implementation issues, and modifications made to the code. Finally, we conduct a performance analysis.
High performance compression of science data

NASA Technical Reports Server (NTRS)

Storer, James A.; Cohn, Martin

1994-01-01

Two papers make up the body of this report. One presents a single-pass adaptive vector quantization algorithm that learns a codebook of variable size and shape entries; the authors present experiments on a set of test images showing that with no training or prior knowledge of the data, for a given fidelity, the compression achieved typically equals or exceeds that of the JPEG standard. The second paper addresses motion compensation, one of the most effective techniques used in the interframe data compression. A parallel block-matching algorithm for estimating interframe displacement of blocks with minimum error is presented. The algorithm is designed for a simple parallel architecture to process video in real time.
ION PRODUCING MECHANISM

DOEpatents

MacKenzie, K.R.

1958-09-01

An ion source is described for use in a calutron and more particularly deals with an improved filament arrangement for a calutron. According to the invention, the ion source block has a gas ionizing passage open along two adjoining sides of the block. A filament is disposed in overlying relation to one of the passage openings and has a greater width than the passage width, so that both the filament and opening lengths are parallel and extend in a transverse relation to the magnetic field. The other passage opening is parallel to the length of the magnetic field. This arrangement is effective in assisting in the production of a stable, long-lived arc for the general improvement of calutron operation.
A fast ultrasonic simulation tool based on massively parallel implementations

NASA Astrophysics Data System (ADS)

Lambert, Jason; Rougeron, Gilles; Lacassagne, Lionel; Chatillon, Sylvain

2014-02-01

This paper presents a CIVA optimized ultrasonic inspection simulation tool, which takes benefit of the power of massively parallel architectures: graphical processing units (GPU) and multi-core general purpose processors (GPP). This tool is based on the classical approach used in CIVA: the interaction model is based on Kirchoff, and the ultrasonic field around the defect is computed by the pencil method. The model has been adapted and parallelized for both architectures. At this stage, the configurations addressed by the tool are : multi and mono-element probes, planar specimens made of simple isotropic materials, planar rectangular defects or side drilled holes of small diameter. Validations on the model accuracy and performances measurements are presented.
Multi-Resolution Climate Ensemble Parameter Analysis with Nested Parallel Coordinates Plots.

PubMed

Wang, Junpeng; Liu, Xiaotong; Shen, Han-Wei; Lin, Guang

2017-01-01

Due to the uncertain nature of weather prediction, climate simulations are usually performed multiple times with different spatial resolutions. The outputs of simulations are multi-resolution spatial temporal ensembles. Each simulation run uses a unique set of values for multiple convective parameters. Distinct parameter settings from different simulation runs in different resolutions constitute a multi-resolution high-dimensional parameter space. Understanding the correlation between the different convective parameters, and establishing a connection between the parameter settings and the ensemble outputs are crucial to domain scientists. The multi-resolution high-dimensional parameter space, however, presents a unique challenge to the existing correlation visualization techniques. We present Nested Parallel Coordinates Plot (NPCP), a new type of parallel coordinates plots that enables visualization of intra-resolution and inter-resolution parameter correlations. With flexible user control, NPCP integrates superimposition, juxtaposition and explicit encodings in a single view for comparative data visualization and analysis. We develop an integrated visual analytics system to help domain scientists understand the connection between multi-resolution convective parameters and the large spatial temporal ensembles. Our system presents intricate climate ensembles with a comprehensive overview and on-demand geographic details. We demonstrate NPCP, along with the climate ensemble visualization system, based on real-world use-cases from our collaborators in computational and predictive science.
Scalable domain decomposition solvers for stochastic PDEs in high performance computing

DOE PAGES

Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...

2017-09-21

Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Scalable domain decomposition solvers for stochastic PDEs in high performance computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Desai, Ajit; Khalil, Mohammad; Pettit, Chris

Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Parallel and Efficient Sensitivity Analysis of Microscopy Image Segmentation Workflows in Hybrid Systems

PubMed Central

Barreiros, Willian; Teodoro, George; Kurc, Tahsin; Kong, Jun; Melo, Alba C. M. A.; Saltz, Joel

2017-01-01

We investigate efficient sensitivity analysis (SA) of algorithms that segment and classify image features in a large dataset of high-resolution images. Algorithm SA is the process of evaluating variations of methods and parameter values to quantify differences in the output. A SA can be very compute demanding because it requires re-processing the input dataset several times with different parameters to assess variations in output. In this work, we introduce strategies to efficiently speed up SA via runtime optimizations targeting distributed hybrid systems and reuse of computations from runs with different parameters. We evaluate our approach using a cancer image analysis workflow on a hybrid cluster with 256 nodes, each with an Intel Phi and a dual socket CPU. The SA attained a parallel efficiency of over 90% on 256 nodes. The cooperative execution using the CPUs and the Phi available in each node with smart task assignment strategies resulted in an additional speedup of about 2×. Finally, multi-level computation reuse lead to an additional speedup of up to 2.46× on the parallel version. The level of performance attained with the proposed optimizations will allow the use of SA in large-scale studies. PMID:29081725
Additive Manufacturing of Molds for Fabrication of Insulated Concrete Block

DOE Office of Scientific and Technical Information (OSTI.GOV)

Love, Lonnie J.; Lloyd, Peter D.

ORNL worked with concrete block manufacturer, NRG Insulated Block, to demonstrate additive manufacturing of a multi-component block mold for its line of insulated blocks. Solid models of the mold parts were constructed from existing two-dimensional drawings and the parts were fabricated on a Stratasys Fortus 900 using ULTEM 9085. Block mold parts were delivered to NRG and installed on one of their fabrication lines. While form and fit were acceptable, the molds failed to function during NRG’s testing.
Fatigue Life Estimation under Cumulative Cyclic Loading Conditions

NASA Technical Reports Server (NTRS)

Kalluri, Sreeramesh; McGaw, Michael A; Halford, Gary R.

1999-01-01

The cumulative fatigue behavior of a cobalt-base superalloy, Haynes 188 was investigated at 760 C in air. Initially strain-controlled tests were conducted on solid cylindrical gauge section specimens of Haynes 188 under fully-reversed, tensile and compressive mean strain-controlled fatigue tests. Fatigue data from these tests were used to establish the baseline fatigue behavior of the alloy with 1) a total strain range type fatigue life relation and 2) the Smith-Wastson-Topper (SWT) parameter. Subsequently, two load-level multi-block fatigue tests were conducted on similar specimens of Haynes 188 at the same temperature. Fatigue lives of the multi-block tests were estimated with 1) the Linear Damage Rule (LDR) and 2) the nonlinear Damage Curve Approach (DCA) both with and without the consideration of mean stresses generated during the cumulative fatigue tests. Fatigue life predictions by the nonlinear DCA were much closer to the experimentally observed lives than those obtained by the LDR. In the presence of mean stresses, the SWT parameter estimated the fatigue lives more accurately under tensile conditions than under compressive conditions.
Predictors of elevational biodiversity gradients change from single taxa to the multi-taxa community level

PubMed Central

Peters, Marcell K.; Hemp, Andreas; Appelhans, Tim; Behler, Christina; Classen, Alice; Detsch, Florian; Ensslin, Andreas; Ferger, Stefan W.; Frederiksen, Sara B.; Gebert, Friederike; Haas, Michael; Helbig-Bonitz, Maria; Hemp, Claudia; Kindeketa, William J.; Mwangomo, Ephraim; Ngereza, Christine; Otte, Insa; Röder, Juliane; Rutten, Gemma; Schellenberger Costa, David; Tardanico, Joseph; Zancolli, Giulia; Deckert, Jürgen; Eardley, Connal D.; Peters, Ralph S.; Rödel, Mark-Oliver; Schleuning, Matthias; Ssymank, Axel; Kakengi, Victor; Zhang, Jie; Böhning-Gaese, Katrin; Brandl, Roland; Kalko, Elisabeth K.V.; Kleyer, Michael; Nauss, Thomas; Tschapka, Marco; Fischer, Markus; Steffan-Dewenter, Ingolf

2016-01-01

The factors determining gradients of biodiversity are a fundamental yet unresolved topic in ecology. While diversity gradients have been analysed for numerous single taxa, progress towards general explanatory models has been hampered by limitations in the phylogenetic coverage of past studies. By parallel sampling of 25 major plant and animal taxa along a 3.7 km elevational gradient on Mt. Kilimanjaro, we quantify cross-taxon consensus in diversity gradients and evaluate predictors of diversity from single taxa to a multi-taxa community level. While single taxa show complex distribution patterns and respond to different environmental factors, scaling up diversity to the community level leads to an unambiguous support for temperature as the main predictor of species richness in both plants and animals. Our findings illuminate the influence of taxonomic coverage for models of diversity gradients and point to the importance of temperature for diversification and species coexistence in plant and animal communities. PMID:28004657
Compressive Video Recovery Using Block Match Multi-Frame Motion Estimation Based on Single Pixel Cameras

PubMed Central

Bi, Sheng; Zeng, Xiao; Tang, Xin; Qin, Shujia; Lai, King Wai Chiu

2016-01-01

Compressive sensing (CS) theory has opened up new paths for the development of signal processing applications. Based on this theory, a novel single pixel camera architecture has been introduced to overcome the current limitations and challenges of traditional focal plane arrays. However, video quality based on this method is limited by existing acquisition and recovery methods, and the method also suffers from being time-consuming. In this paper, a multi-frame motion estimation algorithm is proposed in CS video to enhance the video quality. The proposed algorithm uses multiple frames to implement motion estimation. Experimental results show that using multi-frame motion estimation can improve the quality of recovered videos. To further reduce the motion estimation time, a block match algorithm is used to process motion estimation. Experiments demonstrate that using the block match algorithm can reduce motion estimation time by 30%. PMID:26950127
Scheduling optimization of design stream line for production research and development projects

NASA Astrophysics Data System (ADS)

Liu, Qinming; Geng, Xiuli; Dong, Ming; Lv, Wenyuan; Ye, Chunming

2017-05-01

In a development project, efficient design stream line scheduling is difficult and important owing to large design imprecision and the differences in the skills and skill levels of employees. The relative skill levels of employees are denoted as fuzzy numbers. Multiple execution modes are generated by scheduling different employees for design tasks. An optimization model of a design stream line scheduling problem is proposed with the constraints of multiple executive modes, multi-skilled employees and precedence. The model considers the parallel design of multiple projects, different skills of employees, flexible multi-skilled employees and resource constraints. The objective function is to minimize the duration and tardiness of the project. Moreover, a two-dimensional particle swarm algorithm is used to find the optimal solution. To illustrate the validity of the proposed method, a case is examined in this article, and the results support the feasibility and effectiveness of the proposed model and algorithm.
Computational electromagnetics: the physics of smooth versus oscillatory fields.

PubMed

Chew, W C

2004-03-15

This paper starts by discussing the difference in the physics between solutions to Laplace's equation (static) and Maxwell's equations for dynamic problems (Helmholtz equation). Their differing physical characters are illustrated by how the two fields convey information away from their source point. The paper elucidates the fact that their differing physical characters affect the use of Laplacian field and Helmholtz field in imaging. They also affect the design of fast computational algorithms for electromagnetic scattering problems. Specifically, a comparison is made between fast algorithms developed using wavelets, the simple fast multipole method, and the multi-level fast multipole algorithm for electrodynamics. The impact of the physical characters of the dynamic field on the parallelization of the multi-level fast multipole algorithm is also discussed. The relationship of diagonalization of translators to group theory is presented. Finally, future areas of research for computational electromagnetics are described.
Fast l₁-SPIRiT compressed sensing parallel imaging MRI: scalable parallel implementation and clinically feasible runtime.

PubMed

Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

2012-06-01

We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

PubMed Central

Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

2012-01-01

We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
Fast Detection of Material Deformation through Structural Dissimilarity

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ushizima, Daniela; Perciano, Talita; Parkinson, Dilworth

2015-10-29

Designing materials that are resistant to extreme temperatures and brittleness relies on assessing structural dynamics of samples. Algorithms are critically important to characterize material deformation under stress conditions. Here, we report on our design of coarse-grain parallel algorithms for image quality assessment based on structural information and on crack detection of gigabyte-scale experimental datasets. We show how key steps can be decomposed into distinct processing flows, one based on structural similarity (SSIM) quality measure, and another on spectral content. These algorithms act upon image blocks that fit into memory, and can execute independently. We discuss the scientific relevance of themore » problem, key developments, and decomposition of complementary tasks into separate executions. We show how to apply SSIM to detect material degradation, and illustrate how this metric can be allied to spectral analysis for structure probing, while using tiled multi-resolution pyramids stored in HDF5 chunked multi-dimensional arrays. Results show that the proposed experimental data representation supports an average compression rate of 10X, and data compression scales linearly with the data size. We also illustrate how to correlate SSIM to crack formation, and how to use our numerical schemes to enable fast detection of deformation from 3D datasets evolving in time.« less
Deciding the liveness for a subclass of weighted Petri nets based on structurally circular wait

NASA Astrophysics Data System (ADS)

Liu, GuanJun; Chen, LiJing

2016-05-01

Weighted Petri nets as a kind of formal language are widely used to model and verify discrete event systems related to resource allocation like flexible manufacturing systems. System of Simple Sequential Processes with Multi-Resources (S3PMR, a subclass of weighted Petri nets and an important extension to the well-known System of Simple Sequential Processes with Resources, can model many discrete event systems in which (1) multiple processes may run in parallel and (2) each execution step of each process may use multiple units from multiple resource types. This paper gives a necessary and sufficient condition for the liveness of S3PMR. A new structural concept called Structurally Circular Wait (SCW) is proposed for S3PMR. Blocking Marking (BM) associated with an SCW is defined. It is proven that a marked S3PMR is live if and only if each SCW has no BM. We use an example of multi-processor system-on-chip to show that SCW and BM can precisely characterise the (partial) deadlocks for S3PMR. Simultaneously, two examples are used to show the advantages of SCW in preventing deadlocks of S3PMR. These results are significant for the further research on dealing with the deadlock problem.
Quantum interference between transverse spatial waveguide modes.

PubMed

Mohanty, Aseema; Zhang, Mian; Dutt, Avik; Ramelow, Sven; Nussenzveig, Paulo; Lipson, Michal

2017-01-20

Integrated quantum optics has the potential to markedly reduce the footprint and resource requirements of quantum information processing systems, but its practical implementation demands broader utilization of the available degrees of freedom within the optical field. To date, integrated photonic quantum systems have primarily relied on path encoding. However, in the classical regime, the transverse spatial modes of a multi-mode waveguide have been easily manipulated using the waveguide geometry to densely encode information. Here, we demonstrate quantum interference between the transverse spatial modes within a single multi-mode waveguide using quantum circuit-building blocks. This work shows that spatial modes can be controlled to an unprecedented level and have the potential to enable practical and robust quantum information processing.
Chocolate tablet aspects of cytherean Meshkenet Tessera

NASA Technical Reports Server (NTRS)

Raitala, J.

1993-01-01

Meshkenet Tessera structures were mapped from Magellan data and several resemblances to chocolate tablet boudinage were found. The complex fault sets display polyphase tectonic sequences of a few main deformation phases. Shear and tension have contributed to the areal deformation. Main faults cut the 1600-km long Meshkenet Tessera highland into bar-like blocks which have ridge and groove pattern oriented along or at high angles to the faults. The first approach to the surface block deformation is an assumption of initial parallel shear faulting followed by a chocolate tablet boudinage. Major faults which cut Meshkenet Tessera into rectangular blocks have been active repetitively while two progressive or superposed boudinage set formations have taken place at high angles during the relaxational or flattening type deformation of the area. Chocolate tablet boudinage is caused by a layer-parallel two-dimensional extension resulting in fracturing of the competent layer. Such structures, defined by two sets of boudin neck lines at right angles to each other, have been described by a number of authors. They develop in a flattening type of bulk deformation or during superposed deformation where the rock is elongated in two dimensions parallel to the surface. This is an attempt to describe and understand the formation and development of structures of Meshkenet Tessera which has complicated fault structures.

Cross-Site Reliability of Human Induced Pluripotent Stem-Cell Derived Cardiomyocyte Based Safety Assays using Microelectrode Arrays: Results from a Blinded CiPA Pilot Study.

PubMed

Millard, Daniel; Dang, Qianyu; Shi, Hong; Zhang, Xiaou; Strock, Chris; Kraushaar, Udo; Zeng, Haoyu; Levesque, Paul; Lu, Hua-Rong; Guillon, Jean-Michel; Wu, Joseph C; Li, Yingxin; Luerman, Greg; Anson, Blake; Guo, Liang; Clements, Mike; Abassi, Yama A; Ross, James; Pierson, Jennifer; Gintant, Gary

2018-04-27

Recent in vitro cardiac safety studies demonstrate the ability of human induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) to detect electrophysiologic effects of drugs. However, variability contributed by unique approaches, procedures, cell lines and reagents across laboratories makes comparisons of results difficult, leading to uncertainty about the role of hiPSC-CMs in defining proarrhythmic risk in drug discovery and regulatory submissions. A blinded pilot study was conducted to evaluate the electrophysiologic effects of eight well-characterized drugs on four cardiomyocyte lines using a standardized protocol across three microelectrode array (MEA) platforms (18 individual studies). Drugs were selected to define assay sensitivity of prominent repolarizing currents (E-4031 for IKr, JNJ303 for IKs) and depolarizing currents (nifedipine for ICaL, mexiletine for INa) as well as drugs affecting multi-channel block (flecainide, moxifloxacin, quinidine, and ranolazine). Inclusion criteria for final analysis was based on demonstrated sensitivity to IKr block (20% prolongation with E-4031) and L-type calcium current block (20% shortening with nifedipine). Despite differences in baseline characteristics across cardiomyocyte lines, multiple sites and instrument platforms, 10 of 18 studies demonstrated adequate sensitivity to IKr block with E-4031 and ICaL block with nifedipine for inclusion in the final analysis. Concentration-dependent effects on repolarization were observed with this qualified dataset consistent with known ionic mechanisms of single and multi-channel blocking drugs. hiPSC-CMs can detect repolarization effects elicited by single and multi-channel blocking drugs after defining pharmacologic sensitivity to IKr and ICaL block, supporting further validation efforts using hiPSC-CMs for cardiac safety studies.
Multi-level bandwidth efficient block modulation codes

NASA Technical Reports Server (NTRS)

Lin, Shu

1989-01-01

The multilevel technique is investigated for combining block coding and modulation. There are four parts. In the first part, a formulation is presented for signal sets on which modulation codes are to be constructed. Distance measures on a signal set are defined and their properties are developed. In the second part, a general formulation is presented for multilevel modulation codes in terms of component codes with appropriate Euclidean distances. The distance properties, Euclidean weight distribution and linear structure of multilevel modulation codes are investigated. In the third part, several specific methods for constructing multilevel block modulation codes with interdependency among component codes are proposed. Given a multilevel block modulation code C with no interdependency among the binary component codes, the proposed methods give a multilevel block modulation code C which has the same rate as C, a minimum squared Euclidean distance not less than that of code C, a trellis diagram with the same number of states as that of C and a smaller number of nearest neighbor codewords than that of C. In the last part, error performance of block modulation codes is analyzed for an AWGN channel based on soft-decision maximum likelihood decoding. Error probabilities of some specific codes are evaluated based on their Euclidean weight distributions and simulation results.
Anatomical variation in a patient with lateral femoral cutaneous nerve entrapment neuropathy.

PubMed

Kokubo, Rinko; Kim, Kyongsong; Morimoto, Daijiro; Isu, Toyohiko; Iwamoto, Naotaka; Kitamura, Takao; Morita, Akio

2018-05-02

This 53-year-old man had a 10-year history of paresthesia and pain in the right antero-lateral thigh exacerbated by prolonged standing and walking. His symptoms improved completely but transiently by lateral femoral cutaneous nerve (LFCN) block. The diagnosis was LFCN entrapment (LFCN-EN). As additional treatment with drugs and repeat LFCN block was ineffective, we performed surgical decompression under local anesthesia. A nerve stimulator located the LFCN 4.5 cm medial to the anterior superior iliac spine, it formed a sharp curve and was embedded in connective tissue. Proximal dissection showed it to run parallel to the femoral nerve at the level of the inguinal ligament. The inguinal ligament was partially released to complete dissection/release. Postoperatively, his symptoms improved and the numeric rating scale fell from 8 to 1. Copyright © 2018. Published by Elsevier Inc.
Learning and optimization with cascaded VLSI neural network building-block chips

NASA Technical Reports Server (NTRS)

Duong, T.; Eberhardt, S. P.; Tran, M.; Daud, T.; Thakoor, A. P.

1992-01-01

To demonstrate the versatility of the building-block approach, two neural network applications were implemented on cascaded analog VLSI chips. Weights were implemented using 7-b multiplying digital-to-analog converter (MDAC) synapse circuits, with 31 x 32 and 32 x 32 synapses per chip. A novel learning algorithm compatible with analog VLSI was applied to the two-input parity problem. The algorithm combines dynamically evolving architecture with limited gradient-descent backpropagation for efficient and versatile supervised learning. To implement the learning algorithm in hardware, synapse circuits were paralleled for additional quantization levels. The hardware-in-the-loop learning system allocated 2-5 hidden neurons for parity problems. Also, a 7 x 7 assignment problem was mapped onto a cascaded 64-neuron fully connected feedback network. In 100 randomly selected problems, the network found optimal or good solutions in most cases, with settling times in the range of 7-100 microseconds.
Multiframe video coding for improved performance over wireless channels.

PubMed

Budagavi, M; Gibson, J D

2001-01-01

We propose and evaluate a multi-frame extension to block motion compensation (BMC) coding of videoconferencing-type video signals for wireless channels. The multi-frame BMC (MF-BMC) coder makes use of the redundancy that exists across multiple frames in typical videoconferencing sequences to achieve additional compression over that obtained by using the single frame BMC (SF-BMC) approach, such as in the base-level H.263 codec. The MF-BMC approach also has an inherent ability of overcoming some transmission errors and is thus more robust when compared to the SF-BMC approach. We model the error propagation process in MF-BMC coding as a multiple Markov chain and use Markov chain analysis to infer that the use of multiple frames in motion compensation increases robustness. The Markov chain analysis is also used to devise a simple scheme which randomizes the selection of the frame (amongst the multiple previous frames) used in BMC to achieve additional robustness. The MF-BMC coders proposed are a multi-frame extension of the base level H.263 coder and are found to be more robust than the base level H.263 coder when subjected to simulated errors commonly encountered on wireless channels.
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.
Generalizations of the Alternating Direction Method of Multipliers for Large-Scale and Distributed Optimization

DTIC Science & Technology

2014-05-01

exact one is solved later — as- signed as step 5 of Algorithm 2 — because at each iteration , the ADMM updates the variables in the Gauss - Seidel ...Jacobi ADMM (see Algo- rithm 5 below). Unlike the Gauss - Seidel ADMM, the Jacobi ADMM updates all the 70 blocks in parallel at every iteration : xk+1i...that extending ADMM straightforwardly from the classic Gauss - Seidel setting to the Jacobi setting, from two blocks to multiple blocks, will preserve
Genotoxicity of multi-walled carbon nanotubes at occupationally relevant doses

PubMed Central

2014-01-01

Carbon nanotubes are commercially-important products of nanotechnology; however, their low density and small size makes carbon nanotube respiratory exposures likely during their production or processing. We have previously shown mitotic spindle aberrations in cultured primary and immortalized human airway epithelial cells exposed to single-walled carbon nanotubes (SWCNT). In this study, we examined whether multi-walled carbon nanotubes (MWCNT) cause mitotic spindle damage in cultured cells at doses equivalent to 34 years of exposure at the NIOSH Recommended Exposure Limit (REL). MWCNT induced a dose responsive increase in disrupted centrosomes, abnormal mitotic spindles and aneuploid chromosome number 24 hours after exposure to 0.024, 0.24, 2.4 and 24 μg/cm2 MWCNT. Monopolar mitotic spindles comprised 95% of disrupted mitoses. Three-dimensional reconstructions of 0.1 μm optical sections showed carbon nanotubes integrated with microtubules, DNA and within the centrosome structure. Cell cycle analysis demonstrated a greater number of cells in S-phase and fewer cells in the G2 phase in MWCNT-treated compared to diluent control, indicating a G1/S block in the cell cycle. The monopolar phenotype of the disrupted mitotic spindles and the G1/S block in the cell cycle is in sharp contrast to the multi-polar spindle and G2 block in the cell cycle previously observed following exposure to SWCNT. One month following exposure to MWCNT there was a dramatic increase in both size and number of colonies compared to diluent control cultures, indicating a potential to pass the genetic damage to daughter cells. Our results demonstrate significant disruption of the mitotic spindle by MWCNT at occupationally relevant exposure levels. PMID:24479647
Operation of Direct Drive Systems: Experiments in Peak Power Tracking and Multi-Thruster Control

NASA Technical Reports Server (NTRS)

Snyder, John Steven; Brophy, John R.

2013-01-01

Direct-drive power and propulsion systems have the potential to significantly reduce the mass of high-power solar electric propulsion spacecraft, among other advantages. Recent experimental direct-drive work has significantly mitigated or retired the technical risks associated with single-thruster operation, so attention is now moving toward systems-level areas of interest. One of those areas is the use of a Hall thruster system as a peak power tracker to fully use the available power from a solar array. A simple and elegant control based on the incremental conductance method, enhanced by combining it with the unique properties of Hall thruster systems, is derived here and it is shown to track peak solar array power very well. Another area of interest is multi-thruster operation and control. Dualthruster operation was investigated in a parallel electrical configuration, with both thrusters operating from discharge power provided by a single solar array. Startup and shutdown sequences are discussed, and it is shown that multi-thruster operation and control is as simple as for a single thruster. Some system architectures require operation of multiple cathodes while they are electrically connected together. Four different methods to control the discharge current emitted by individual cathodes in this configuration are investigated, with cathode flow rate control appearing to be advantageous. Dual-parallel thruster operation with equal cathode current sharing at total powers up to 10 kW is presented.
Performance analysis of three-dimensional-triple-level cell and two-dimensional-multi-level cell NAND flash hybrid solid-state drives

NASA Astrophysics Data System (ADS)

Sakaki, Yukiya; Yamada, Tomoaki; Matsui, Chihiro; Yamaga, Yusuke; Takeuchi, Ken

2018-04-01

In order to improve performance of solid-state drives (SSDs), hybrid SSDs have been proposed. Hybrid SSDs consist of more than two types of NAND flash memories or NAND flash memories and storage-class memories (SCMs). However, the cost of hybrid SSDs adopting SCMs is more expensive than that of NAND flash only SSDs because of the high bit cost of SCMs. This paper proposes unique hybrid SSDs with two-dimensional (2D) horizontal multi-level cell (MLC)/three-dimensional (3D) vertical triple-level cell (TLC) NAND flash memories to achieve higher cost-performance. The 2D-MLC/3D-TLC hybrid SSD achieves up to 31% higher performance than the conventional 2D-MLC/2D-TLC hybrid SSD. The factors of different performance between the proposed hybrid SSD and the conventional hybrid SSD are analyzed by changing its block size, read/write/erase latencies, and write unit of 3D-TLC NAND flash memory, by means of a transaction-level modeling simulator.
A Tutorial on Parallel and Concurrent Programming in Haskell

NASA Astrophysics Data System (ADS)

Peyton Jones, Simon; Singh, Satnam

This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
Enhancing membrane protein subcellular localization prediction by parallel fusion of multi-view features.

PubMed

Yu, Dongjun; Wu, Xiaowei; Shen, Hongbin; Yang, Jian; Tang, Zhenmin; Qi, Yong; Yang, Jingyu

2012-12-01

Membrane proteins are encoded by ~ 30% in the genome and function importantly in the living organisms. Previous studies have revealed that membrane proteins' structures and functions show obvious cell organelle-specific properties. Hence, it is highly desired to predict membrane protein's subcellular location from the primary sequence considering the extreme difficulties of membrane protein wet-lab studies. Although many models have been developed for predicting protein subcellular locations, only a few are specific to membrane proteins. Existing prediction approaches were constructed based on statistical machine learning algorithms with serial combination of multi-view features, i.e., different feature vectors are simply serially combined to form a super feature vector. However, such simple combination of features will simultaneously increase the information redundancy that could, in turn, deteriorate the final prediction accuracy. That's why it was often found that prediction success rates in the serial super space were even lower than those in a single-view space. The purpose of this paper is investigation of a proper method for fusing multiple multi-view protein sequential features for subcellular location predictions. Instead of serial strategy, we propose a novel parallel framework for fusing multiple membrane protein multi-view attributes that will represent protein samples in complex spaces. We also proposed generalized principle component analysis (GPCA) for feature reduction purpose in the complex geometry. All the experimental results through different machine learning algorithms on benchmark membrane protein subcellular localization datasets demonstrate that the newly proposed parallel strategy outperforms the traditional serial approach. We also demonstrate the efficacy of the parallel strategy on a soluble protein subcellular localization dataset indicating the parallel technique is flexible to suite for other computational biology problems. The software and datasets are available at: http://www.csbio.sjtu.edu.cn/bioinf/mpsp.
High-throughput preparation of complex multi-scale patterns from block copolymer/homopolymer blend films.

PubMed

Park, Hyungmin; Kim, Jae-Up; Park, Soojin

2012-02-21

A simple, straightforward process for fabricating multi-scale micro- and nanostructured patterns from polystyrene-block-poly(2-vinylpyridine) (PS-b-P2VP)/poly(methyl methacrylate) (PMMA) homopolymer in a preferential solvent for PS and PMMA is demonstrated. When the PS-b-P2VP/PMMA blend films were spin-coated onto a silicon wafer, PS-b-P2VP micellar arrays consisting of a PS corona and a P2VP core were formed, while the PMMA macrodomains were isolated, due to the macrophase separation caused by the incompatibility between block copolymer micelles and PMMA homopolymer during the spin-coating process. With an increase of PMMA composition, the size of PMMA macrodomains increased. Moreover, the P2VP blocks have a strong interaction with a native oxide of the surface of the silicon wafer, so that the P2VP wetting layer was first formed during spin-coating, and PS nanoclusters were observed on the PMMA macrodomains beneath. Whereas when a silicon surface was modified with a PS brush layer, the PS nanoclusters underlying PMMA domains were not formed. The multi-scale patterns prepared from copolymer micelle/homopolymer blend films are used as templates for the fabrication of gold nanoparticle arrays by incorporating the gold precursor into the P2VP chains. The combination of nanostructures prepared from block copolymer micellar arrays and macrostructures induced by incompatibility between the copolymer and the homopolymer leads to the formation of complex, multi-scale surface patterns by a simple casting process. This journal is © The Royal Society of Chemistry 2012
When the lowest energy does not induce native structures: parallel minimization of multi-energy values by hybridizing searching intelligences.

PubMed

Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou

2012-01-01

Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to obtain. A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed during the searching threads are running in parallel, while each thread is searching the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving PSP problem. This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligence embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise.
When the Lowest Energy Does Not Induce Native Structures: Parallel Minimization of Multi-Energy Values by Hybridizing Searching Intelligences

PubMed Central

Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou

2012-01-01

Background Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to obtain. Results A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed during the searching threads are running in parallel, while each thread is searching the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving PSP problem. Conclusions This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligence embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise. PMID:23028708
Benchmarking NWP Kernels on Multi- and Many-core Processors

NASA Astrophysics Data System (ADS)

Michalakes, J.; Vachharajani, M.

2008-12-01

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost- performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine- grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc. (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
Multithreaded Stochastic PDES for Reactions and Diffusions in Neurons.

PubMed

Lin, Zhongwei; Tropper, Carl; Mcdougal, Robert A; Patoary, Mohammand Nazrul Ishlam; Lytton, William W; Yao, Yiping; Hines, Michael L

2017-07-01

Cells exhibit stochastic behavior when the number of molecules is small. Hence a stochastic reaction-diffusion simulator capable of working at scale can provide a more accurate view of molecular dynamics within the cell. This paper describes a parallel discrete event simulator, Neuron Time Warp-Multi Thread (NTW-MT), developed for the simulation of reaction diffusion models of neurons. To the best of our knowledge, this is the first parallel discrete event simulator oriented towards stochastic simulation of chemical reactions in a neuron. The simulator was developed as part of the NEURON project. NTW-MT is optimistic and thread-based, which attempts to capitalize on multi-core architectures used in high performance machines. It makes use of a multi-level queue for the pending event set and a single roll-back message in place of individual anti-messages to disperse contention and decrease the overhead of processing rollbacks. Global Virtual Time is computed asynchronously both within and among processes to get rid of the overhead for synchronizing threads. Memory usage is managed in order to avoid locking and unlocking when allocating and de-allocating memory and to maximize cache locality. We verified our simulator on a calcium buffer model. We examined its performance on a calcium wave model, comparing it to the performance of a process based optimistic simulator and a threaded simulator which uses a single priority queue for each thread. Our multi-threaded simulator is shown to achieve superior performance to these simulators. Finally, we demonstrated the scalability of our simulator on a larger CICR model and a more detailed CICR model.
Automatic classification and detection of clinically relevant images for diabetic retinopathy

NASA Astrophysics Data System (ADS)

Xu, Xinyu; Li, Baoxin

2008-03-01

We proposed a novel approach to automatic classification of Diabetic Retinopathy (DR) images and retrieval of clinically-relevant DR images from a database. Given a query image, our approach first classifies the image into one of the three categories: microaneurysm (MA), neovascularization (NV) and normal, and then it retrieves DR images that are clinically-relevant to the query image from an archival image database. In the classification stage, the query DR images are classified by the Multi-class Multiple-Instance Learning (McMIL) approach, where images are viewed as bags, each of which contains a number of instances corresponding to non-overlapping blocks, and each block is characterized by low-level features including color, texture, histogram of edge directions, and shape. McMIL first learns a collection of instance prototypes for each class that maximizes the Diverse Density function using Expectation- Maximization algorithm. A nonlinear mapping is then defined using the instance prototypes and maps every bag to a point in a new multi-class bag feature space. Finally a multi-class Support Vector Machine is trained in the multi-class bag feature space. In the retrieval stage, we retrieve images from the archival database who bear the same label with the query image, and who are the top K nearest neighbors of the query image in terms of similarity in the multi-class bag feature space. The classification approach achieves high classification accuracy, and the retrieval of clinically-relevant images not only facilitates utilization of the vast amount of hidden diagnostic knowledge in the database, but also improves the efficiency and accuracy of DR lesion diagnosis and assessment.
A detail-preserved and luminance-consistent multi-exposure image fusion algorithm

NASA Astrophysics Data System (ADS)

Wang, Guanquan; Zhou, Yue

2018-04-01

When irradiance across a scene varies greatly, we can hardly get an image of the scene without over- or underexposure area, because of the constraints of cameras. Multi-exposure image fusion (MEF) is an effective method to deal with this problem by fusing multi-exposure images of a static scene. A novel MEF method is described in this paper. In the proposed algorithm, coarser-scale luminance consistency is preserved by contribution adjustment using the luminance information between blocks; detail-preserved smoothing filter can stitch blocks smoothly without losing details. Experiment results show that the proposed method performs well in preserving luminance consistency and details.
FaCSI: A block parallel preconditioner for fluid-structure interaction in hemodynamics

NASA Astrophysics Data System (ADS)

Deparis, Simone; Forti, Davide; Grandperrin, Gwenol; Quarteroni, Alfio

2016-12-01

Modeling Fluid-Structure Interaction (FSI) in the vascular system is mandatory to reliably compute mechanical indicators in vessels undergoing large deformations. In order to cope with the computational complexity of the coupled 3D FSI problem after discretizations in space and time, a parallel solution is often mandatory. In this paper we propose a new block parallel preconditioner for the coupled linearized FSI system obtained after space and time discretization. We name it FaCSI to indicate that it exploits the Factorized form of the linearized FSI matrix, the use of static Condensation to formally eliminate the interface degrees of freedom of the fluid equations, and the use of a SIMPLE preconditioner for saddle-point problems. FaCSI is built upon a block Gauss-Seidel factorization of the FSI Jacobian matrix and it uses ad-hoc preconditioners for each physical component of the coupled problem, namely the fluid, the structure and the geometry. In the fluid subproblem, after operating static condensation of the interface fluid variables, we use a SIMPLE preconditioner on the reduced fluid matrix. Moreover, to efficiently deal with a large number of processes, FaCSI exploits efficient single field preconditioners, e.g., based on domain decomposition or the multigrid method. We measure the parallel performances of FaCSI on a benchmark cylindrical geometry and on a problem of physiological interest, namely the blood flow through a patient-specific femoropopliteal bypass. We analyze the dependence of the number of linear solver iterations on the cores count (scalability of the preconditioner) and on the mesh size (optimality).

Jali - Unstructured Mesh Infrastructure for Multi-Physics Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Garimella, Rao V; Berndt, Markus; Coon, Ethan

2017-04-13

Jali is a parallel unstructured mesh infrastructure library designed for use by multi-physics simulations. It supports 2D and 3D arbitrary polyhedral meshes distributed over hundreds to thousands of nodes. Jali can read write Exodus II meshes along with fields and sets on the mesh and support for other formats is partially implemented or is (https://github.com/MeshToolkit/MSTK), an open source general purpose unstructured mesh infrastructure library from Los Alamos National Laboratory. While it has been made to work with other mesh frameworks such as MOAB and STKmesh in the past, support for maintaining the interface to these frameworks has been suspended formore » now. Jali supports distributed as well as on-node parallelism. Support of on-node parallelism is through direct use of the the mesh in multi-threaded constructs or through the use of "tiles" which are submeshes or sub-partitions of a partition destined for a compute node.« less
Closed-loop adaptation of neurofeedback based on mental effort facilitates reinforcement learning of brain self-regulation.

PubMed

Bauer, Robert; Fels, Meike; Royter, Vladislav; Raco, Valerio; Gharabaghi, Alireza

2016-09-01

Considering self-rated mental effort during neurofeedback may improve training of brain self-regulation. Twenty-one healthy, right-handed subjects performed kinesthetic motor imagery of opening their left hand, while threshold-based classification of beta-band desynchronization resulted in proprioceptive robotic feedback. The experiment consisted of two blocks in a cross-over design. The participants rated their perceived mental effort nine times per block. In the adaptive block, the threshold was adjusted on the basis of these ratings whereas adjustments were carried out at random in the other block. Electroencephalography was used to examine the cortical activation patterns during the training sessions. The perceived mental effort was correlated with the difficulty threshold of neurofeedback training. Adaptive threshold-setting reduced mental effort and increased the classification accuracy and positive predictive value. This was paralleled by an inter-hemispheric cortical activation pattern in low frequency bands connecting the right frontal and left parietal areas. Optimal balance of mental effort was achieved at thresholds significantly higher than maximum classification accuracy. Rating of mental effort is a feasible approach for effective threshold-adaptation during neurofeedback training. Closed-loop adaptation of the neurofeedback difficulty level facilitates reinforcement learning of brain self-regulation. Copyright © 2016 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
Liquid-Nitrogen Test for Blocked Tubes

NASA Technical Reports Server (NTRS)

Wagner, W. R.

1984-01-01

Nondestructive test identifies obstructed tube in array of parallel tubes. Trickle of liquid nitrogen allowed to flow through tube array until array accumulates substantial formation of frost from moisture in air. Flow stopped and warm air introduced into inlet manifold to heat tubes in array. Tubes still frosted after others defrosted identified as obstructed tubes. Applications include inspection of flow systems having parallel legs.
Nerve Fiber Activation During Peripheral Nerve Field Stimulation: Importance of Electrode Orientation and Estimation of Area of Paresthesia.

PubMed

Frahm, Ken Steffen; Hennings, Kristian; Vera-Portocarrero, Louis; Wacnik, Paul W; Mørch, Carsten Dahl

2016-04-01

Low back pain is one of the indications for using peripheral nerve field stimulation (PNFS). However, the effect of PNFS varies between patients; several stimulation parameters have not been investigated in depth, such as orientation of the nerve fiber in relation to the electrode. While placing the electrode parallel to the nerve fiber may give lower activation thresholds, anodal blocking may occur when the propagating action potential passes an anode. A finite element model was used to simulate the extracellular potential during PNFS. This was combined with an active cable model of Aβ and Aδ nerve fibers. It was investigated how the angle between the nerve fiber and electrode affected the nerve activation and whether anodal blocking could occur. Finally, the area of paresthesia was estimated and compared with any concomitant Aδ fiber activation. The lowest threshold was found when nerve and electrode were in parallel, and that anodal blocking did not appear to occur during PNFS. The activation of Aβ fibers was within therapeutic range (<10V) of PNFS; however, within this range, Aδ fiber activation also may occur. The combined area of activated Aβ fibers (paresthesia) was at least two times larger than Aδ fibers for similar stimulation intensities. No evidence of anodal blocking was observed in this PNFS model. The thresholds were lowest when the nerves and electrodes were parallel; thus, it may be relevant to investigate the overall position of the target nerve fibers prior to electrode placement. © 2015 International Neuromodulation Society.
Terrain types and local-scale stratigraphy of grooved terrain on ganymede

NASA Technical Reports Server (NTRS)

Murchie, Scott L.; Head, James W.; Helfenstein, Paul; Plescia, Jeffrey B.

1986-01-01

Grooved terrain is subdivided on the basis of pervasive morphology into: (1) groove lanes - elongate parallel groove bands, (2) grooved polygons - polygonal domains of parallel grooves, (3) reticulate terrain - polygonal domains of orthogonal grooves, and (4) complex grooved terrain - polygons with several complexly cross-cutting groove sets. Detailed geologic mapping of select areas, employing previously established conventions for determining relative age relations, reveals a general three-stage sequence of grooved terrain emplacement: first, dissection of the lithosphere by throughgoing grooves, and pervasive deformation of intervening blocks; second, extensive flooding and continued deformation of the intervening blocks; third, repeated superposition of groove lanes concentrated at sites of initial throughgoing grooves. This sequence is corroborated by crater-density measurements. Dominant orientations of groove sets are parallel to relict zones of weakness that probably were reactivated during grooved terrain formation. Groove lane morphology and development consistent with that predicted for passive rifts suggests a major role of global expansion in grooved terrain formation.
An efficient parallel-processing method for transposing large matrices in place.

PubMed

Portnoff, M R

1999-01-01

We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Gate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.
A Dynamic Finite Element Method for Simulating the Physics of Faults Systems

NASA Astrophysics Data System (ADS)

Saez, E.; Mora, P.; Gross, L.; Weatherley, D.

2004-12-01

We introduce a dynamic Finite Element method using a novel high level scripting language to describe the physical equations, boundary conditions and time integration scheme. The library we use is the parallel Finley library: a finite element kernel library, designed for solving large-scale problems. It is incorporated as a differential equation solver into a more general library called escript, based on the scripting language Python. This library has been developed to facilitate the rapid development of 3D parallel codes, and is optimised for the Australian Computational Earth Systems Simulator Major National Research Facility (ACcESS MNRF) supercomputer, a 208 processor SGI Altix with a peak performance of 1.1 TFlops. Using the scripting approach we obtain a parallel FE code able to take advantage of the computational efficiency of the Altix 3700. We consider faults as material discontinuities (the displacement, velocity, and acceleration fields are discontinuous at the fault), with elastic behavior. The stress continuity at the fault is achieved naturally through the expression of the fault interactions in the weak formulation. The elasticity problem is solved explicitly in time, using the Saint Verlat scheme. Finally, we specify a suitable frictional constitutive relation and numerical scheme to simulate fault behaviour. Our model is based on previous work on modelling fault friction and multi-fault systems using lattice solid-like models. We adapt the 2D model for simulating the dynamics of parallel fault systems described to the Finite-Element method. The approach uses a frictional relation along faults that is slip and slip-rate dependent, and the numerical integration approach introduced by Mora and Place in the lattice solid model. In order to illustrate the new Finite Element model, single and multi-fault simulation examples are presented.
P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.

PubMed

Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang

2017-03-14

The increasing studies have been conducted using whole genome DNA methylation detection as one of the most important part of epigenetics research to find the significant relationships among DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping the bisulfite treated sequence to the whole genome has been the main method to study DNA cytosine methylation. However, today's relative tools almost suffer from inaccuracies and time-consuming problems. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. By having an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt could analyze and predict the DNA methylation status. But when Hint-Hunt tried to predict DNA methylation status with large-scale dataset, there are still slow speed and low temporal-spatial efficiency problems. In order to solve the problems of Smith-Waterman dynamic programming and low temporal-spatial efficiency, we further design a deep parallelized whole genome DNA methylation detection tool ("P-Hint-Hunt") on Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up to process large-scale dataset, and could run both on CPU and Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on TH-2 supercomputer in different scales. The experimental results illuminate our tools eliminate the deviation caused by bisulfite treatment in mapping procedure and the multi-level parallel program yields a 48 times speed-up with 64 threads. P-Hint-Hunt gain a deep acceleration on CPU and Intel Xeon Phi heterogeneous platform, which gives full play of the advantages of multi-cores (CPU) and many-cores (Phi).
Topical perspective on massive threading and parallelism.

PubMed

Farber, Robert M

2011-09-01

Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three-orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive-threading and massive-parallelism such as the 1.3 million threads Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world is how to capitalize on these new massively threaded computational architectures--especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelize with some insight into the future. Published by Elsevier Inc.
Large Scale Document Inversion using a Multi-threaded Computing System

PubMed Central

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2018-01-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701
Large Scale Document Inversion using a Multi-threaded Computing System.

PubMed

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2017-06-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
SmaggIce 2.0: Additional Capabilities for Interactive Grid Generation of Iced Airfoils

NASA Technical Reports Server (NTRS)

Kreeger, Richard E.; Baez, Marivell; Braun, Donald C.; Schilling, Herbert W.; Vickerman, Mary B.

2008-01-01

The Surface Modeling and Grid Generation for Iced Airfoils (SmaggIce) software toolkit has been extended to allow interactive grid generation for multi-element iced airfoils. The essential phases of an icing effects study include geometry preparation, block creation and grid generation. SmaggIce Version 2.0 now includes these main capabilities for both single and multi-element airfoils, plus an improved flow solver interface and a variety of additional tools to enhance the efficiency and accuracy of icing effects studies. An overview of these features is given, especially the new multi-element blocking strategy using the multiple wakes method. Examples are given which illustrate the capabilities of SmaggIce for conducting an icing effects study for both single and multi-element airfoils.
Identification and Reconfigurable Control of Impaired Multi-Rotor Drones

NASA Technical Reports Server (NTRS)

Stepanyan, Vahram; Krishnakumar, Kalmanje; Bencomo, Alfredo

2016-01-01

The paper presents an algorithm for control and safe landing of impaired multi-rotor drones when one or more motors fail simultaneously or in any sequence. It includes three main components: an identification block, a reconfigurable control block, and a decisions making block. The identification block monitors each motor load characteristics and the current drawn, based on which the failures are detected. The control block generates the required total thrust and three axis torques for the altitude, horizontal position and/or orientation control of the drone based on the time scale separation and nonlinear dynamic inversion. The horizontal displacement is controlled by modulating the roll and pitch angles. The decision making algorithm maps the total thrust and three torques into the individual motor thrusts based on the information provided by the identification block. The drone continues the mission execution as long as the number of functioning motors provide controllability of it. Otherwise, the controller is switched to the safe mode, which gives up the yaw control, commands a safe landing spot and descent rate while maintaining the horizontal attitude.
A parallel computer implementation of fast low-rank QR approximation of the Biot-Savart law

DOE Office of Scientific and Technical Information (OSTI.GOV)

White, D A; Fasenfest, B J; Stowell, M L

2005-11-07

In this paper we present a low-rank QR method for evaluating the discrete Biot-Savart law on parallel computers. It is assumed that the known current density and the unknown magnetic field are both expressed in a finite element expansion, and we wish to compute the degrees-of-freedom (DOF) in the basis function expansion of the magnetic field. The matrix that maps the current DOF to the field DOF is full, but if the spatial domain is properly partitioned the matrix can be written as a block matrix, with blocks representing distant interactions being low rank and having a compressed QR representation.more » The matrix partitioning is determined by the number of processors, the rank of each block (i.e. the compression) is determined by the specific geometry and is computed dynamically. In this paper we provide the algorithmic details and present computational results for large-scale computations.« less
OpenMP parallelization of a gridded SWAT (SWATG)

NASA Astrophysics Data System (ADS)

Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin

2017-12-01

Large-scale, long-term and high spatial resolution simulation is a common issue in environmental modeling. A Gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG) that integrates grid modeling scheme with different spatial representations also presents such problems. The time-consuming problem affects applications of very high resolution large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application interface is integrated with SWATG (called SWATGP) to accelerate grid modeling based on the HRU level. Such parallel implementation takes better advantage of the computational power of a shared memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling over a roughly 2000 km2 watershed with 1 CPU and a 15 thread configuration. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computations of environmental models are beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management in addition to offering data fusion and model coupling ability.
Fast parallel approach for 2-D DHT-based real-valued discrete Gabor transform.

PubMed

Tao, Liang; Kwan, Hon Keung

2009-12-01

Two-dimensional fast Gabor transform algorithms are useful for real-time applications due to the high computational complexity of the traditional 2-D complex-valued discrete Gabor transform (CDGT). This paper presents two block time-recursive algorithms for 2-D DHT-based real-valued discrete Gabor transform (RDGT) and its inverse transform and develops a fast parallel approach for the implementation of the two algorithms. The computational complexity of the proposed parallel approach is analyzed and compared with that of the existing 2-D CDGT algorithms. The results indicate that the proposed parallel approach is attractive for real time image processing.
Parallel pathways for cross-modal memory retrieval in Drosophila.

PubMed

Zhang, Xiaonan; Ren, Qingzhong; Guo, Aike

2013-05-15

Memory-retrieval processing of cross-modal sensory preconditioning is vital for understanding the plasticity underlying the interactions between modalities. As part of the sensory preconditioning paradigm, it has been hypothesized that the conditioned response to an unreinforced cue depends on the memory of the reinforced cue via a sensory link between the two cues. To test this hypothesis, we studied cross-modal memory-retrieval processing in a genetically tractable model organism, Drosophila melanogaster. By expressing the dominant temperature-sensitive shibire(ts1) (shi(ts1)) transgene, which blocks synaptic vesicle recycling of specific neural subsets with the Gal4/UAS system at the restrictive temperature, we specifically blocked visual and olfactory memory retrieval, either alone or in combination; memory acquisition remained intact for these modalities. Blocking the memory retrieval of the reinforced olfactory cues did not impair the conditioned response to the unreinforced visual cues or vice versa, in contrast to the canonical memory-retrieval processing of sensory preconditioning. In addition, these conditioned responses can be abolished by blocking the memory retrieval of the two modalities simultaneously. In sum, our results indicated that a conditioned response to an unreinforced cue in cross-modal sensory preconditioning can be recalled through parallel pathways.
Controlling the Morphology of Side Chain Liquid Crystalline Block Copolymer Thin Films through Variations in Liquid Crystalline Content

PubMed Central

Verploegen, Eric; Zhang, Tejia; Jung, Yeon Sik; Ross, Caroline; Hammond, Paula T.

2009-01-01

In this paper we describe methods for manipulating the morphology of side-chain liquid crystalline block copolymers through variations in the liquid crystalline content. By systematically controlling the covalent attachment of side chain liquid crystals to a block copolymer (BCP) backbone, the morphology of both the liquid crystalline (LC) mesophase and the phase segregated BCP microstructures can be precisely manipulated. Increases in LC functionalization lead to stronger preferences for the anchoring of the LC mesophase relative to the substrate and the inter-material dividing surface (IMDS). By manipulating the strength of these interactions the arrangement and ordering of the ultrathin film block copolymer nanostructures can be controlled, yielding a range of morphologies that includes perpendicular and parallel cylinders, as well as both perpendicular and parallel lamellae. Additionally, we demonstrate the utilization of selective etching to create a nanoporous liquid crystalline polymer thin film. The unique control over the orientation and order of the self-assembled morphologies with respect to the substrate will allow for the custom design of thin films for specific nano-patterning applications without manipulation of the surface chemistry or the application of external fields. PMID:18763835
Controlling the morphology of side chain liquid crystalline block copolymer thin films through variations in liquid crystalline content.

PubMed

Verploegen, Eric; Zhang, Tejia; Jung, Yeon Sik; Ross, Caroline; Hammond, Paula T

2008-10-01

In this paper, we describe methods for manipulating the morphology of side-chain liquid crystalline block copolymers through variations in the liquid crystalline content. By systematically controlling the covalent attachment of side chain liquid crystals to a block copolymer (BCP) backbone, the morphology of both the liquid crystalline (LC) mesophase and the phase-segregated BCP microstructures can be precisely manipulated. Increases in LC functionalization lead to stronger preferences for the anchoring of the LC mesophase relative to the substrate and the intermaterial dividing surface. By manipulating the strength of these interactions, the arrangement and ordering of the ultrathin film block copolymer nanostructures can be controlled, yielding a range of morphologies that includes perpendicular and parallel cylinders, as well as both perpendicular and parallel lamellae. Additionally, we demonstrate the utilization of selective etching to create a nanoporous liquid crystalline polymer thin film. The unique control over the orientation and order of the self-assembled morphologies with respect to the substrate will allow for the custom design of thin films for specific nanopatterning applications without manipulation of the surface chemistry or the application of external fields.

Chondrocyte Deformations as a Function of Tibiofemoral Joint Loading Predicted by a Generalized High-Throughput Pipeline of Multi-Scale Simulations

PubMed Central

Sibole, Scott C.; Erdemir, Ahmet

2012-01-01

Cells of the musculoskeletal system are known to respond to mechanical loading and chondrocytes within the cartilage are not an exception. However, understanding how joint level loads relate to cell level deformations, e.g. in the cartilage, is not a straightforward task. In this study, a multi-scale analysis pipeline was implemented to post-process the results of a macro-scale finite element (FE) tibiofemoral joint model to provide joint mechanics based displacement boundary conditions to micro-scale cellular FE models of the cartilage, for the purpose of characterizing chondrocyte deformations in relation to tibiofemoral joint loading. It was possible to identify the load distribution within the knee among its tissue structures and ultimately within the cartilage among its extracellular matrix, pericellular environment and resident chondrocytes. Various cellular deformation metrics (aspect ratio change, volumetric strain, cellular effective strain and maximum shear strain) were calculated. To illustrate further utility of this multi-scale modeling pipeline, two micro-scale cartilage constructs were considered: an idealized single cell at the centroid of a 100×100×100 μm block commonly used in past research studies, and an anatomically based (11 cell model of the same volume) representation of the middle zone of tibiofemoral cartilage. In both cases, chondrocytes experienced amplified deformations compared to those at the macro-scale, predicted by simulating one body weight compressive loading on the tibiofemoral joint. In the 11 cell case, all cells experienced less deformation than the single cell case, and also exhibited a larger variance in deformation compared to other cells residing in the same block. The coupling method proved to be highly scalable due to micro-scale model independence that allowed for exploitation of distributed memory computing architecture. The method’s generalized nature also allows for substitution of any macro-scale and/or micro-scale model providing application for other multi-scale continuum mechanics problems. PMID:22649535
Embedded Implementation of VHR Satellite Image Segmentation

PubMed Central

Li, Chao; Balla-Arabé, Souleymane; Ginhac, Dominique; Yang, Fan

2016-01-01

Processing and analysis of Very High Resolution (VHR) satellite images provide a mass of crucial information, which can be used for urban planning, security issues or environmental monitoring. However, they are computationally expensive and, thus, time consuming, while some of the applications, such as natural disaster monitoring and prevention, require high efficiency performance. Fortunately, parallel computing techniques and embedded systems have made great progress in recent years, and a series of massively parallel image processing devices, such as digital signal processors or Field Programmable Gate Arrays (FPGAs), have been made available to engineers at a very convenient price and demonstrate significant advantages in terms of running-cost, embeddability, power consumption flexibility, etc. In this work, we designed a texture region segmentation method for very high resolution satellite images by using the level set algorithm and the multi-kernel theory in a high-abstraction C environment and realize its register-transfer level implementation with the help of a new proposed high-level synthesis-based design flow. The evaluation experiments demonstrate that the proposed design can produce high quality image segmentation with a significant running-cost advantage. PMID:27240370
Parallelized multi–graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy

PubMed Central

Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.

2014-01-01

Abstract. Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
Equalizer: a scalable parallel rendering framework.

PubMed

Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato

2009-01-01

Continuing improvements in CPU and GPU performances as well as increasing multi-core processor and cluster-based parallelism demand for flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.
Preoperative versus postoperative ultrasound-guided rectus sheath block for improving pain, sleep quality and cytokine levels of patients with open midline incisions undergoing transabdominal gynaecological operation: study protocol for a randomised controlled trial.

PubMed

Jin, Feng; Li, Xiao-Qian; Tan, Wen-Fei; Ma, Hong; Lu, Huang-Wei

2015-12-10

Rectus sheath block (RSB) is used for postoperative pain relief in patients undergoing abdominal surgery with midline incision. Preoperative RSB has been shown to be effective, but it has not been compared with postoperative RSB. The aim of the present study is to evaluate postoperative pain, sleep quality and changes in the cytokine levels of patients undergoing gynaecological surgery with RSB performed preoperatively versus postoperatively. This study is a prospective, randomised, controlled (randomised, parallel group, concealed allocation), single-blinded trial. All patients undergoing transabdominal gynaecological surgery will be randomised 1:1 to the treatment intervention with general anaesthesia as an adjunct to preoperative or postoperative RSB. The objective of the trial is to evaluate postoperative pain, sleep quality and changes in the cytokine levels of patients undergoing gynaecological surgery with RSB performed preoperatively (n = 32) versus postoperatively (n = 32). All of the patients, irrespective of group allocation, will receive patient-controlled intravenous analgesia (PCIA) with oxycodone. The primary objective is to compare the interval between leaving the post-anaesthesia care unit and receiving the first PCIA bolus injection on the first postoperative night between patients who receive preoperative versus postoperative RSB. The secondary objectives will be to compare (1) cumulative oxycodone consumption at 24 hours after surgery; (2) postoperative sleep quality, as measured using a BIS-Vista monitor during the first night after surgery; and (3) cytokine levels (interleukin-1, interleukin-6, tumour necrosis factor-α and interferon-γ) during surgery and at 24 and 48 hours postoperatively. Clinical experience has suggested that RSB is a very effective postoperative analgesic technique, and we will answer the following questions with this trial. Do preoperative block and postoperative block have the same duration of analgesic effects? Can postoperative block extend the analgesic time? The results of this study could have actual clinical applications that could help to reduce postoperative pain and shorten hospital stays. Current Controlled Trials NCT02477098 15 June 2015.
A framework for grand scale parallelization of the combined finite discrete element method in 2d

NASA Astrophysics Data System (ADS)

Lei, Z.; Rougier, E.; Knight, E. E.; Munjiza, A.

2014-09-01

Within the context of rock mechanics, the Combined Finite-Discrete Element Method (FDEM) has been applied to many complex industrial problems such as block caving, deep mining techniques (tunneling, pillar strength, etc.), rock blasting, seismic wave propagation, packing problems, dam stability, rock slope stability, rock mass strength characterization problems, etc. The reality is that most of these were accomplished in a 2D and/or single processor realm. In this work a hardware independent FDEM parallelization framework has been developed using the Virtual Parallel Machine for FDEM, (V-FDEM). With V-FDEM, a parallel FDEM software can be adapted to different parallel architecture systems ranging from just a few to thousands of cores.
High-throughput preparation of complex multi-scale patterns from block copolymer/homopolymer blend films

NASA Astrophysics Data System (ADS)

Park, Hyungmin; Kim, Jae-Up; Park, Soojin

2012-02-01

A simple, straightforward process for fabricating multi-scale micro- and nanostructured patterns from polystyrene-block-poly(2-vinylpyridine) (PS-b-P2VP)/poly(methyl methacrylate) (PMMA) homopolymer in a preferential solvent for PS and PMMA is demonstrated. When the PS-b-P2VP/PMMA blend films were spin-coated onto a silicon wafer, PS-b-P2VP micellar arrays consisting of a PS corona and a P2VP core were formed, while the PMMA macrodomains were isolated, due to the macrophase separation caused by the incompatibility between block copolymer micelles and PMMA homopolymer during the spin-coating process. With an increase of PMMA composition, the size of PMMA macrodomains increased. Moreover, the P2VP blocks have a strong interaction with a native oxide of the surface of the silicon wafer, so that the P2VP wetting layer was first formed during spin-coating, and PS nanoclusters were observed on the PMMA macrodomains beneath. Whereas when a silicon surface was modified with a PS brush layer, the PS nanoclusters underlying PMMA domains were not formed. The multi-scale patterns prepared from copolymer micelle/homopolymer blend films are used as templates for the fabrication of gold nanoparticle arrays by incorporating the gold precursor into the P2VP chains. The combination of nanostructures prepared from block copolymer micellar arrays and macrostructures induced by incompatibility between the copolymer and the homopolymer leads to the formation of complex, multi-scale surface patterns by a simple casting process.A simple, straightforward process for fabricating multi-scale micro- and nanostructured patterns from polystyrene-block-poly(2-vinylpyridine) (PS-b-P2VP)/poly(methyl methacrylate) (PMMA) homopolymer in a preferential solvent for PS and PMMA is demonstrated. When the PS-b-P2VP/PMMA blend films were spin-coated onto a silicon wafer, PS-b-P2VP micellar arrays consisting of a PS corona and a P2VP core were formed, while the PMMA macrodomains were isolated, due to the macrophase separation caused by the incompatibility between block copolymer micelles and PMMA homopolymer during the spin-coating process. With an increase of PMMA composition, the size of PMMA macrodomains increased. Moreover, the P2VP blocks have a strong interaction with a native oxide of the surface of the silicon wafer, so that the P2VP wetting layer was first formed during spin-coating, and PS nanoclusters were observed on the PMMA macrodomains beneath. Whereas when a silicon surface was modified with a PS brush layer, the PS nanoclusters underlying PMMA domains were not formed. The multi-scale patterns prepared from copolymer micelle/homopolymer blend films are used as templates for the fabrication of gold nanoparticle arrays by incorporating the gold precursor into the P2VP chains. The combination of nanostructures prepared from block copolymer micellar arrays and macrostructures induced by incompatibility between the copolymer and the homopolymer leads to the formation of complex, multi-scale surface patterns by a simple casting process. Electronic supplementary information (ESI) available: AFM images of PS-b-P2VP/PMMA blend films and cross-sectional line scans. See DOI: 10.1039/c2nr11792d
A Blocked Linear Method for Optimizing Large Parameter Sets in Variational Monte Carlo

DOE PAGES

Zhao, Luning; Neuscamman, Eric

2017-05-17

We present a modification to variational Monte Carlo’s linear method optimization scheme that addresses a critical memory bottleneck while maintaining compatibility with both the traditional ground state variational principle and our recently-introduced variational principle for excited states. For wave function ansatzes with tens of thousands of variables, our modification reduces the required memory per parallel process from tens of gigabytes to hundreds of megabytes, making the methodology a much better fit for modern supercomputer architectures in which data communication and per-process memory consumption are primary concerns. We verify the efficacy of the new optimization scheme in small molecule tests involvingmore » both the Hilbert space Jastrow antisymmetric geminal power ansatz and real space multi-Slater Jastrow expansions. Satisfied with its performance, we have added the optimizer to the QMCPACK software package, with which we demonstrate on a hydrogen ring a prototype approach for making systematically convergent, non-perturbative predictions of Mott-insulators’ optical band gaps.« less
Hardware-software face detection system based on multi-block local binary patterns

NASA Astrophysics Data System (ADS)

Acasandrei, Laurentiu; Barriga, Angel

2015-03-01

Face detection is an important aspect for biometrics, video surveillance and human computer interaction. Due to the complexity of the detection algorithms any face detection system requires a huge amount of computational and memory resources. In this communication an accelerated implementation of MB LBP face detection algorithm targeting low frequency, low memory and low power embedded system is presented. The resulted implementation is time deterministic and uses a customizable AMBA IP hardware accelerator. The IP implements the kernel operations of the MB-LBP algorithm and can be used as universal accelerator for MB LBP based applications. The IP employs 8 parallel MB-LBP feature evaluators cores, uses a deterministic bandwidth, has a low area profile and the power consumption is ~95 mW on a Virtex5 XC5VLX50T. The resulted implementation acceleration gain is between 5 to 8 times, while the hardware MB-LBP feature evaluation gain is between 69 and 139 times.
User's guide to the Fault Inferring Nonlinear Detection System (FINDS) computer program

NASA Technical Reports Server (NTRS)

Caglayan, A. K.; Godiwala, P. M.; Satz, H. S.

1988-01-01

Described are the operation and internal structure of the computer program FINDS (Fault Inferring Nonlinear Detection System). The FINDS algorithm is designed to provide reliable estimates for aircraft position, velocity, attitude, and horizontal winds to be used for guidance and control laws in the presence of possible failures in the avionics sensors. The FINDS algorithm was developed with the use of a digital simulation of a commercial transport aircraft and tested with flight recorded data. The algorithm was then modified to meet the size constraints and real-time execution requirements on a flight computer. For the real-time operation, a multi-rate implementation of the FINDS algorithm has been partitioned to execute on a dual parallel processor configuration: one based on the translational dynamics and the other on the rotational kinematics. The report presents an overview of the FINDS algorithm, the implemented equations, the flow charts for the key subprograms, the input and output files, program variable indexing convention, subprogram descriptions, and the common block descriptions used in the program.
Symbiotic Navigation in Multi-Robot Systems with Remote Obstacle Knowledge Sharing

PubMed Central

Ravankar, Abhijeet; Ravankar, Ankit A.; Kobayashi, Yukinori; Emaru, Takanori

2017-01-01

Large scale operational areas often require multiple service robots for coverage and task parallelism. In such scenarios, each robot keeps its individual map of the environment and serves specific areas of the map at different times. We propose a knowledge sharing mechanism for multiple robots in which one robot can inform other robots about the changes in map, like path blockage, or new static obstacles, encountered at specific areas of the map. This symbiotic information sharing allows the robots to update remote areas of the map without having to explicitly navigate those areas, and plan efficient paths. A node representation of paths is presented for seamless sharing of blocked path information. The transience of obstacles is modeled to track obstacles which might have been removed. A lazy information update scheme is presented in which only relevant information affecting the current task is updated for efficiency. The advantages of the proposed method for path planning are discussed against traditional method with experimental results in both simulation and real environments. PMID:28678193
A Fault Oblivious Extreme-Scale Execution Environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

McKie, Jim

The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to the data and work distribution for massively parallel, fault oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today’s machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massivemore » data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research has prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We have experimented with alternative task/dataflow programming models and shown scalability in some cases to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next generation exascale operating systems. Our OS work focused on adaptive, application tailored OS services optimized for multi → many core processors. We developed a new operating system NIX that supports role-based allocation of cores to processes which was released to open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on distributed, fault tolerant key-value store and identified scaling issues. A second fault tolerant task parallel library was developed, based on the Linda tuple space model, that used low level interconnect primitives for optimized communication. We designed fault tolerance mechanisms for task parallel computations employing work stealing for load balancing that scaled to the largest existing supercomputers. Finally, we implemented the Elastic Building Blocks runtime, a library to manage object-oriented distributed software components. To support the research, we won two INCITE awards for time on Intrepid (BG/P) and Mira (BG/Q). Much of our work has had impact in the OS and runtime community through the ASCR Exascale OS/R workshop and report, leading to the research agenda of the Exascale OS/R program. Our project was, however, also affected by attrition of multiple PIs. While the PIs continued to participate and offer guidance as time permitted, losing these key individuals was unfortunate both for the project and for the DOE HPC community.« less
A numerical multi-scale model to predict macroscopic material anisotropy of multi-phase steels from crystal plasticity material definitions

NASA Astrophysics Data System (ADS)

Ravi, Sathish Kumar; Gawad, Jerzy; Seefeldt, Marc; Van Bael, Albert; Roose, Dirk

2017-10-01

A numerical multi-scale model is being developed to predict the anisotropic macroscopic material response of multi-phase steel. The embedded microstructure is given by a meso-scale Representative Volume Element (RVE), which holds the most relevant features like phase distribution, grain orientation, morphology etc., in sufficient detail to describe the multi-phase behavior of the material. A Finite Element (FE) mesh of the RVE is constructed using statistical information from individual phases such as grain size distribution and ODF. The material response of the RVE is obtained for selected loading/deformation modes through numerical FE simulations in Abaqus. For the elasto-plastic response of the individual grains, single crystal plasticity based plastic potential functions are proposed as Abaqus material definitions. The plastic potential functions are derived using the Facet method for individual phases in the microstructure at the level of single grains. The proposed method is a new modeling framework and the results presented in terms of macroscopic flow curves are based on the building blocks of the approach, while the model would eventually facilitate the construction of an anisotropic yield locus of the underlying multi-phase microstructure derived from a crystal plasticity based framework.
Exploiting Parallel R in the Cloud with SPRINT

PubMed Central

Piotrowski, M.; McGilvary, G.A.; Sloan, T. M.; Mewissen, M.; Lloyd, A.D.; Forster, T.; Mitchell, L.; Ghazal, P.; Hill, J.

2012-01-01

Background Advances in DNA Microarray devices and next-generation massively parallel DNA sequencing platforms have led to an exponential growth in data availability but the arising opportunities require adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need. Objectives Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates: setting up and running SPRINT-enabled genomic analyses on Amazon’s Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world and, if resource underutilization can improve application performance. Methods The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various size on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences. Results It is possible to obtain good, scalable performance but the level of improvement is dependent upon the nature of algorithm. Resource underutilization can further improve the time to result. End-user’s location impacts on costs due to factors such as local taxation. Conclusions: Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provides an interesting alternative and provides new possibilities for smaller organisations with limited funds. PMID:23223611
Exploiting parallel R in the cloud with SPRINT.

PubMed

Piotrowski, M; McGilvary, G A; Sloan, T M; Mewissen, M; Lloyd, A D; Forster, T; Mitchell, L; Ghazal, P; Hill, J

2013-01-01

Advances in DNA Microarray devices and next-generation massively parallel DNA sequencing platforms have led to an exponential growth in data availability but the arising opportunities require adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need. Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates: setting up and running SPRINT-enabled genomic analyses on Amazon's Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world and, if resource underutilization can improve application performance. The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various size on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences. It is possible to obtain good, scalable performance but the level of improvement is dependent upon the nature of the algorithm. Resource underutilization can further improve the time to result. End-user's location impacts on costs due to factors such as local taxation. Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provides an interesting alternative and provides new possibilities for smaller organisations with limited funds.
Highly efficient spatial data filtering in parallel using the opensource library CPPPO

NASA Astrophysics Data System (ADS)

Municchi, Federico; Goniva, Christoph; Radl, Stefan

2016-10-01

CPPPO is a compilation of parallel data processing routines developed with the aim to create a library for "scale bridging" (i.e. connecting different scales by mean of closure models) in a multi-scale approach. CPPPO features a number of parallel filtering algorithms designed for use with structured and unstructured Eulerian meshes, as well as Lagrangian data sets. In addition, data can be processed on the fly, allowing the collection of relevant statistics without saving individual snapshots of the simulation state. Our library is provided with an interface to the widely-used CFD solver OpenFOAM®, and can be easily connected to any other software package via interface modules. Also, we introduce a novel, extremely efficient approach to parallel data filtering, and show that our algorithms scale super-linearly on multi-core clusters. Furthermore, we provide a guideline for choosing the optimal Eulerian cell selection algorithm depending on the number of CPU cores used. Finally, we demonstrate the accuracy and the parallel scalability of CPPPO in a showcase focusing on heat and mass transfer from a dense bed of particles.
Large-scale quantum transport calculations for electronic devices with over ten thousand atoms

NASA Astrophysics Data System (ADS)

Lu, Wenchang; Lu, Yan; Xiao, Zhongcan; Hodak, Miro; Briggs, Emil; Bernholc, Jerry

The non-equilibrium Green's function method (NEGF) has been implemented in our massively parallel DFT software, the real space multigrid (RMG) code suite. Our implementation employs multi-level parallelization strategies and fully utilizes both multi-core CPUs and GPU accelerators. Since the cost of the calculations increases dramatically with the number of orbitals, an optimal basis set is crucial for including a large number of atoms in the ``active device'' part of the simulations. In our implementation, the localized orbitals are separately optimized for each principal layer of the device region, in order to obtain an accurate and optimal basis set. As a large example, we calculated the transmission characteristics of a Si nanowire p-n junction. The nanowire is along (110) direction in order to minimize the number dangling bonds that are saturated by H atoms. Its diameter is 3 nm. The length of 24 nm is necessary because of the long-range screening length in Si. Our calculations clearly show the I-V characteristics of a diode, i.e., the current increases exponentially with forward bias and is near zero with backward bias. Other examples will also be presented, including three-terminal transistors and large sensor structures.
Multi-resonant electromagnetic shunt in base isolation for vibration damping and energy harvesting

NASA Astrophysics Data System (ADS)

Pei, Yalu; Liu, Yilun; Zuo, Lei

2018-06-01

This paper investigates multi-resonant electromagnetic shunts applied to base isolation for dual-function vibration damping and energy harvesting. Two multi-mode shunt circuit configurations, namely parallel and series, are proposed and optimized based on the H2 criteria. The root-mean-square (RMS) value of the relative displacement between the base and the primary structure is minimized. Practically, this will improve the safety of base-isolated buildings subjected the broad bandwidth ground acceleration. Case studies of a base-isolated building are conducted in both the frequency and time domains to investigate the effectiveness of multi-resonant electromagnetic shunts under recorded earthquake signals. It shows that both multi-mode shunt circuits outperform traditional single mode shunt circuits by suppressing the first and the second vibration modes simultaneously. Moreover, for the same stiffness ratio, the parallel shunt circuit is more effective at harvesting energy and suppressing vibration, and can more robustly handle parameter mistuning than the series shunt circuit. Furthermore, this paper discusses experimental validation of the effectiveness of multi-resonant electromagnetic shunts for vibration damping and energy harvesting on a scaled-down base isolation system.
Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes

NASA Technical Reports Server (NTRS)

Wissink, Andrew; Allen, Edwin (Technical Monitor)

1998-01-01

Viscous calculations about geometrically complex bodies in which there is relative motion between component parts is one of the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first pan of the presentation will focus on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multi-processors. The second part of the presentation will cover some recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed- memory multiprocessor computer architectures will be reported.
Modelling Detailed-Chemistry Effects on Turbulent Diffusion Flames using a Parallel Solution-Adaptive Scheme

NASA Astrophysics Data System (ADS)

Jha, Pradeep Kumar

Capturing the effects of detailed-chemistry on turbulent combustion processes is a central challenge faced by the numerical combustion community. However, the inherent complexity and non-linear nature of both turbulence and chemistry require that combustion models rely heavily on engineering approximations to remain computationally tractable. This thesis proposes a computationally efficient algorithm for modelling detailed-chemistry effects in turbulent diffusion flames and numerically predicting the associated flame properties. The cornerstone of this combustion modelling tool is the use of parallel Adaptive Mesh Refinement (AMR) scheme with the recently proposed Flame Prolongation of Intrinsic low-dimensional manifold (FPI) tabulated-chemistry approach for modelling complex chemistry. The effect of turbulence on the mean chemistry is incorporated using a Presumed Conditional Moment (PCM) approach based on a beta-probability density function (PDF). The two-equation k-w turbulence model is used for modelling the effects of the unresolved turbulence on the mean flow field. The finite-rate of methane-air combustion is represented here by using the GRI-Mech 3.0 scheme. This detailed mechanism is used to build the FPI tables. A state of the art numerical scheme based on a parallel block-based solution-adaptive algorithm has been developed to solve the Favre-averaged Navier-Stokes (FANS) and other governing partial-differential equations using a second-order accurate, fully-coupled finite-volume formulation on body-fitted, multi-block, quadrilateral/hexahedral mesh for two-dimensional and three-dimensional flow geometries, respectively. A standard fourth-order Runge-Kutta time-marching scheme is used for time-accurate temporal discretizations. Numerical predictions of three different diffusion flames configurations are considered in the present work: a laminar counter-flow flame; a laminar co-flow diffusion flame; and a Sydney bluff-body turbulent reacting flow. Comparisons are made between the predicted results of the present FPI scheme and Steady Laminar Flamelet Model (SLFM) approach for diffusion flames. The effects of grid resolution on the predicted overall flame solutions are also assessed. Other non-reacting flows have also been considered to further validate other aspects of the numerical scheme. The present schemes predict results which are in good agreement with published experimental results and reduces the computational cost involved in modelling turbulent diffusion flames significantly, both in terms of storage and processing time.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.