thread level parallelism: Topics by Science.gov

Sample records for thread level parallelism

A Review of Lightweight Thread Approaches for High Performance Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Castello, Adrian; Pena, Antonio J.; Seo, Sangmin

High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads work well with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massively parallel architectures, and conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared, offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of commonly found patterns in current parallel codes. Moreover, we study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns and that those LWT libraries perform better than OS threads-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.
Data preprocessing for determining outer/inner parallelization in the nested loop problem using OpenMP

NASA Astrophysics Data System (ADS)

Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.

2017-07-01

Multi-thread programming using OpenMP on a shared-memory architecture with hyperthreading technology allows resources to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, the speedup depends on the processor's ability to execute only a limited number of threads, especially for a sequential algorithm that contains a nested loop in which the number of outer loop iterations is greater than the maximum number of threads that a processor can execute. The thread distribution technique found previously can only be applied by high-level programmers. This paper provides a parallelization procedure for low-level programmers dealing with 2-level nested loop problems in which the maximum number of threads that a processor can execute is smaller than the number of outer loop iterations. Data preprocessing, based on the number of outer and inner loop iterations, the computational time required to execute each iteration, and the maximum number of threads that a processor can execute, is used as a strategy to determine which parallel region will produce the optimal speedup.
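As a rough illustration of the kind of preprocessing this abstract describes, the sketch below estimates the run time of parallelizing the outer loop versus the inner loop from the iteration counts, a per-iteration cost, and the maximum thread count. The cost model (uniform iteration time, a fixed fork/join overhead per parallel region) and all function and parameter names are assumptions made for illustration; they are not the authors' procedure.

```python
import math

def estimate_times(n_outer, n_inner, t_iter, max_threads, overhead=0.0):
    """Return (outer_time, inner_time) under a simple, assumed cost model.

    Every inner iteration costs t_iter; entering a parallel region costs
    `overhead` (a hypothetical fork/join cost).
    """
    # Outer loop parallel: one region, each thread runs whole inner loops.
    outer_rounds = math.ceil(n_outer / max_threads)
    outer_time = overhead + outer_rounds * n_inner * t_iter
    # Inner loop parallel: one region per outer iteration.
    inner_rounds = math.ceil(n_inner / max_threads)
    inner_time = n_outer * (overhead + inner_rounds * t_iter)
    return outer_time, inner_time

def choose_parallel_loop(n_outer, n_inner, t_iter, max_threads, overhead=0.0):
    """Pick the loop whose parallelization has the lower estimated time."""
    outer_time, inner_time = estimate_times(
        n_outer, n_inner, t_iter, max_threads, overhead)
    return "outer" if outer_time <= inner_time else "inner"
```

With 4 threads, 5 outer and 8 inner iterations, the outer loop needs two rounds of 8 iterations while the inner loop divides evenly, so this model prefers the inner loop. Swapping the counts reverses the choice.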
Vectorization for Molecular Dynamics on Intel Xeon Phi Coprocessors

NASA Astrophysics Data System (ADS)

Yi, Hongsuk

2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512-bit vector registers for high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Tersoff potentials for covalently bonded solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism: it combines tightly coupled thread-level and task-level parallelism with the 512-bit vector registers. The simulation results show that the parallel performance of the SIMD implementations on Xeon Phi is apparently superior to that on x86 CPU architectures.
Evict on write, a management strategy for a prefetch unit and/or first level cache in a multiprocessor system with speculative execution

DOEpatents

Gara, Alan; Ohmacht, Martin

2014-09-16

In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write-through, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data, where more than one speculative thread is running in parallel, while the first level cache does not have any of the versions during speculation. A switch allows choosing between modes of operation of a speculation-blind first level cache.
Constant time worker thread allocation via configuration caching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Eichenberger, Alexandre E; O'Brien, John K. P.

Mechanisms are provided for allocating threads for execution of a parallel region of code. A request for allocation of worker threads to execute the parallel region of code is received from a master thread. Cached thread allocation information identifying prior thread allocations that have been performed for the master thread is accessed. Worker threads are allocated to the master thread based on the cached thread allocation information. The parallel region of code is executed using the allocated worker threads.
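The reuse idea in this abstract can be sketched as a small allocator that remembers the last team handed to each master thread. The class, its methods, and the cache key are hypothetical; the abstract does not describe the actual mechanism, and this sketch makes no claim of constant-time behavior.

```python
class WorkerAllocator:
    """Caches the last worker team per (master thread, team size)."""

    def __init__(self, total_workers):
        self.free = set(range(total_workers))  # ids of idle workers
        self.cache = {}                        # (master_id, n) -> team tuple
        self.hits = 0                          # number of cached reuses

    def allocate(self, master_id, n):
        key = (master_id, n)
        team = self.cache.get(key)
        if team is not None and self.free.issuperset(team):
            self.hits += 1                     # fast path: reuse prior team
        else:
            if n > len(self.free):
                raise RuntimeError("not enough idle workers")
            team = tuple(sorted(self.free)[:n])  # slow path: build a team
            self.cache[key] = team
        self.free.difference_update(team)      # mark the team as busy
        return team

    def release(self, team):
        """Return a team's workers to the idle pool after the region ends."""
        self.free.update(team)
```

A master that repeatedly enters parallel regions of the same size would take the fast path on every request after the first, provided its previous workers are idle again.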
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta, J.; Gimenez, J.; Caubet, J.

2003-01-01

Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.

In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphic Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread level parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and only relies on the high memory bandwidth of the GPU. The first and second solutions only support partial pivoting; the third one easily supports partial and full pivoting, making it attractive for problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as a function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gosink, Luke; Wu, Kesheng; Bethel, E. Wes

2009-06-02

The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).
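A minimal sketch of a bin-based index for range queries follows, assuming half-open bins defined by ascending edges. It is not the DP-BIS implementation: the function names are hypothetical, the data encoding is omitted, and one task is issued per bin rather than one thread per record. Python threads also do not run CPU-bound work in parallel under the usual interpreter lock, so the thread pool here only illustrates the structure.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

def build_index(values, edges):
    """Group (record id, value) pairs into bins.

    Bin i holds values v with edges[i] <= v < edges[i + 1].
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for rid, v in enumerate(values):
        b = bisect_right(edges, v) - 1
        if not 0 <= b < len(bins):
            raise ValueError(f"value {v} lies outside the bin edges")
        bins[b].append((rid, v))
    return bins

def range_query(bins, edges, lo, hi, workers=4):
    """Return the sorted record ids whose value satisfies lo <= v < hi."""
    def scan(i):
        b_lo, b_hi = edges[i], edges[i + 1]
        if b_hi <= lo or b_lo >= hi:
            return []                                # bin outside the range
        if lo <= b_lo and b_hi <= hi:
            return [rid for rid, _ in bins[i]]       # bin fully inside
        return [rid for rid, v in bins[i] if lo <= v < hi]  # boundary bin
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(scan, range(len(bins))))
    return sorted(rid for part in parts for rid in part)
```

Only boundary bins need their stored values inspected; bins fully inside or outside the query range are resolved from the bin number alone, which is the property that keeps the amount of data touched small.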
Multi-threading: A new dimension to massively parallel scientific computation

NASA Astrophysics Data System (ADS)

Nielsen, Ida M. B.; Janssen, Curtis L.

2000-06-01

Multi-threading is becoming widely available for Unix-like operating systems, and the application of multi-threading opens new ways for performing parallel computations with greater efficiency. We here briefly discuss the principles of multi-threading and illustrate the application of multi-threading for a massively parallel direct four-index transformation of electron repulsion integrals. Finally, other potential applications of multi-threading in scientific computing are outlined.
SMT-Aware Instantaneous Footprint Optimization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Roy, Probir; Liu, Xu; Song, Shuaiwen

Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the whole memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging, because they usually spawn threads within Single Program Multiple Data (SPMD) models. To address this important issue, we introduce a simple scheme for SMT-aware code optimization, which aims to reduce the memory contention across SMT threads.
Does Simultaneous Liposuction Adversely Affect the Outcome of Thread Lifts? A Preliminary Result.

PubMed

Lee, Yong Woo; Park, Tae Hwan

2018-04-11

Along with advances in thread lift techniques and materials, ancillary procedures such as fat grafting, liposuction, or filler injections have been performed simultaneously. Some surgeons think that these ancillary procedures might affect the aesthetic outcomes of thread lifting possibly due to inadvertent injury to threads or loosening of soft tissue via passing the cannula in the surgical plane of the thread lifts. The purpose of the current study is to determine the effect of such ancillary procedures on the outcome of thread lifts in the human and cadaveric setting. We used human abdominal tissue after abdominoplasty and cadaveric faces. In the abdominal tissue, liposuction parallel to the parallel axis was performed in one area for 5 min. We counted 30 passes when liposuction was performed in one direction. This was repeated as we changed the direction of passages. The plane of thread lifts (dermal vs subcutaneous) and angle between liposuction and thread lifts (parallel vs perpendicular) were differentiated in this abdominal tissue study group. Then, we performed parallel or perpendicular thread lifts using a small slit incision. Using a tensiometer, the maximum holding strength was measured when pulling the thread out of the skin as much as possible. We also used faces of cadavers to prove whether the finding in human abdominal tissue is really valid with corresponding techniques. Our pilot study using abdominal tissue showed that liposuction after thread lifts adversely affects it regardless of the vector of thread lifts. In the cadaveric study, however, liposuction prior to thread lifting does not significantly affect the holding strength of thread lifts. Liposuction or fat grafting in the appropriate layer would not be a hurdle to safely performing simultaneous thread lifts if the target lift tissue is intra-SMAS or just above the SMAS layer.
Thread concept for automatic task parallelization in image analysis

NASA Astrophysics Data System (ADS)

Lueckenhaus, Maximilian; Eckstein, Wolfgang

1998-09-01

Parallel processing of image analysis tasks is an essential method to speed up image processing and helps to exploit the full capacity of distributed systems. However, writing parallel code is a difficult and time-consuming process and often leads to an architecture-dependent program that has to be re-implemented when changing the hardware. Therefore it is highly desirable to do the parallelization automatically. For this we have developed a special kind of thread concept for image analysis tasks. Threads derived from one subtask may share objects and run in the same context but may process different threads of execution and work on different data in parallel. In this paper we describe the basics of our thread concept and show how it can be used as the basis of an automatic task parallelization to speed up image processing. We further illustrate the design and implementation of an agent-based system that uses image analysis threads for generating and processing parallel programs by taking into account the available hardware. The tests made with our system prototype show that the thread concept combined with the agent paradigm is suitable to speed up image processing by an automatic parallelization of image analysis tasks.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schnorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread version and about 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
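For context, the reported speed-up factors can be converted into parallel efficiency using the standard definition, efficiency equals speed-up divided by thread count. The efficiency figures below are derived here from the abstract's rounded numbers and are not stated by the authors.

```latex
E_p = \frac{S_p}{p}, \qquad
E_2 \approx \frac{1.8}{2} = 0.90, \qquad
E_4 \approx \frac{3.2}{4} = 0.80
```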
Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR

NASA Astrophysics Data System (ADS)

Carroll-Nellenback, Jonathan J.; Shroyer, Brandon; Frank, Adam; Ding, Chen

2013-03-01

Current adaptive mesh refinement (AMR) simulations require algorithms that are highly parallelized and manage memory efficiently. As compute engines grow larger, AMR simulations will require algorithms that achieve new levels of efficient parallelization and memory management. We have attempted to employ new techniques to achieve both of these goals. Patch or grid based AMR often employs ghost cells to decouple the hyperbolic advances of each grid on a given refinement level. This decoupling allows each grid to be advanced independently. In AstroBEAR we utilize this independence by threading the grid advances on each level with preference going to the finer level grids. This allows for global load balancing instead of level by level load balancing and allows for greater parallelization across both physical space and AMR level. Threading of level advances can also improve performance by interleaving communication with computation, especially in deep simulations with many levels of refinement. While we see improvements of up to 30% on deep simulations run on a few cores, the speedup is typically more modest (5-20%) for larger scale simulations. To improve memory management we have employed a distributed tree algorithm that requires processors to only store and communicate local sections of the AMR tree structure with neighboring processors. Using this distributed approach we are able to get reasonable scaling efficiency (>80%) out to 12288 cores and up to 8 levels of AMR - independent of the use of threading.
IOPA: I/O-aware parallelism adaption for parallel programs

PubMed Central

Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei

2017-01-01

With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads. PMID:28278236
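The trade-off this abstract describes, too many I/O threads saturating the I/O sub-system and too few leaving resources idle, can be illustrated with a simple hill-climbing search over the thread count. This is not IOPA's algorithm or programming interface; `tune` and the `measure` callback are hypothetical names, and the throughput curve in the usage note is synthetic.

```python
def tune(measure, start=1, steps=20, min_threads=1, max_threads=64):
    """Search for a thread count with high throughput by hill climbing.

    `measure(n)` is a caller-supplied function (an assumption of this
    sketch) that runs the workload with n I/O threads and returns the
    observed throughput.
    """
    threads, direction = start, 1
    prev = measure(threads)
    best_threads, best_tp = threads, prev
    for _ in range(steps):
        candidate = max(min_threads, min(max_threads, threads + direction))
        tp = measure(candidate)
        if tp > best_tp:
            best_threads, best_tp = candidate, tp
        if tp < prev:
            direction = -direction  # throughput dropped: turn around
        threads, prev = candidate, tp
    return best_threads
```

With a synthetic curve such as `lambda n: min(n, 8) * 100 - 20 * max(0, n - 8)`, which rises until 8 threads and then degrades, the search settles on 8. A real controller would also have to cope with measurement noise and workload phases, which this sketch ignores.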
On the utility of threads for data parallel programming

NASA Technical Reports Server (NTRS)

Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush

1995-01-01

Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily through the use of overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.
Ropes: Support for collective operations among distributed threads

NASA Technical Reports Server (NTRS)

Haines, Matthew; Mehrotra, Piyush; Cronk, David

1995-01-01

Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. Recently, systems have been designed and implemented to support interprocessor communication between lightweight threads so that threads can be exploited in a distributed memory system. Their use, in this setting, has been largely restricted to supporting latency hiding techniques and functional parallelism within a single application. However, to execute data parallel codes independent of other threads in the system, collective operations and relative indexing among threads are required. This paper describes the design of ropes: a scoping mechanism for collective operations and relative indexing among threads. We present the design of ropes in the context of the Chant system, and provide performance results evaluating our initial design decisions.
Automatic Thread-Level Parallelization in the Chombo AMR Library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Christen, Matthias; Keen, Noel; Ligocki, Terry

2011-05-26

The increasing on-chip parallelism has some substantial implications for HPC applications. Currently, hybrid programming models (typically MPI+OpenMP) are employed for mapping software to the hardware in order to leverage the hardware's architectural features. In this paper, we present an approach that automatically introduces thread level parallelism into Chombo, a parallel adaptive mesh refinement framework for finite difference type PDE solvers. In Chombo, core algorithms are specified in ChomboFortran, a macro language extension to F77 that is part of the Chombo framework. This domain-specific language forms an already used target language for an automatic migration of the large number of existing algorithms into a hybrid MPI+OpenMP implementation. It also provides access to the auto-tuning methodology that enables tuning certain aspects of an algorithm to hardware characteristics. Performance measurements are presented for a few of the most relevant kernels with respect to a specific application benchmark using this technique as well as benchmark results for the entire application. The kernel benchmarks show that, using auto-tuning, up to a factor of 11 in performance was gained with 4 threads with respect to the serial reference implementation.
Expressing Parallelism with ROOT

NASA Astrophysics Data System (ADS)

Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.

2017-10-01

The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores

NASA Astrophysics Data System (ADS)

Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei

We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
Parallel approach for bioinspired algorithms

NASA Astrophysics Data System (ADS)

Zaporozhets, Dmitry; Zaruba, Daria; Kulieva, Nina

2018-05-01

In the paper, a probabilistic parallel approach based on a population heuristic, such as a genetic algorithm, is suggested. The authors propose using a multithreading approach at the micro level, at which new alternative solutions are generated. On each iteration, several threads can be started that independently use the same population to generate new solutions. After all threads have finished, a selection operator combines the obtained results into the new population. To confirm the effectiveness of the suggested approach, the authors have developed software with which experimental computations can be carried out. The authors have considered a classic optimization problem – finding a Hamiltonian cycle in a graph. Experiments show that, due to the parallel approach at the micro level, an increase in running speed can be obtained on graphs with 250 or more vertices.
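The iteration scheme described above (several threads read the same population, each generates new candidates, and a selection operator merges the results) can be sketched as follows. To keep the example short it swaps in a toy objective, counting ones in a bit string, instead of the Hamiltonian cycle problem; the operators, parameters, and function names are likewise assumptions and not the authors' software.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fitness(ind):
    """Toy objective (assumed for this sketch): number of ones."""
    return sum(ind)

def make_offspring(population, n_children, seed):
    """Worker: read the shared population and return new candidates."""
    rng = random.Random(seed)  # per-task generator keeps runs reproducible
    children = []
    for _ in range(n_children):
        a, b = rng.sample(population, 2)
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]                  # one-point crossover
        i = rng.randrange(len(child))
        child = child[:i] + (1 - child[i],) + child[i + 1:]  # bit flip
        children.append(child)
    return children

def evolve(population, generations=30, n_threads=4, n_children=10, seed=0):
    """Evolve a list of equal-length bit tuples; returns the new list."""
    size = len(population)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for g in range(generations):
            seeds = [seed + g * n_threads + t for t in range(n_threads)]
            batches = list(pool.map(make_offspring,
                                    [population] * n_threads,
                                    [n_children] * n_threads,
                                    seeds))
            merged = population + [c for batch in batches for c in batch]
            # Selection operator: keep the fittest `size` individuals.
            population = sorted(merged, key=fitness, reverse=True)[:size]
    return population
```

Because the workers only read the population and the merge happens after all of them return, no locking is needed inside an iteration. Keeping the parents in the merged pool means the best fitness never decreases from one generation to the next.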
Using a source-to-source transformation to introduce multi-threading into the AliRoot framework for a parallel event reconstruction

NASA Astrophysics Data System (ADS)

Lohn, Stefan B.; Dong, Xin; Carminati, Federico

2012-12-01

Chip multiprocessors are going to support massive parallelism through many additional physical and logical cores. Performance can no longer be improved by increasing the clock frequency because the technical limits have almost been reached. Instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, and the bandwidth of the memory bus or of links between cores and sockets are not going to improve as fast. Hence, parallelism can only result in performance gains if memory usage is optimized and communication between threads is minimized. Besides, concurrent programming has become a domain for experts. Implementing multi-threading is error-prone and labor-intensive. A full reimplementation of the whole AliRoot source code is unaffordable. This paper describes the effort to evaluate the adaptation of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation that addresses the problems described above and provides a straightforward way of parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread safety will be introduced to enable the original sequential, thread-unaware source code to use multi-threading. Afterwards, further investigations have to be performed to identify classes that are useful to share among threads. Then, in a second step, the transformation has to change the code to share these classes and finally verify whether any invalid interference between threads remains.
A C++ Thread Package for Concurrent and Parallel Programming

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jie Chen; William Watson

1999-11-01

Recently, thread libraries have become a common entity on various operating systems such as Unix, Windows NT and VxWorks. Those thread libraries offer significant performance enhancement by allowing applications to use multiple threads running either concurrently or in parallel on multiprocessors. However, the incompatibilities between native libraries introduce challenges for those who wish to develop portable applications.
Characterizing and Mitigating Work Time Inflation in Task Parallel Programs

DOE PAGES

Olivier, Stephen L.; de Supinski, Bronis R.; Schulz, Martin; ...

2013-01-01

Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.
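The definition of work time inflation given in this abstract translates directly into a small calculation. The sketch assumes that each thread's time spent doing useful work has already been measured separately from idle time and scheduling overhead; the function name and the sample timings in the usage note are made up for illustration.

```python
def work_time_inflation(thread_work_times, sequential_time):
    """Return (absolute inflation, inflation ratio).

    thread_work_times: per-thread time spent executing work, excluding
    idleness and scheduling overhead (assumed to be measured separately).
    sequential_time: time to perform the same work sequentially.
    """
    total_work = sum(thread_work_times)
    return total_work - sequential_time, total_work / sequential_time
```

For example, four threads that together spend 13.0 s on work that takes 10.0 s sequentially show 3.0 s of inflation, a ratio of 1.3.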
Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.

The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it with respect to a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.
Roofline model toolkit: A practical tool for architectural and program analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian

We present preliminary results of the Roofline Toolkit for multicore, many core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
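For readers unfamiliar with the Roofline model mentioned above, its basic bound can be sketched in a few lines: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and peak memory bandwidth. The machine numbers below are hypothetical and are not measurements from the toolkit or from Blue Gene/Q.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Upper bound on performance (GFLOP/s) under the basic Roofline model.

    arithmetic_intensity: floating-point operations per byte moved.
    peak_gflops: peak compute rate of the machine in GFLOP/s.
    peak_bandwidth_gbs: peak memory bandwidth in GB/s.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


# Hypothetical machine: 200 GFLOP/s peak, 25 GB/s memory bandwidth.
PEAK_GFLOPS = 200.0
PEAK_BW_GBS = 25.0

for ai in (0.5, 8.0, 16.0):
    bound = roofline_gflops(ai, PEAK_GFLOPS, PEAK_BW_GBS)
    print(f"AI = {ai:5.1f} flop/byte -> at most {bound:6.1f} GFLOP/s")
```

Kernels to the left of the ridge point (here 8 flop/byte) are bandwidth-bound under this model, and kernels to the right are compute-bound.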
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

NASA Technical Reports Server (NTRS)

Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele

2004-01-01

In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms and discuss OpenMP implementation issues which affect the performance of multi-level parallel applications.
Real-time SHVC software decoding with multi-threaded parallel processing

NASA Astrophysics Data System (ADS)

Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

2014-09-01

This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads are compared in terms of decoding speed and resource usage, including processor and memory.
Advanced Numerical Techniques of Performance Evaluation. Volume 1

DTIC Science & Technology

1990-06-01

system scheduling thread. The scheduling thread then runs any other ready thread that can be found. A thread can only sleep or switch out on itself...Polychronopoulos and D.J. Kuck. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers C...Kuck 1987] C.D. Polychronopoulos and D.J. Kuck. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Trans. on Comp
Parallel Implementation of 3-D Iterative Reconstruction With Intra-Thread Update for the jPET-D4

NASA Astrophysics Data System (ADS)

Lam, Chih Fung; Yamaya, Taiga; Obi, Takashi; Yoshida, Eiji; Inadama, Naoko; Shibuya, Kengo; Nishikido, Fumihiko; Murayama, Hideo

2009-02-01

One way to speed up iterative image reconstruction is by parallel computing with a computer cluster. However, as the number of computing threads increases, parallel efficiency decreases due to network transfer delay. In this paper, we propose a method to reduce data transfer between computing threads by introducing an intra-thread update. The update factor is collected from each slave thread and a global image is updated as usual in the first K sub-iterations. In the rest of the sub-iterations, the global image is only updated at an interval which is controlled by a parameter L. Within that interval, the intra-thread update is carried out, whereby an image update is performed locally in each slave thread. We investigated combinations of K and L parameters based on a parallel implementation of RAMLA for the jPET-D4 scanner. Our evaluation used four workstations with a total of 16 slave threads. Each slave thread calculated a different set of LORs, which are divided according to ring difference numbers. We assessed the image quality of the proposed method with a hotspot simulation phantom. The figures of merit were the full-width-half-maximum of hotspots and the background normalized standard deviation. At an optimum K and L setting, we did not find a significant change in the output images. We also applied the proposed method to a Hoffman phantom experiment and found that the difference due to the intra-thread update was negligible. With the intra-thread update, computation time could be reduced by about 23%.
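The role of the K and L parameters can be made concrete with a small schedule generator. This is only a sketch: the abstract does not state exactly which sub-iteration inside each interval triggers the global update, so the rule used here (a global update at the end of every L-th sub-iteration after the first K) is an assumption, and the function name is hypothetical.

```python
def update_schedule(num_sub_iterations, K, L):
    """Label each sub-iteration as a 'global' or 'local' image update.

    Assumed rule (not confirmed by the source): the first K sub-iterations
    update the global image; afterwards the global image is updated once
    every L sub-iterations, and all others are intra-thread (local) updates.
    """
    schedule = []
    for i in range(num_sub_iterations):
        if i < K or (i - K + 1) % L == 0:
            schedule.append("global")
        else:
            schedule.append("local")
    return schedule


schedule = update_schedule(num_sub_iterations=8, K=2, L=3)
print(schedule)
print("global updates needing data transfer:", schedule.count("global"))
```

Only the sub-iterations marked as global require collecting update factors from all threads, which is where the reduction in network transfer would come from.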
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

DOE Office of Scientific and Technical Information (OSTI.GOV)

Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.

Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neighboring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.
Efficient Thread Labeling for Monitoring Programs with Nested Parallelism

NASA Astrophysics Data System (ADS)

Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee

It is difficult and cumbersome to detect data races that occur in an execution of parallel programs. Any on-the-fly race detection technique using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth for every fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent threads or by updating the pointer in a constant amount of time and space. This labeling is more efficient than the NR labeling, because its efficiency does not depend on the nesting depth for every fork and join operation. Some experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelisms varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.
Multicore Challenges and Benefits for High Performance Scientific Computing

DOE PAGES

Nielsen, Ida M. B.; Janssen, Curtis L.

2008-01-01

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
SKIRT: Hybrid parallelization of radiative transfer simulations

NASA Astrophysics Data System (ADS)

Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

2017-07-01

We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
Topical perspective on massive threading and parallelism.

PubMed

Farber, Robert M

2011-09-01

Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive-threading and massive-parallelism such as the 1.3 million threads Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world--is how to capitalize on these new massively threaded computational architectures--especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelism with some insight into the future. Published by Elsevier Inc.
Multi-threaded parallel simulation of non-local non-linear problems in ultrashort laser pulse propagation in the presence of plasma

NASA Astrophysics Data System (ADS)

Baregheh, Mandana; Mezentsev, Vladimir; Schmitz, Holger

2011-06-01

We describe a parallel multi-threaded approach for high performance modelling of a wide class of phenomena in ultrafast nonlinear optics. A specific implementation has been performed using the highly parallel capabilities of a programmable graphics processor.
Nebo: An efficient, parallel, and portable domain-specific language for numerically solving partial differential equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Earl, Christopher; Might, Matthew; Bagusetty, Abhishek

This study presents Nebo, a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena on multiple architectures. Application programmers use Nebo to write code that appears sequential but can be run in parallel, without editing the code. Currently Nebo supports single-thread execution, multi-thread execution, and many-core (GPU-based) execution. With single-thread execution, Nebo performs on par with code written by domain experts. With multi-thread execution, Nebo can linearly scale (with roughly 90% efficiency) up to 12 cores, compared to its single-thread execution. Moreover, Nebo’s many-core execution can be over 140x faster than its single-thread execution.

Nebo: An efficient, parallel, and portable domain-specific language for numerically solving partial differential equations

DOE PAGES

Earl, Christopher; Might, Matthew; Bagusetty, Abhishek; ...

2016-01-26

This study presents Nebo, a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena on multiple architectures. Application programmers use Nebo to write code that appears sequential but can be run in parallel, without editing the code. Currently Nebo supports single-thread execution, multi-thread execution, and many-core (GPU-based) execution. With single-thread execution, Nebo performs on par with code written by domain experts. With multi-thread execution, Nebo can linearly scale (with roughly 90% efficiency) up to 12 cores, compared to its single-thread execution. Moreover, Nebo’s many-core execution can be over 140x faster than its single-thread execution.
Final report on EURAMET.L-S21: `Supplementary comparison of parallel thread gauges'

NASA Astrophysics Data System (ADS)

Mudronja, Vedran; Šimunovic, Vedran; Acko, Bojan; Matus, Michael; Bánréti, Edit; István, Dicso; Thalmann, Rudolf; Lassila, Antti; Lillepea, Lauri; Bartolo Picotto, Gian; Bellotti, Roberto; Pometto, Marco; Ganioglu, Okhan; Meral, Ilker; Salgado, José Antonio; Georges, Vailleau

2015-01-01

The results of the comparison of parallel thread gauges between ten European countries are presented. Three thread plugs and three thread rings were calibrated in one loop. Croatian National Laboratory for Length (HMI/FSB-LPMD) acted as the coordinator and pilot laboratory of the comparison. Thread angle, thread pitch, simple pitch diameter and pitch diameter were measured. Pitch diameters were calibrated within 1a, 2a, 1b and 2b calibration categories in accordance with the EURAMET cg-10 calibration guide. The good agreement between the measurement results, and the differences due to the different calibration categories, are analysed in this paper. This comparison was the first EURAMET comparison of parallel thread gauges based on the EURAMET cg-10 calibration guide, and has made a step towards the harmonization of future comparisons with the registration of CMC values for thread gauges. The full text of the final report appears in Appendix B of the BIPM key comparison database kcdb.bipm.org/. The final report has been peer-reviewed and approved for publication by the CCL, according to the provisions of the CIPM Mutual Recognition Arrangement (CIPM MRA).
Solving Large Problems Quickly: Progress in 2001-2003

NASA Technical Reports Server (NTRS)

Mowry, Todd C.; Colohan, Christopher B.; Brown, Angela Demke; Steffan, J. Gregory; Zhai, Antonia

2004-01-01

This document describes the progress we have made and the lessons we have learned in 2001 through 2003 under the NASA grant entitled "Solving Important Problems Faster". The long-term goal of this research is to accelerate large, irregular scientific applications which have enormous data sets and which are difficult to parallelize. To accomplish this goal, we are exploring two complementary techniques: (i) using compiler-inserted prefetching to automatically hide the I/O latency of accessing these large data sets from disk; and (ii) using thread-level data speculation to enable the optimistic parallelization of applications despite uncertainty as to whether data dependences exist between the resulting threads which would normally make them unsafe to execute in parallel. Overall, we made significant progress in 2001 through 2003, and the project has gone well.
A Spatiotemporal Aggregation Query Method Using Multi-Thread Parallel Technique Based on Regional Division

NASA Astrophysics Data System (ADS)

Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.

2015-07-01

Existing spatiotemporal databases support spatiotemporal aggregation queries over massive moving-object datasets. Due to the large amounts of data and the single-thread processing method, the query speed cannot meet the application requirements. On the other hand, the query efficiency is more sensitive to spatial variation than to temporal variation. In this paper, we propose a spatiotemporal aggregation query method using a multi-thread parallel technique based on regional division and implement it on the server. Concretely, we divide the spatiotemporal domain into several spatiotemporal cubes, compute the spatiotemporal aggregation on all cubes using multi-thread parallel processing, and then integrate the query results. Tests and analysis on real datasets show that this method improves the query speed significantly.
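The divide, aggregate, and integrate steps described in this abstract can be sketched as follows. This is an illustrative outline rather than the authors' implementation: the record layout `(x, y, t, value)`, the cube sizes, the use of a sum as the aggregate, and all function names are assumptions. Note also that in CPython a thread pool mainly illustrates the structure, since the global interpreter lock limits the speedup of CPU-bound work.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def cube_key(record, cell_size, time_span):
    """Map a record (x, y, t, value) to the index of its spatiotemporal cube."""
    x, y, t, _ = record
    return (int(x // cell_size), int(y // cell_size), int(t // time_span))


def partition(records, cell_size, time_span):
    """Divide the records among spatiotemporal cubes."""
    cubes = defaultdict(list)
    for record in records:
        cubes[cube_key(record, cell_size, time_span)].append(record)
    return cubes


def aggregate_cube(records):
    """Aggregate one cube; a sum of values is assumed as the aggregate."""
    return sum(value for _, _, _, value in records)


def parallel_aggregate(records, cell_size, time_span, workers=4):
    """Aggregate every cube in a worker thread, then integrate the results."""
    cubes = partition(records, cell_size, time_span)
    keys = list(cubes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda k: aggregate_cube(cubes[k]), keys))
    per_cube = dict(zip(keys, partials))
    return per_cube, sum(partials)


# Hypothetical moving-object samples: (x, y, t, value).
records = [(0.5, 0.5, 1, 2), (0.7, 0.2, 3, 3), (1.5, 0.5, 1, 5), (1.2, 1.9, 12, 7)]
per_cube, total = parallel_aggregate(records, cell_size=1.0, time_span=10)
print(per_cube, total)
```

Because each cube is aggregated independently, no locking is needed until the final integration step, which runs in the calling thread.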
Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design

NASA Astrophysics Data System (ADS)

Debes, Eric; Kaine, Greg

2002-11-01

In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 times faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using Intel® VTune™ performance analyzer. The experimental measurements are compared to theoretical results.
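The reported speedup of 3.2 on a 4-way machine can be turned into a parallel efficiency, and, under an additional modelling assumption, into an estimate of the serial fraction. The sketch below uses Amdahl's law for the second step; that law is brought in here for illustration and is not the theoretical model used in the paper.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / processors


def amdahl_serial_fraction(speedup, processors):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p).

    Solving for f gives f = (p / S - 1) / (p - 1). This assumes all lost
    speedup comes from serial code, which ignores memory and
    synchronization effects.
    """
    return (processors / speedup - 1.0) / (processors - 1.0)


efficiency = parallel_efficiency(3.2, 4)
serial_fraction = amdahl_serial_fraction(3.2, 4)
print(f"efficiency: {efficiency:.2f}")
print(f"implied serial fraction (Amdahl): {serial_fraction:.3f}")
```

Under these assumptions the encoder runs at 80% efficiency, which would correspond to roughly 8% of the work being serial.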
CPU/GPU Computing for an Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

NASA Astrophysics Data System (ADS)

Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

2016-06-01

CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.

Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
When the lowest energy does not induce native structures: parallel minimization of multi-energy values by hybridizing searching intelligences.

PubMed

Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou

2012-01-01

Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to be obtained. A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed while the searching threads are running in parallel, with each thread searching for the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving the PSP problem. This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligences embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise.
When the Lowest Energy Does Not Induce Native Structures: Parallel Minimization of Multi-Energy Values by Hybridizing Searching Intelligences

PubMed Central

Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou

2012-01-01

Background: Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to be obtained. Results: A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed while the searching threads are running in parallel, with each thread searching for the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving the PSP problem. Conclusions: This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligences embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise. PMID:23028708
(Re)engineering Earth System Models to Expose Greater Concurrency for Ultrascale Computing: Practice, Experience, and Musings

NASA Astrophysics Data System (ADS)

Mills, R. T.

2014-12-01

As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units is becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.
Processing communications events in parallel active messaging interface by awakening thread from wait state

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2013-10-22

Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Biswas, Rupak

1999-01-01

The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2000, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2000, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Multithreaded Stochastic PDES for Reactions and Diffusions in Neurons.

PubMed

Lin, Zhongwei; Tropper, Carl; Mcdougal, Robert A; Patoary, Mohammand Nazrul Ishlam; Lytton, William W; Yao, Yiping; Hines, Michael L

2017-07-01

Cells exhibit stochastic behavior when the number of molecules is small. Hence a stochastic reaction-diffusion simulator capable of working at scale can provide a more accurate view of molecular dynamics within the cell. This paper describes a parallel discrete event simulator, Neuron Time Warp-Multi Thread (NTW-MT), developed for the simulation of reaction diffusion models of neurons. To the best of our knowledge, this is the first parallel discrete event simulator oriented towards stochastic simulation of chemical reactions in a neuron. The simulator was developed as part of the NEURON project. NTW-MT is optimistic and thread-based, which attempts to capitalize on multi-core architectures used in high performance machines. It makes use of a multi-level queue for the pending event set and a single roll-back message in place of individual anti-messages to disperse contention and decrease the overhead of processing rollbacks. Global Virtual Time is computed asynchronously both within and among processes to get rid of the overhead for synchronizing threads. Memory usage is managed in order to avoid locking and unlocking when allocating and de-allocating memory and to maximize cache locality. We verified our simulator on a calcium buffer model. We examined its performance on a calcium wave model, comparing it to the performance of a process based optimistic simulator and a threaded simulator which uses a single priority queue for each thread. Our multi-threaded simulator is shown to achieve superior performance to these simulators. Finally, we demonstrated the scalability of our simulator on a larger CICR model and a more detailed CICR model.
Integrating end-to-end threads of control into object-oriented analysis and design

NASA Technical Reports Server (NTRS)

Mccandlish, Janet E.; Macdonald, James R.; Graves, Sara J.

1993-01-01

Current object-oriented analysis and design methodologies fall short in their use of mechanisms for identifying threads of control for the system being developed. The scenarios which typically describe a system are more global than looking at the individual objects and representing their behavior. Unlike conventional methodologies that use data flow and process-dependency diagrams, object-oriented methodologies do not provide a model for representing these global threads end-to-end. Tracing through threads of control is key to ensuring that a system is complete and timing constraints are addressed. The existence of multiple threads of control in a system necessitates a partitioning of the system into processes. This paper describes the application and representation of end-to-end threads of control to the object-oriented analysis and design process using object-oriented constructs. The issue of representation is viewed as a grouping problem, that is, how to group classes/objects at a higher level of abstraction so that the system may be viewed as a whole with both classes/objects and their associated dynamic behavior. Existing object-oriented development methodology techniques are extended by adding design-level constructs termed logical composite classes and process composite classes. Logical composite classes are design-level classes which group classes/objects both logically and by thread of control information. Process composite classes further refine the logical composite class groupings by using process partitioning criteria to produce optimum concurrent execution results. The goal of these design-level constructs is to ultimately provide the basis for a mechanism that can support the creation of process composite classes in an automated way. Using an automated mechanism makes it easier to partition a system into concurrently executing elements that can be run in parallel on multiple processors.
Performance evaluation of canny edge detection on a tiled multicore architecture

NASA Astrophysics Data System (ADS)

Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

2011-01-01

In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, programmer-implemented domain decomposition exhibits higher speed-ups than the compiler-managed loop-level parallelism implemented with OpenMP.
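Domain decomposition as described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-pixel threshold stage is a hypothetical stand-in for an edge-detection stage, and the strip layout and worker count are arbitrary. A real Canny stage reads neighbouring pixels, so its strips would need overlapping border rows, which this sketch leaves out.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(image, parts):
    """Split a list-of-rows image into `parts` contiguous row strips."""
    n = len(image)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [image[bounds[i]:bounds[i + 1]] for i in range(parts)]

def threshold_strip(strip, level=128):
    """Hypothetical per-pixel stage: mark pixels at or above `level` as 1."""
    return [[1 if px >= level else 0 for px in row] for row in strip]

def process_by_domain(image, workers=4):
    """Process each strip independently, then stitch the strips back in order."""
    strips = split_rows(image, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so strips come back top to bottom.
        processed = list(pool.map(threshold_strip, strips))
    return [row for strip in processed for row in strip]

# Small synthetic 10x8 image with values in 0..255.
image = [[(r * 37 + c * 11) % 256 for c in range(8)] for r in range(10)]
result = process_by_domain(image, workers=3)
```

Because the stage is purely per-pixel, the stitched result matches what a single call on the whole image would produce, which makes the decomposition easy to check.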
GPU COMPUTING FOR PARTICLE TRACKING

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nishimura, Hiroshi; Song, Kai; Muriki, Krishna

2011-03-25

This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of GPU requires a comprehensive understanding of the hardware and programming model to be able to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multiprocessors (MP) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.
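The arithmetic behind the core count and the one-thread-per-particle mapping can be written out as a short Python sketch. The multiprocessor and core counts come from the abstract; the block size of 256 threads and the particle count of 1000 used in the usage lines are assumptions chosen only for illustration, and the flat index follows the usual CUDA convention of block index times block size plus thread index.

```python
MULTIPROCESSORS = 14   # Tesla C2050, as stated in the abstract
CORES_PER_MP = 32      # as stated in the abstract

def total_cores():
    """Total cores on the device: multiprocessors times cores per multiprocessor."""
    return MULTIPROCESSORS * CORES_PER_MP

def blocks_needed(num_particles, threads_per_block):
    """Ceiling division: enough blocks so every particle gets a thread."""
    return (num_particles + threads_per_block - 1) // threads_per_block

def global_thread_index(block_index, thread_index, threads_per_block):
    """CUDA-style flat index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_index * threads_per_block + thread_index

def particle_for_thread(block_index, thread_index, threads_per_block, num_particles):
    """Return the particle a thread would track, or None for padding threads."""
    idx = global_thread_index(block_index, thread_index, threads_per_block)
    return idx if idx < num_particles else None

# Hypothetical launch configuration: 1000 particles, 256 threads per block.
grid_size = blocks_needed(1000, 256)
last_particle = particle_for_thread(3, 231, 256, 1000)
```

The bounds check matters because the last block is usually only partly filled: threads whose flat index runs past the particle count have nothing to track.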
Multi-thread parallel algorithm for reconstructing 3D large-scale porous structures

NASA Astrophysics Data System (ADS)

Ju, Yang; Huang, Yaohui; Zheng, Jiangtao; Qian, Xu; Xie, Heping; Zhao, Xi

2017-04-01

Geomaterials inherently contain many discontinuous, multi-scale, geometrically irregular pores, forming a complex porous structure that governs their mechanical and transport properties. The development of an efficient reconstruction method for representing porous structures can significantly contribute toward providing a better understanding of the governing effects of porous structures on the properties of porous materials. In order to improve the efficiency of reconstructing large-scale porous structures, a multi-thread parallel scheme was incorporated into the simulated annealing reconstruction method. In the method, four correlation functions, which include the two-point probability function, the linear-path functions for the pore phase and the solid phase, and the fractal system function for the solid phase, were employed for better reproduction of the complex well-connected porous structures. In addition, a random sphere packing method and a self-developed pre-conditioning method were incorporated to cast the initial reconstructed model and select independent interchanging pairs for parallel multi-thread calculation, respectively. The accuracy of the proposed algorithm was evaluated by examining the similarity between the reconstructed structure and a prototype in terms of their geometrical, topological, and mechanical properties. Comparisons of the reconstruction efficiency of porous models with various scales indicated that the parallel multi-thread scheme significantly shortened the execution time for reconstruction of a large-scale well-connected porous model compared to a sequential single-thread procedure.
Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms

NASA Astrophysics Data System (ADS)

Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian

2018-01-01

We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
Anti-parallel EUV Flows Observed along Active Region Filament Threads with Hi-C

NASA Astrophysics Data System (ADS)

Alexander, Caroline E.; Walsh, Robert W.; Régnier, Stéphane; Cirtain, Jonathan; Winebarger, Amy R.; Golub, Leon; Kobayashi, Ken; Platt, Simon; Mitchell, Nick; Korreck, Kelly; DePontieu, Bart; DeForest, Craig; Weber, Mark; Title, Alan; Kuzin, Sergey

2013-09-01

Plasma flows within prominences/filaments have been observed for many years and hold valuable clues concerning the mass and energy balance within these structures. Previous observations of these flows primarily come from Hα and cool extreme-ultraviolet (EUV) lines (e.g., 304 Å) where estimates of the size of the prominence threads have been limited by the resolution of the available instrumentation. Evidence of "counter-streaming" flows has previously been inferred from these cool plasma observations, but now, for the first time, these flows have been directly imaged along fundamental filament threads within the million degree corona (at 193 Å). In this work, we present observations of an AR filament observed with the High-resolution Coronal Imager (Hi-C) that exhibits anti-parallel flows along adjacent filament threads. Complementary data from the Solar Dynamics Observatory (SDO)/Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager are presented. The ultra-high spatial and temporal resolution of Hi-C allows the anti-parallel flow velocities to be measured (70-80 km s^-1) and gives an indication of the resolvable thickness of the individual strands (0.''8 ± 0.''1). The temperature of the plasma flows was estimated to be log T (K) = 5.45 ± 0.10 using Emission Measure loci analysis. We find that SDO/AIA cannot clearly observe these anti-parallel flows or measure their velocity or thread width due to its larger pixel size. We suggest that anti-parallel/counter-streaming flows are likely commonplace within all filaments and are currently not observed in EUV due to current instrument spatial resolution.
Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression

NASA Technical Reports Server (NTRS)

Klimesh, Matthew A.; Kiely, Aaron B.

2014-01-01

A family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.

Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shan, Hongzhang; Williams, Samuel; Jong, Wibe de

In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65x better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.
ANTI-PARALLEL EUV FLOWS OBSERVED ALONG ACTIVE REGION FILAMENT THREADS WITH HI-C

DOE Office of Scientific and Technical Information (OSTI.GOV)

Alexander, Caroline E.; Walsh, Robert W.; Régnier, Stéphane

Plasma flows within prominences/filaments have been observed for many years and hold valuable clues concerning the mass and energy balance within these structures. Previous observations of these flows primarily come from Hα and cool extreme-ultraviolet (EUV) lines (e.g., 304 Å) where estimates of the size of the prominence threads have been limited by the resolution of the available instrumentation. Evidence of 'counter-streaming' flows has previously been inferred from these cool plasma observations, but now, for the first time, these flows have been directly imaged along fundamental filament threads within the million degree corona (at 193 Å). In this work, we present observations of an AR filament observed with the High-resolution Coronal Imager (Hi-C) that exhibits anti-parallel flows along adjacent filament threads. Complementary data from the Solar Dynamics Observatory (SDO)/Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager are presented. The ultra-high spatial and temporal resolution of Hi-C allows the anti-parallel flow velocities to be measured (70-80 km s^-1) and gives an indication of the resolvable thickness of the individual strands (0.''8 ± 0.''1). The temperature of the plasma flows was estimated to be log T (K) = 5.45 ± 0.10 using Emission Measure loci analysis. We find that SDO/AIA cannot clearly observe these anti-parallel flows or measure their velocity or thread width due to its larger pixel size. We suggest that anti-parallel/counter-streaming flows are likely commonplace within all filaments and are currently not observed in EUV due to current instrument spatial resolution.
Final report for the Tera Computer TTI CRADA

DOE Office of Scientific and Technical Information (OSTI.GOV)

Davidson, G.S.; Pavlakos, C.; Silva, C.

1997-01-01

Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTA parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.
Performance of the Heavy Flavor Tracker (HFT) detector in star experiment at RHIC

NASA Astrophysics Data System (ADS)

Alruwaili, Manal

With the growing technology, the number of processors is becoming massive. Current supercomputer processing will be available on desktops in the next decade. For mass scale application software development on massive parallel computing available on desktops, existing popular languages with large libraries have to be augmented with new constructs and paradigms that exploit massive parallel computing and distributed memory models while retaining the user-friendliness. Currently available object-oriented languages for massive parallel computing, such as Chapel, X10 and UPC++, exploit distributed computing, data parallel computing and thread-parallelism at the process level in the PGAS (Partitioned Global Address Space) memory model. However, they lack: 1) any extension for object distribution to exploit the PGAS model; 2) the flexibility of migrating or cloning an object between places to exploit load balancing; and 3) the programming paradigms that will result from the integration of data and thread-level parallelism and object distribution. In the proposed thesis, I compare different languages in the PGAS model; propose new constructs that extend C++ with object distribution, object cloning and object migration; and integrate PGAS based process constructs with these extensions on distributed objects. Also, a new paradigm, MIDD (Multiple Invocation Distributed Data), is presented in which different copies of the same class can be invoked and work on different elements of distributed data concurrently using remote method invocations. I present new constructs, their grammar and their behavior. The new constructs are explained using simple programs utilizing these constructs.
Parallel fast multipole boundary element method applied to computational homogenization

NASA Astrophysics Data System (ADS)

Ptaszny, Jacek

2018-01-01

In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for the 3D elasticity problem are developed and applied to the computational homogenization of a solid containing spherical voids. The system of equations is solved by using the GMRES iterative solver. The boundary of the body is discretized by using quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of the number of threads is examined.
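The partitioning assumption stated above, disjoint sets of equal or approximately equal size with one set per thread, can be sketched in Python as follows. The function name and the integer node identifiers are hypothetical; the abstract does not say how the sets are actually formed, so contiguous slices are used here only as the simplest choice.

```python
def partition_evenly(items, num_threads):
    """Split items into num_threads disjoint contiguous sets
    whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_threads)
    parts, start = [], 0
    for t in range(num_threads):
        # The first `extra` threads take one additional item each.
        size = base + (1 if t < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts

nodes = list(range(10))            # hypothetical tree-node identifiers
parts = partition_evenly(nodes, 4)
# parts == [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

With 10 nodes and 4 threads, two threads receive three nodes and two receive two, so no thread holds more than one node above the minimum.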
Block-Parallel Data Analysis with DIY2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morozov, Dmitriy; Peterka, Tom

DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.
Mobile Thread Task Manager

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.

2013-01-01

The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTM architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
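Drawing jobs from a global queue, the load-balancing idea mentioned above, can be sketched with Python's standard library. This is not the MTTM interface: the function names, the job format, and the thread count are assumptions, and the sketch does not model thread migration or cache homing.

```python
import queue
import threading

def run_jobs(jobs, num_threads=4):
    """Each worker repeatedly draws the next (function, argument) job
    from one shared queue until the queue is empty."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                func, arg = pending.get_nowait()
            except queue.Empty:
                return                  # queue drained: this worker is done
            value = func(arg)
            with results_lock:          # protect the shared results list
                results.append((arg, value))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def square(x):
    return x * x

results = run_jobs([(square, n) for n in range(20)], num_threads=4)
```

Because a thread takes a new job only when it has finished the previous one, faster threads naturally process more jobs; the order of `results` therefore depends on scheduling and should not be relied on.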
CMS event processing multi-core efficiency status

NASA Astrophysics Data System (ADS)

Jones, C. D.; CMS Collaboration

2017-10-01

In 2015, CMS was the first LHC experiment to begin using a multi-threaded framework for doing event processing. This new framework utilizes Intel's Threading Building Blocks library to manage concurrency via a task based processing model. During the 2015 LHC run period, CMS only ran reconstruction jobs using multiple threads because only those jobs were sufficiently thread efficient. Recent work now allows simulation and digitization to be thread efficient. In addition, during 2015 the multi-threaded framework could run events in parallel but could only use one thread per event. Work done in 2016 now allows multiple threads to be used while processing one event. In this presentation we will show how these recent changes have improved CMS's overall threading and memory efficiency and we will discuss work to be done to further increase those efficiencies.
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

DOE PAGES

Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel

2014-07-22

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. We found that a major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos' abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. Furthermore, the Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics.

PubMed

Ragothaman, Anjani; Boddu, Sairam Chowdary; Kim, Nayong; Feinstein, Wei; Brylinski, Michal; Jha, Shantenu; Kim, Joohyun

2014-01-01

While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as prokaryotes is large, typically many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly amenable for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In the field of additive manufacturing, current research on data processing mainly focuses on slicing large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors in the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. A second parallel algorithm, based on data parallelism, is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of multi-core CPU hardware and accelerates the slicing process; compared with the data parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
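The two laws named in the abstract can be written as short functions. This is a generic sketch in Python, not the authors' analysis: the parallel fraction of 0.9 is an assumed value used only for illustration, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    # Fixed problem size: the serial part (1 - p) does not shrink with more threads.
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_threads)


def gustafson_speedup(parallel_fraction, n_threads):
    # Scaled problem size: p is the parallel share of the time measured on the
    # parallel machine, and the workload grows with the thread count.
    p = parallel_fraction
    return (1.0 - p) + p * n_threads


for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.9, n), 2), round(gustafson_speedup(0.9, n), 2))
```

Under Amdahl's law the speedup saturates at 1 / (1 - p) as threads are added to a fixed workload, whereas under Gustafson's law it keeps growing because the parallel part of the work (here, the number of layers) grows too.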
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization

PubMed Central

Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan

2017-01-01

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization.

PubMed

Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan

2017-08-04

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Split off-specular reflection and surface scattering from woven materials

NASA Astrophysics Data System (ADS)

Pont, Sylvia C.; Koenderink, Jan J.

2003-03-01

We measured radiance distributions for black lining cloth and copper gauze using the convenient technique of wrapping the materials around a circular cylinder, irradiating it with a parallel light source and collecting the scattered radiance by a digital camera. One family of parallel threads (weave or weft) was parallel to the cylinder generator. The most salient features for such glossy plane weaves are a splitting up of the reflection peak due to the wavy variations in local slopes of the threads around the cylinders and a surface scattering lobe due to the threads that run along the cylinder. These scattering characteristics are quite different from the (off-)specular peaks and lobes that were found before for random rough specular surfaces. The split off-specular reflection is due to the regular structures in our samples of man-made materials. We derived simple approximations for these reflectance characteristics using geometrical optics.
Enabling the High Level Synthesis of Data Analytics Accelerators

DOE Office of Scientific and Technical Information (OSTI.GOV)

Minutoli, Marco; Castellana, Vito G.; Tumeo, Antonino

Conventional High Level Synthesis (HLS) tools mainly target compute intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dynamic task parallelism. We discuss an architectural template based around a distributed controller to efficiently exploit thread level parallelism. We present a memory interface that supports parallel memory subsystems and enables implementing atomic memory operations. We introduce a dynamic task scheduling approach to efficiently execute heavily unbalanced workloads. The templates are validated by synthesizing queries from the Lehigh University Benchmark (LUBM), a well-known SPARQL benchmark.
Constructing Neuronal Network Models in Massively Parallel Environments.

PubMed

Ippen, Tammo; Eppler, Jochen M; Plesser, Hans E; Diesmann, Markus

2017-01-01

Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers.
Constructing Neuronal Network Models in Massively Parallel Environments

PubMed Central

Ippen, Tammo; Eppler, Jochen M.; Plesser, Hans E.; Diesmann, Markus

2017-01-01

Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers. PMID:28559808
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth; Geveci, Berk

2014-11-01

The evolution of the computing world from teraflop to petaflop has been relatively effortless, with several of the existing programming models scaling effectively to the petascale. The migration to exascale, however, poses considerable challenges. All industry trends indicate that the exascale machine will be built using processors containing hundreds to thousands of cores per chip. It can be inferred that efficient concurrency on exascale machines requires a massive number of concurrent threads, each performing many operations on a localized piece of data. Currently, visualization libraries and applications are based on what is known as the visualization pipeline. In the pipeline model, algorithms are encapsulated as filters with inputs and outputs. These filters are connected by setting the output of one component to the input of another. Parallelism in the visualization pipeline is achieved by replicating the pipeline for each processing thread. This works well for today’s distributed memory parallel computers but cannot be sustained when operating on processors with thousands of cores. Our project investigates a new visualization framework designed to exhibit the pervasive parallelism necessary for extreme scale machines. Our framework achieves this by defining algorithms in terms of worklets, which are localized stateless operations. Worklets are atomic operations that execute when invoked, unlike filters, which execute when a pipeline request occurs. The worklet design allows execution on a massive number of lightweight threads with minimal overhead. Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale machine.
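The worklet idea, a stateless operation applied independently to each piece of data, can be sketched as follows. This is a minimal Python illustration and not the framework's API: `magnitude_worklet` and `dispatch` are hypothetical names, and a standard-library thread pool stands in for the lightweight threads described in the abstract.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def magnitude_worklet(vector):
    # Stateless: the output depends only on this element's input.
    x, y, z = vector
    return math.sqrt(x * x + y * y + z * z)


def dispatch(worklet, field, max_workers=4):
    # Hypothetical scheduler: applies the worklet to every element independently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worklet, field))


field = [(3.0, 4.0, 0.0), (0.0, 0.0, 2.0), (1.0, 2.0, 2.0)]
magnitudes = dispatch(magnitude_worklet, field)
print(magnitudes)
```

Because the worklet keeps no state between calls, the scheduler is free to run the calls in any order and on any number of threads. In CPython the thread pool shows the structure rather than a real speedup for CPU-bound work.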
Parallelization and checkpointing of GPU applications through program transformation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Solano-Quinde, Lizandro Damian

2012-01-01

GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed.
The goal of this work is to exploit higher levels of parallelism and to develop support for application-level fault tolerance in applications using multiple GPUs. Our techniques reduce the burden of enhancing single-GPU applications to support these features. To achieve our goal, this work designs and implements a framework for enhancing a single-GPU OpenCL application through application transformation.

OpenGeoSys-GEMS: Hybrid parallelization of a reactive transport code with MPI and threads

NASA Astrophysics Data System (ADS)

Kosakowski, G.; Kulik, D. A.; Shao, H.

2012-04-01

OpenGeoSys-GEMS is a general-purpose reactive transport code based on the operator splitting approach. The code couples the Finite-Element groundwater flow and multi-species transport modules of the OpenGeoSys (OGS) project (http://www.ufz.de/index.php?en=18345) with the GEM-Selektor research package to model thermodynamic equilibrium of aquatic (geo)chemical systems utilizing the Gibbs Energy Minimization approach (http://gems.web.psi.ch/). The combination of OGS and the GEM-Selektor kernel (GEMS3K) is highly flexible due to the object-oriented modular code structures and the well-defined (memory-based) data exchange modules. As with other reactive transport codes, the practical applicability of OGS-GEMS is often hampered by long calculation times and large memory requirements.
• For realistic geochemical systems, which might include dozens of mineral phases and several (non-ideal) solid solutions, the time needed to solve the chemical system with GEMS3K may increase exceptionally.
• The codes are coupled in a sequential non-iterative loop. In order to keep the accuracy, the time step size is restricted. In combination with a fine spatial discretization, the time step size may become very small, which increases calculation times drastically even for small 1D problems.
• The current version of OGS is not optimized for memory use, and the MPI version of OGS does not distribute data between nodes. Even for moderately small 2D problems, the number of MPI processes that fit into the memory of up-to-date workstations or HPC hardware is limited.
One strategy to overcome the above-mentioned restrictions of OGS-GEMS is to parallelize the coupled code. For OGS a parallelized version already exists. It is based on a domain decomposition method implemented with MPI and provides a parallel solver for fluid and mass transport processes.
In the coupled code, after solving fluid flow and solute transport, geochemical calculations are done in the form of a central loop over all finite element nodes with calls to GEMS3K and consecutive calculations of changed material parameters. In a first step, the existing MPI implementation was utilized to parallelize this loop. Calculations were split between the MPI processes, and afterwards data was synchronized using MPI communication routines. Furthermore, multi-threaded calculation of the loop was implemented with the help of the Boost thread library (http://www.boost.org). This implementation provides a flexible environment to distribute calculations between several threads. For each MPI process, at least one and up to several dozen worker threads are spawned. These threads do not replicate the complete OGS-GEM data structure and use only a limited amount of memory. Calculation of the central geochemical loop is shared between all threads. Synchronization between the threads is done by barrier commands. The overall number of local threads times MPI processes should match the number of available computing nodes. The combination of multi-threading and MPI provides an effective and flexible environment to speed up OGS-GEMS calculations while limiting the required memory use. Test calculations on different hardware show that for certain types of applications tremendous speedups are possible.
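The threading scheme described above, a central loop over nodes shared between worker threads that synchronize at a barrier, can be sketched as follows. This is a Python illustration only, not the OGS-GEMS code: `solve_chemistry` is a hypothetical stand-in for a per-node solver call, and the neighbour update after the barrier is an invented follow-up step that shows why the barrier is needed.

```python
import threading


def solve_chemistry(node_value):
    # Hypothetical stand-in for a per-node chemical solver call.
    return node_value * node_value + 1.0


def run_step(node_values, n_threads):
    n = len(node_values)
    results = [None] * n
    updated = [None] * n
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Each thread handles a strided share of the central node loop.
        for i in range(tid, n, n_threads):
            results[i] = solve_chemistry(node_values[i])
        # Wait until every thread has finished its share.
        barrier.wait()
        # Safe only after the barrier: this reads results written by other threads.
        for i in range(tid, n, n_threads):
            updated[i] = results[i] + results[(i + 1) % n]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return updated


print(run_step([0.0, 1.0, 2.0, 3.0, 4.0], n_threads=3))
```

All threads write into shared lists instead of holding private copies of the data, which mirrors the memory argument made in the abstract. Without the barrier, a thread could read a neighbour's result before it had been computed.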
Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics

PubMed Central

Ragothaman, Anjani; Feinstein, Wei; Jha, Shantenu; Kim, Joohyun

2014-01-01

While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as prokaryotes is large, typically many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, and is particularly amenable to small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. PMID:24995285
Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

NASA Astrophysics Data System (ADS)

Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

2018-05-01

Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. 
The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
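The link between adaptive step sizes, vector width, and divergence can be made concrete with a toy cost model. This is a Python sketch under stated assumptions, not the authors' analysis: every ODE in a vector group is assumed to occupy its lane until the slowest ODE in the group finishes, and the step counts are made up.

```python
def lane_utilization(steps_per_ode, vector_width):
    # steps_per_ode[k]: integrator steps ODE k needs (varies with adaptive step size).
    useful = sum(steps_per_ode)
    issued = 0
    for start in range(0, len(steps_per_ode), vector_width):
        group = steps_per_ode[start:start + vector_width]
        # All lanes in a group stay occupied until the slowest ODE finishes.
        issued += vector_width * max(group)
    return useful / issued


steps = [10, 12, 11, 40, 9, 10, 13, 11]  # made-up step counts
for width in (1, 2, 4, 8):
    print(width, round(lane_utilization(steps, width), 3))
```

In this model a single stiff case (40 steps) drags down every lane that shares its group, so the fraction of useful work drops as the group gets wider. That is consistent with the abstract's observation that the wider GPU vector width incurs more divergence than the narrower CPU and Xeon Phi vector units.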
Memory Benchmarks for SMP-Based High Performance Parallel Computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoo, A B; de Supinski, B; Mueller, F

2001-11-20

As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominate the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.
A software bus for thread objects

NASA Technical Reports Server (NTRS)

Callahan, John R.; Li, Dehuai

1995-01-01

The authors have implemented a software bus for lightweight threads in an object-oriented programming environment that allows for rapid reconfiguration and reuse of thread objects in discrete-event simulation experiments. While previous research in object-oriented, parallel programming environments has focused on direct communication between threads, our lightweight software bus, called the MiniBus, provides a means to isolate threads from their contexts of execution by restricting communications between threads to message-passing via their local ports only. The software bus maintains a topology of connections between these ports. It routes, queues, and delivers messages according to this topology. This approach allows for rapid reconfiguration and reuse of thread objects in other systems without making changes to the specifications or source code. A layered approach that provides the needed transparency to developers is presented. Examples of using the MiniBus are given, and the value of bus architectures in building and conducting simulations of discrete-event systems is discussed.
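The port-and-topology design described above can be sketched in a few lines. This Python example is illustrative and is not the MiniBus interface: the class `SoftwareBus`, its method names, and the port names are all hypothetical.

```python
import queue


class SoftwareBus:
    # Hypothetical minimal bus; names and methods are illustrative, not the MiniBus API.
    def __init__(self):
        self.inboxes = {}   # input port name -> queue of delivered messages
        self.topology = {}  # output port name -> list of connected input ports

    def add_input_port(self, name):
        self.inboxes[name] = queue.Queue()

    def connect(self, out_port, in_port):
        self.topology.setdefault(out_port, []).append(in_port)

    def disconnect(self, out_port, in_port):
        self.topology[out_port].remove(in_port)

    def send(self, out_port, message):
        # The sender names only its own local port; the bus decides who receives.
        for in_port in self.topology.get(out_port, []):
            self.inboxes[in_port].put(message)

    def receive(self, in_port):
        # Raises queue.Empty if nothing has been delivered to this port.
        return self.inboxes[in_port].get_nowait()


bus = SoftwareBus()
bus.add_input_port("sink.in")
bus.add_input_port("logger.in")
bus.connect("source.out", "sink.in")
bus.send("source.out", "tick 1")

# Reconfigure the topology; the sending code is unchanged.
bus.connect("source.out", "logger.in")
bus.send("source.out", "tick 2")
```

A sender never names its receiver, so rewiring the topology changes who gets a message without touching the sender's code. `queue.Queue` is thread-safe, so the same bus object could be shared by sender and receiver threads.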
A path-level exact parallelization strategy for sequential simulation

NASA Astrophysics Data System (ADS)

Peredo, Oscar F.; Baeza, Daniel; Ortiz, Julián M.; Herrero, José R.

2018-01-01

Sequential Simulation is a well known method in geostatistical modelling. Following the Bayesian approach for simulation of conditionally dependent random events, Sequential Indicator Simulation (SIS) method draws simulated values for K categories (categorical case) or classes defined by K different thresholds (continuous case). Similarly, Sequential Gaussian Simulation (SGS) method draws simulated values from a multivariate Gaussian field. In this work, a path-level approach to parallelize SIS and SGS methods is presented. A first stage of re-arrangement of the simulation path is performed, followed by a second stage of parallel simulation for non-conflicting nodes. A key advantage of the proposed parallelization method is to generate identical realizations as with the original non-parallelized methods. Case studies are presented using two sequential simulation codes from GSLIB: SISIM and SGSIM. Execution time and speedup results are shown for large-scale domains, with many categories and maximum kriging neighbours in each case, achieving high speedup results in the best scenarios using 16 threads of execution in a single machine.
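The two stages described above, re-arranging the path and then simulating non-conflicting nodes in parallel, can be illustrated with a simple dependency-levelling sketch. This is a Python illustration and not the authors' algorithm: the function name, the neighbourhood dictionary, and the 1D example grid are all assumptions made for the example.

```python
def schedule_levels(path, neighbours):
    # path: node ids in the order a sequential run would visit them.
    # neighbours: node id -> ids whose simulated values that node may read.
    position = {node: k for k, node in enumerate(path)}
    level = {}
    for node in path:
        # Only neighbours visited earlier in the path are true dependencies.
        deps = [m for m in neighbours[node] if position[m] < position[node]]
        level[node] = 1 + max((level[m] for m in deps), default=-1)
    stages = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for node in path:
        stages[level[node]].append(node)
    return stages


# Hypothetical 1D grid of six nodes; each node's neighbourhood is its adjacent nodes.
grid_neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
random_path = [0, 3, 5, 1, 4, 2]
print(schedule_levels(random_path, grid_neighbours))
```

Nodes placed in the same stage never read each other's values, so they can be simulated concurrently. To reproduce the sequential realization exactly, each node would still have to ignore neighbours that come later in the original path and draw its random numbers from a per-node stream; both requirements are assumptions of this sketch rather than details taken from the abstract.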
Multi-petascale highly efficient parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaflop-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC). The ASIC nodes are interconnected by a five-dimensional torus network that maximizes the throughput of packet communications between nodes and minimizes latency. The network implements a collective network and a global asynchronous network that provides global barrier and notification functions. The node design integrates a list-based prefetcher. The memory system implements transaction memory, thread level speculation, and a multiversioning cache that improves the soft error rate and at the same time supports DMA functionality, allowing for parallel processing message-passing.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

NASA Astrophysics Data System (ADS)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction, multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate and to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Thread mapping using system-level model for shared memory multicores

NASA Astrophysics Data System (ADS)

Mitra, Reshmi

Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that users can explore the design space within reasonable machine hours, without a thorough understanding of how the code interacts with the platform. Response time is related to the cycles per instruction retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on a Markov Chain Model (MCM) and a Model Tree (MT), for system-level steady state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady state CPI results with 12 different MSs is less than 1%. The total run time of the model is of the order of minutes, whereas the actual application execution time is of the order of days.
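How a Markov chain over thread states can yield a steady-state CPI estimate is sketched below. This is a Python toy model and not the framework from the abstract: the two-state chain, the transition probabilities, the base CPI, and the rule that instructions retire only during active cycles are all assumptions made for illustration.

```python
def stationary_distribution(transition, iterations=1000):
    # transition[i][j] = probability of moving from state i to state j in one cycle.
    n = len(transition)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        # One step of power iteration: dist <- dist * transition.
        dist = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical probabilities: state 0 = active, state 1 = suspended.
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi_active, pi_suspended = stationary_distribution(P)

base_cpi = 1.2  # assumed cycles per instruction while the thread is active
# Assumption: instructions retire only in active cycles, so stalls inflate the CPI.
effective_cpi = base_cpi / pi_active
print(round(pi_active, 4), round(effective_cpi, 4))
```

For a two-state chain the long-run share of active cycles also has the closed form b / (a + b), with a the active-to-suspended and b the suspended-to-active probability; here that gives 0.3 / 0.4 = 0.75, which the power iteration reproduces.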
ASC-ATDM Performance Portability Requirements for 2015-2019

DOE Office of Scientific and Technical Information (OSTI.GOV)

Edwards, Harold C.; Trott, Christian Robert

This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a. Kokkos) project for 2015-2019. The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread-parallel programming model and standard C++ library-based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel patterns including parallel-for, parallel-reduce, and parallel-scan. Current R&D is focusing on hierarchical parallel patterns such as a directed acyclic graph (DAG) of asynchronous tasks where each task contains nested data parallel algorithms. This five-year plan includes R&D required to fully and performance portably exploit thread parallelism across current and anticipated next generation platforms (NGP). The Kokkos library is being evaluated by many projects exploring algorithms and code design for NGP. Some production libraries and applications such as Trilinos and LAMMPS have already committed to Kokkos as their foundation for manycore parallelism and performance portability. These five-year requirements include support required for current and anticipated ASC projects to be effective and productive in their use of Kokkos on NGP. The greatest risk to the success of Kokkos and ASC projects relying upon Kokkos is a lack of staffing resources to support Kokkos to the degree needed by these ASC projects. This support includes up-to-date tutorials, documentation, multi-platform (hardware and software stack) testing, minor feature enhancements, thread-scalable algorithm consulting, and managing collaborative R&D.
Simulation of LHC events on a millions threads

NASA Astrophysics Data System (ADS)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.; Papka, M. E.; Benjamin, D. P.

2015-12-01

Demand for Grid resources is expected to double during LHC Run II as compared to Run I; the capacity of the Grid, however, will not double. The HEP community must consider how to bridge this computing gap by targeting larger compute resources and using the available compute resources as efficiently as possible. Argonne's Mira, the fifth fastest supercomputer in the world, can run roughly five times the number of parallel processes that the ATLAS experiment typically uses on the Grid. We ported Alpgen, a serial x86 code, to run as a parallel application under MPI on the Blue Gene/Q architecture. By analysis of the Alpgen code, we reduced the memory footprint to allow running 64 threads per node, utilizing the four hardware threads available per core on the PowerPC A2 processor. Event generation and unweighting, typically run as independent serial phases, are coupled together in a single job in this scenario, reducing intermediate writes to the filesystem. By these optimizations, we have successfully run LHC proton-proton physics event generation at the scale of a million threads, filling two-thirds of Mira.
Massively Parallel Dantzig-Wolfe Decomposition Applied to Traffic Flow Scheduling

NASA Technical Reports Server (NTRS)

Rios, Joseph Lucio; Ross, Kevin

2009-01-01

Optimal scheduling of air traffic over the entire National Airspace System is a computationally difficult task. To speed computation, Dantzig-Wolfe decomposition is applied to a known linear integer programming approach for assigning delays to flights. The optimization model is proven to have the block-angular structure necessary for Dantzig-Wolfe decomposition. The subproblems for this decomposition are solved in parallel via independent computation threads. Experimental evidence suggests that as the number of subproblems/threads increases (and their respective sizes decrease), the solution quality, convergence, and runtime improve. A demonstration of this is provided by using one flight per subproblem, which is the finest possible decomposition. This results in thousands of subproblems and associated computation threads. This massively parallel approach is compared to one with few threads and to standard (non-decomposed) approaches in terms of solution quality and runtime. Since this method generally provides a non-integral (relaxed) solution to the original optimization problem, two heuristics are developed to generate an integral solution. Dantzig-Wolfe followed by these heuristics can provide a near-optimal (sometimes optimal) solution to the original problem hundreds of times faster than standard (non-decomposed) approaches. In addition, when massive decomposition is employed, the solution is shown to be more likely integral, which obviates the need for an integerization step. These results indicate that nationwide, real-time, high fidelity, optimal traffic flow scheduling is achievable for (at least) 3 hour planning horizons.
Parallelization strategies for continuum-generalized method of moments on the multi-thread systems

NASA Astrophysics Data System (ADS)

Bustamam, A.; Handhika, T.; Ernastuti, Kerami, D.

2017-07-01

The Continuum-Generalized Method of Moments (C-GMM) addresses a shortfall of the Generalized Method of Moments (GMM), which is not as efficient as the maximum likelihood estimator, by using a continuum set of moment conditions in a GMM framework. However, this computation takes a very long time because the regularization parameter must be optimized. Unfortunately, these calculations are processed sequentially, even though all modern computers are now supported by hierarchical memory systems and hyperthreading technology, which allow for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on multi-thread systems. First, parallel regions are detected in the original C-GMM algorithm. There are two parallel regions in the original C-GMM algorithm that contribute significantly to the reduction of computational time: the outer loop and the inner loop. This parallel algorithm is then implemented with a standard shared-memory application programming interface, i.e. Open Multi-Processing (OpenMP). The experiment shows that outer-loop parallelization is the best strategy for any number of observations.
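The difference between the two parallel regions named in this abstract can be sketched structurally. The example below is an illustration only: it uses Python's standard thread pool rather than OpenMP, `cell` is a hypothetical stand-in for the per-iteration work, and because of CPython's global interpreter lock it shows how the work is divided rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def cell(i, j):
    """Hypothetical per-iteration work of the nested loop."""
    return i * j


def serial(n_outer, n_inner):
    return [sum(cell(i, j) for j in range(n_inner)) for i in range(n_outer)]


def outer_parallel(n_outer, n_inner, workers=4):
    """One task per outer iteration; each task runs its inner loop serially."""
    def row(i):
        return sum(cell(i, j) for j in range(n_inner))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row, range(n_outer)))


def inner_parallel(n_outer, n_inner, workers=4):
    """Outer loop stays serial; the inner iterations are shared out each time,
    so there is one synchronization point per outer iteration."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(n_outer):
            parts = pool.map(lambda j, i=i: cell(i, j), range(n_inner))
            results.append(sum(parts))
    return results


print(outer_parallel(4, 5))
print(inner_parallel(4, 5))
```

Outer-loop parallelization hands out larger tasks and synchronizes once, while inner-loop parallelization synchronizes on every outer iteration; which one wins in practice depends on the iteration counts and per-iteration cost.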
Scaling Irregular Applications through Data Aggregation and Software Multithreading

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morari, Alessandro; Tumeo, Antonino; Chavarría-Miranda, Daniel

Bioinformatics, data analytics, semantic databases, knowledge discovery are emerging high performance application areas that exploit dynamic, linked data structures such as graphs, unbalanced trees or unstructured grids. These data structures usually are very large, requiring significantly more memory than available on single shared memory systems. Additionally, these data structures are difficult to partition on distributed memory systems. They also present poor spatial and temporal locality, thus generating unpredictable memory and network accesses. The Partitioned Global Address Space (PGAS) programming model seems suitable for these applications, because it allows using a shared memory abstraction across distributed-memory clusters. However, current PGAS languages and libraries are built to target regular remote data accesses and block transfers. Furthermore, they usually rely on the Single Program Multiple Data (SPMD) parallel control model, which is not well suited to the fine grained, dynamic and unbalanced parallelism of irregular applications. In this paper we present GMT (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple fork/join parallelism and provides automatic load balancing on a per node basis. It implements multi-level aggregation and lightweight multithreading to maximize memory and network bandwidth with fine-grained data accesses and tolerate long data access latencies. A key innovation in the GMT runtime is its thread specialization (workers, helpers and communication threads) that realize the overall functionality. We compare our approach with other PGAS models, such as UPC running using GASNet, and hand-optimized MPI code on a set of typical large-scale irregular applications, demonstrating speedups of an order of magnitude.
Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

PubMed

Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

2018-05-03

Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4. under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.
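The wavefront idea mentioned in this abstract rests on a simple dependency property: every cell of the dynamic programming matrix on one anti-diagonal depends only on cells of the two previous anti-diagonals. The sketch below is not SeqAn code and does not use work stealing or SIMD. It assumes a plain global alignment score with illustrative match, mismatch and gap values, and fills each anti-diagonal with a thread pool, treating the end of each `map` call as a barrier.

```python
from concurrent.futures import ThreadPoolExecutor

MATCH, MISMATCH, GAP = 1, -1, -1  # illustrative scoring values (assumed)


def alignment_score_wavefront(a, b, workers=4):
    """Global alignment score, filled one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * GAP
    for j in range(1, m + 1):
        H[0][j] = j * GAP

    def fill(cell):
        i, j = cell
        sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        H[i][j] = max(H[i - 1][j - 1] + sub, H[i - 1][j] + GAP, H[i][j - 1] + GAP)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for d in range(2, n + m + 1):  # cells with i + j == d
            cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
            # Cells on one anti-diagonal are independent of each other;
            # consuming the map result acts as a barrier before the next one.
            list(pool.map(fill, cells))
    return H[n][m]


print(alignment_score_wavefront("GATTACA", "GATTACA"))
```

A static barrier per anti-diagonal is the simplest way to respect the dependencies; a dynamic scheme such as the work-stealing approach described in the abstract avoids idle threads on the short diagonals near the matrix corners.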
Dual-thread parallel control strategy for ophthalmic adaptive optics.

PubMed

Yu, Yongxin; Zhang, Yuhua

To improve ophthalmic adaptive optics speed and compensate for ocular wavefront aberration of high temporal frequency, the adaptive optics wavefront correction has been implemented with a control scheme including 2 parallel threads; one is dedicated to wavefront detection and the other conducts wavefront reconstruction and compensation. With a custom Shack-Hartmann wavefront sensor that measures the ocular wave aberration with 193 subapertures across the pupil, adaptive optics has achieved a closed loop updating frequency up to 110 Hz, and demonstrated robust compensation for ocular wave aberration up to 50 Hz in an adaptive optics scanning laser ophthalmoscope.
Dual-thread parallel control strategy for ophthalmic adaptive optics

PubMed Central

Yu, Yongxin; Zhang, Yuhua

2015-01-01

To improve ophthalmic adaptive optics speed and compensate for ocular wavefront aberration of high temporal frequency, the adaptive optics wavefront correction has been implemented with a control scheme including 2 parallel threads; one is dedicated to wavefront detection and the other conducts wavefront reconstruction and compensation. With a custom Shack-Hartmann wavefront sensor that measures the ocular wave aberration with 193 subapertures across the pupil, adaptive optics has achieved a closed loop updating frequency up to 110 Hz, and demonstrated robust compensation for ocular wave aberration up to 50 Hz in an adaptive optics scanning laser ophthalmoscope. PMID:25866498
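The two-thread split described here is a producer/consumer pipeline: one thread acquires measurements while the other turns the previous measurement into a correction. The sketch below is a generic illustration under assumed names, not the authors' control software. The "sensor frame" is a short list of made-up slope values, and the "reconstruction" is a placeholder that negates their mean.

```python
import queue
import threading


def run_pipeline(n_frames):
    """Run a detection thread and a reconstruction/compensation thread."""
    frames = queue.Queue(maxsize=2)  # small buffer keeps the loop latency low
    corrections = []

    def detect():
        for k in range(n_frames):
            frames.put([float(k)] * 4)  # stand-in for wavefront sensor slopes
        frames.put(None)  # sentinel: no more frames

    def reconstruct_and_compensate():
        while True:
            slopes = frames.get()
            if slopes is None:
                break
            # Placeholder reconstruction: cancel the mean measured slope.
            corrections.append(-sum(slopes) / len(slopes))

    threads = [
        threading.Thread(target=detect),
        threading.Thread(target=reconstruct_and_compensate),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return corrections


print(run_pipeline(5))
```

Because detection of frame k+1 can overlap with reconstruction of frame k, the loop rate is bounded by the slower of the two stages rather than by their sum.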
Parallel Implicit Runge-Kutta Methods Applied to Coupled Orbit/Attitude Propagation

NASA Astrophysics Data System (ADS)

Hatten, Noble; Russell, Ryan P.

2017-12-01

A variable-step Gauss-Legendre implicit Runge-Kutta (GLIRK) propagator is applied to coupled orbit/attitude propagation. Concepts previously shown to improve efficiency in 3DOF propagation are modified and extended to the 6DOF problem, including the use of variable-fidelity dynamics models. The impact of computing the stage dynamics of a single step in parallel is examined using up to 23 threads and 22 associated GLIRK stages; one thread is reserved for an extra dynamics function evaluation used in the estimation of the local truncation error. Efficiency is found to peak for typical examples when using approximately 8 to 12 stages for both serial and parallel implementations. Accuracy and efficiency compare favorably to explicit Runge-Kutta and linear-multistep solvers for representative scenarios. However, linear-multistep methods are found to be more efficient for some applications, particularly in a serial computing environment, or when parallelism can be applied across multiple trajectories.
Cribellate thread production in spiders: Complex processing of nano-fibres into a functional capture thread.

PubMed

Joel, Anna-Christin; Kappel, Peter; Adamova, Hana; Baumgartner, Werner; Scholz, Ingo

2015-11-01

Spider silk production has been studied intensively in the last years. However, capture threads of cribellate spiders employ an until now often unnoticed alternative of thread production. This thread in general is highly interesting, as it not only involves a controlled arrangement of three types of threads with one being nano-scale fibres (cribellate fibres), but also a special comb-like structure on the metatarsus of the fourth leg (calamistrum) for its production. We found the cribellate fibres organized as a mat, enclosing two parallel larger fibres (axial fibres) and forming the typical puffy structure of cribellate threads. Mat and axial fibres are punctiform connected to each other between two puffs, presumably by the action of the median spinnerets. However, this connection alone does not lead to the typical puffy shape of a cribellate thread. Removing the calamistrum, we found a functional capture thread still being produced, but the puffy shape of the thread was lost. Therefore, the calamistrum is not necessary for the extraction or combination of fibres, but for further processing of the nano-scale cribellate fibres. Using data from Uloborus plumipes we were able to develop a model of the cribellate thread production, probably universally valid for cribellate spiders. Copyright © 2015 Elsevier Ltd. All rights reserved.
Threaded average temperature thermocouple

NASA Technical Reports Server (NTRS)

Ward, Stanley W. (Inventor)

1990-01-01

A threaded average temperature thermocouple 11 is provided to measure the average temperature of a test situs of a test material 30. A ceramic insulator rod 15 with two parallel holes 17 and 18 through the length thereof is securely fitted in a cylinder 16, which is bored along the longitudinal axis of symmetry of threaded bolt 12. Threaded bolt 12 is composed of material having thermal properties similar to those of test material 30. Leads of a thermocouple wire 20 leading from a remotely situated temperature sensing device 35 are each fed through one of the holes 17 or 18, secured at head end 13 of ceramic insulator rod 15, and exit at tip end 14. Each lead of thermocouple wire 20 is bent into and secured in an opposite radial groove 25 in tip end 14 of threaded bolt 12. Resulting threaded average temperature thermocouple 11 is ready to be inserted into cylindrical receptacle 32. The tip end 14 of the threaded average temperature thermocouple 11 is in intimate contact with receptacle 32. A jam nut 36 secures the threaded average temperature thermocouple 11 to test material 30.

Automatic Multilevel Parallelization Using OpenMP

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

2002-01-01

In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
Longitudinal parallel compression suture to control postopartum hemorrhage due to placenta previa and accrete.

PubMed

Li, Guang-Tai; Li, Xiao-Fan; Wu, Baoping; Li, Guangrui

2016-04-01

To assess the efficacy and safety of longitudinal parallel compression suture to control heavy postpartum hemorrhage (PPH) in patients with placenta previa/accreta. Fifteen women received a longitudinal parallel compression suture to stop life-threatening PPH due to placenta previa with or without accreta during cesarean section. The suture apposed the anterior and posterior walls of the lower uterine segment together using an absorbable thread. A 70-mm round needle with a Number-1 absorbable thread was used. The point of needle entry was 1 cm above the upper margin of the cervix and 1 cm from the right lateral border of the lower segment of the anterior wall. The suture was threaded through the uterine cavity to the serosa of the posterior wall. Then, it was directed upward and threaded from the posterior to the anterior wall at ∼1-2 cm above the upper boundary of the lower uterine segment and 3 cm medial to the right margin of the uterus. Both ends of the suture were tied on the anterior aspect of the uterus. The left side was sutured in the same way. The success rate of the procedure was 86.7% (13/15). Two of 15 cases were concurrently administered gauze packing and achieved satisfactory hemostasis. All patients resumed a normal menstrual flow, and no postoperative anatomical or physiological abnormalities related to the suture were observed. Three women achieved further pregnancies after the procedure. Longitudinal parallel compression suture is a safe, easy, effective, practical, and conservative surgical technique to stop intractable PPH from the lower uterine segment, particularly in women who have a cesarean scar and placenta previa/accreta. Copyright © 2016. Published by Elsevier B.V.
Automatic Multilevel Parallelization Using OpenMP

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

2002-01-01

In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
Multithreaded Model for Dynamic Load Balancing Parallel Adaptive PDE Computations

NASA Technical Reports Server (NTRS)

Chrisochoides, Nikos

1995-01-01

We present a multithreaded model for the dynamic load-balancing of numerical, adaptive computations required for the solution of Partial Differential Equations (PDEs) on multiprocessors. Multithreading is used as a means of exploring concurrency at the processor level in order to tolerate synchronization costs inherent to traditional (non-threaded) parallel adaptive PDE solvers. Our preliminary analysis for parallel, adaptive PDE solvers indicates that multithreading can be used as a mechanism to mask overheads required for the dynamic balancing of processor workloads with computations required for the actual numerical solution of the PDEs. Also, multithreading can simplify the implementation of dynamic load-balancing algorithms, a task that is very difficult for traditional data parallel adaptive PDE computations. Unfortunately, multithreading does not always simplify program complexity, often makes code re-usability difficult, and increases software complexity.
Large Scale Document Inversion using a Multi-threaded Computing System

PubMed Central

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2018-01-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701
Large Scale Document Inversion using a Multi-threaded Computing System.

PubMed

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2017-06-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.
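The SPMD, hash-based inversion described in this abstract can be sketched on the CPU. The example below is not the authors' CUDA implementation: it assumes whitespace tokenization, uses Python dictionaries as the hash tables, and runs the same function on disjoint document partitions in a thread pool before merging the partial indexes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def invert_chunk(chunk):
    """Same program for every worker: build a local term -> doc-id index."""
    index = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():  # assumed tokenizer
            index[term].add(doc_id)
    return index


def build_inverted_index(docs, workers=4):
    """Partition the documents, invert each partition, then merge."""
    items = list(docs.items())
    chunks = [items[w::workers] for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(invert_chunk, chunks))
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return {term: sorted(ids) for term, ids in merged.items()}


docs = {
    1: "GPU computing for document indexing",
    2: "Inverted index for document retrieval",
    3: "GPU threads",
}
index = build_inverted_index(docs)
print(index["document"], index["gpu"])
```

Each document is visited once and each term costs one expected constant-time hash update, which is what makes the per-partition pass linear in the input size.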
Thread scheduling for GPU-based OPC simulation on multi-thread

NASA Astrophysics Data System (ADS)

Lee, Heejun; Kim, Sangwook; Hong, Jisuk; Lee, Sooryong; Han, Hwansoo

2018-03-01

As semiconductor product development based on shrinkage continues, the accuracy and difficulty required for model-based optical proximity correction (MBOPC) are increasing. OPC simulation time, which is the most time-consuming part of MBOPC, is rapidly increasing due to high pattern density in a layout and complex OPC models. To reduce OPC simulation time, we attempt to apply the graphics processing unit (GPU) to MBOPC because the OPC process is well suited to parallel programming. We address some issues that may typically happen during GPU-based OPC simulation in a multi-thread system, such as "out of memory" and "GPU idle time". To overcome these problems, we propose a thread scheduling method, which manages OPC jobs in multiple threads in such a way that simulation jobs from multiple threads are alternately executed on the GPU while correction jobs are executed at the same time on each CPU core. It was observed that GPU peak memory usage decreases by up to 35%, and MBOPC runtime also decreases by 4%. In cases where out-of-memory issues occur in a multi-threaded environment, the thread scheduler was used to improve MBOPC runtime by up to 23%.
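The scheduling idea, letting simulation jobs take turns on the GPU while correction jobs proceed on the CPU threads, can be illustrated with a lock. The sketch below is a simplified assumption about how such a scheduler might behave, not the authors' method: the "GPU" is a mutex, and `simulate` and `correct` are hypothetical placeholders. It records how many threads were ever inside the simulation stage at once.

```python
import threading


def run_jobs(n_threads=4, n_iters=3):
    gpu = threading.Lock()         # only one simulation job on the GPU at a time
    stats_lock = threading.Lock()  # protects the counters below
    stats = {"on_gpu": 0, "peak_on_gpu": 0, "done": 0}

    def simulate(job):
        with gpu:
            with stats_lock:
                stats["on_gpu"] += 1
                stats["peak_on_gpu"] = max(stats["peak_on_gpu"], stats["on_gpu"])
            result = job * 2       # placeholder for the GPU simulation
            with stats_lock:
                stats["on_gpu"] -= 1
        return result

    def correct(sim_result):
        return sim_result + 1      # placeholder for the CPU-side correction

    def worker(tid):
        for it in range(n_iters):
            sim = simulate(tid * n_iters + it)
            correct(sim)           # runs outside the GPU lock, so it can overlap
            with stats_lock:
                stats["done"] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


print(run_jobs())
```

Serializing the simulation stage bounds peak device memory to one job's working set, while the correction stage of one thread can overlap with the simulation stage of another.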
GPU accelerated dynamic functional connectivity analysis for functional MRI data.

PubMed

Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu

2015-07-01

Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
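Sliding-window connectivity is parallel by construction, because each window's correlation is independent of the others. The sketch below is a minimal CPU illustration, not the authors' OpenMP or CUDA code. It assumes two equal-length time courses, a hypothetical window length, and Pearson correlation as the connectivity measure, and assigns one window per task.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def sliding_window_connectivity(x, y, window, workers=4):
    """One correlation value per window position, computed as separate tasks."""
    starts = range(len(x) - window + 1)

    def one_window(s):
        return pearson(x[s:s + window], y[s:s + window])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_window, starts))


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
dfc = sliding_window_connectivity(x, y, window=3)
print(dfc)
```

For scale, the 7.7× speed-up on an 8-core processor reported in the abstract corresponds to a parallel efficiency of 7.7 / 8 ≈ 0.96.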
Bioinformatics algorithm based on a parallel implementation of a machine learning approach using transducers

NASA Astrophysics Data System (ADS)

Roche-Lima, Abiel; Thulasiram, Ruppa K.

2012-02-01

Finite automata in which each transition is augmented with an output label in addition to the familiar input label are considered finite-state transducers. Transducers have been used to analyze some fundamental issues in bioinformatics. Weighted finite-state transducers have been proposed for pairwise alignments of DNA and protein sequences, as well as for developing kernels for computational biology. Machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on conditional probability computation, which is calculated using techniques such as pair-database creation, normalization (with Maximum-Likelihood normalization) and parameter optimization (with Expectation-Maximization, EM). These techniques are intrinsically costly to compute, and even more so when applied to bioinformatics, because the database sizes are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Several experiments were conducted using the parallel and sequential algorithms on Westgrid (specifically, on the Breezy cluster). The results show that our parallel algorithm is scalable, because execution times are reduced considerably when the data size parameter is increased. Another experiment was conducted by changing the precision parameter; in this case, we obtain smaller execution times using the parallel algorithm. Finally, the number of threads used to execute the parallel algorithm on the Breezy cluster was varied. In this last experiment, speedup increases considerably when more threads are used; however, it levels off for 16 or more threads.
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.

PubMed

Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo

2016-07-19

Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

NASA Astrophysics Data System (ADS)

Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

2017-10-01

In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

PubMed

Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

2018-03-01

Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.
Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization

NASA Astrophysics Data System (ADS)

Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo

2011-03-01

We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.
Playback system designed for X-Band SAR

NASA Astrophysics Data System (ADS)

Yuquan, Liu; Changyong, Dou

2014-03-01

Synthetic Aperture Radar (SAR) is widely used because it is independent of daylight and weather. In particular, the X-Band SAR strip map system designed by the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, provides high ground resolution images together with large spatial coverage and a short acquisition time, which makes it promising for many applications. When a sudden disaster occurs, radar signal data and images must be acquired as soon as possible so that action can be taken promptly to reduce losses and save lives. This paper summarizes an X-Band SAR playback processing system designed for disaster response and scientific needs. It describes the SAR data workflow, including the payload data transmission and reception process. The playback processing system performs signal analysis on the original data and provides SAR level 0 products and quick images. A gigabit network ensures efficient radar signal transmission from the recorder to the calculation unit. Multi-thread parallel computing and ping-pong operation ensure computation speed. Together, the gigabit network, multi-thread parallel computing and ping-pong operation allow high-speed data transmission and processing to meet the real-time requirement of SAR data playback.
Potential Application of a Graphical Processing Unit to Parallel Computations in the NUBEAM Code

NASA Astrophysics Data System (ADS)

Payne, J.; McCune, D.; Prater, R.

2010-11-01

NUBEAM is a comprehensive computational Monte Carlo based model for neutral beam injection (NBI) in tokamaks. NUBEAM computes NBI-relevant profiles in tokamak plasmas by tracking the deposition and the slowing of fast ions. At the core of NUBEAM are vector calculations used to track fast ions. These calculations have recently been parallelized to run on MPI clusters. However, cost and interlink bandwidth limit the ability to fully parallelize NUBEAM on an MPI cluster. Recent implementation of double precision capabilities for Graphical Processing Units (GPUs) presents a cost effective and high performance alternative or complement to MPI computation. Commercially available graphics cards can achieve up to 672 GFLOPS double precision and can handle hundreds of thousands of threads. The ability to execute at least one thread per particle simultaneously could significantly reduce the execution time and the statistical noise of NUBEAM. Progress on implementation on a GPU will be presented.
GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

NASA Astrophysics Data System (ADS)

Srinivasa, K. G.; Shree Devi, B. N.

2017-10-01

String searching in documents has become a tedious task with the evolution of Big Data. The generation of large data sets demands high performance search algorithms in areas such as text mining, information retrieval and many others. The popularity of GPUs for general purpose computing has been increasing for various applications. Therefore it is of great interest to exploit the thread feature of a GPU to provide a high performance search algorithm. This paper proposes an optimized new approach to the N-gram model for string search in a number of lengthy documents and its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents, employing character-level N-gram matching with a parallel Score Table approach and search using the CUDA API. The new Score Table approach, used for frequency storage of N-grams in a document, makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive thread feature in a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in a huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA's GeForce GT 540M GPU with 96 cores. Results show that the parallel approach for Score Table creation and searching gives a good speedup over the same approach executed serially.
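The idea of a per-document trigram frequency table can be sketched in Python as follows. The abstract does not give the exact scoring rule, so the fraction-of-matching-trigrams score used here is an assumption, as are all function names; a thread pool stands in for the GPU threads that build the tables in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def trigrams(text):
    """All overlapping character trigrams of the text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_score_table(document):
    """Frequency table of trigrams; built once per document."""
    return Counter(trigrams(document))


def match_score(table, pattern):
    """Assumed score: fraction of the pattern's trigrams present in the table."""
    grams = trigrams(pattern)
    if not grams:
        return 0.0
    return sum(1 for g in grams if table[g] > 0) / len(grams)


def search(documents, pattern, workers=4):
    """Build the tables in parallel, then score the pattern against each one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tables = list(pool.map(build_score_table, documents))
    return [match_score(table, pattern) for table in tables]


if __name__ == "__main__":
    print(search(["lorem ipsum dolor", "sit amet"], "ipsum"))
```

Once a table exists, scoring a pattern costs one lookup per pattern trigram, which is why the search time no longer depends on the document's length.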
Three dimensional simulations of viscous folding in diverging microchannels

NASA Astrophysics Data System (ADS)

Xu, Bingrui; Chergui, Jalel; Shin, Seungwon; Juric, Damir

2016-11-01

Three dimensional simulations on the viscous folding in diverging microchannels reported by Cubaud and Mason are performed using the parallel code BLUE for multi-phase flows. The more viscous liquid L1 is injected into the channel from the center inlet, and the less viscous liquid L2 from two side inlets. Liquid L1 takes the form of a thin filament due to hydrodynamic focusing in the long channel that leads to the diverging region. The thread then becomes unstable to a folding instability, due to the longitudinal compressive stress applied to it by the diverging flow of liquid L2. We performed a parameter study in which the flow rate ratio, the viscosity ratio, the Reynolds number, and the shape of the channel were varied relative to a reference model. In our simulations, the cross section of the thread produced by focusing is elliptical rather than circular. The initial folding axis can be either parallel or perpendicular to the narrow dimension of the chamber. In the former case, the folding slowly transforms via twisting to perpendicular folding, or it may remain parallel. The direction of folding onset is determined by the velocity profile and the elliptical shape of the thread cross section in the channel that feeds the diverging part of the cell.
On Designing Lightweight Threads for Substrate Software

NASA Technical Reports Server (NTRS)

Haines, Matthew

1997-01-01

Existing user-level thread packages employ a 'black box' design approach, where the implementation of the threads is hidden from the user. While this approach is often sufficient for application-level programmers, it hides critical design decisions that system-level programmers must be able to change in order to provide efficient service for high-level systems. By applying the principles of Open Implementation Analysis and Design, we construct a new user-level threads package that supports common thread abstractions and a well-defined meta-interface for altering the behavior of these abstractions. As a result, system-level programmers will have the advantages of using high-level thread abstractions without having to sacrifice performance, flexibility or portability.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms

PubMed Central

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.

2011-01-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.

PubMed

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W

2012-06-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.

Gregarious Data Re-structuring in a Many Core Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shrestha, Sunil; Manzano Franco, Joseph B.; Marquez, Andres

In this paper, we have developed a new methodology that takes into consideration the access patterns from a single parallel actor (e.g. a thread), as well as the access patterns of “grouped” parallel actors that share a resource (e.g. a distributed Level 3 cache). We start with a hierarchical tile code for our target machine and apply a series of transformations at the tile level to improve data residence in a given memory hierarchy level. The contributions of this paper include (a) collaborative data restructuring for group reuse and (b) a low overhead transformation technique to improve access patterns and bring closely connected data elements together. Preliminary results on a many core architecture, Tilera TileGX, show promising improvements over optimized OpenMP code (up to 31% increase in GFLOPS) and over our own previous work on fine grained runtimes (up to 16%) for selected kernels.
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

NASA Astrophysics Data System (ADS)

Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin

2016-08-01

This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that the spin-level implementation is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance scales well under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
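The mid-point insertion strategy can be sketched as below. The abstract does not state how a "compromised" gap is detected or how replicas are assigned to GPUs, so the fixed exchange-rate threshold and the round-robin assignment are assumptions made for illustration, and the function names are hypothetical.

```python
def insert_midpoints(temperatures, exchange_rates, threshold):
    """Add a mid-point temperature in every gap whose exchange rate is too low.

    temperatures: sorted list of replica temperatures.
    exchange_rates: measured exchange rate for each adjacent pair.
    threshold: assumed cut-off below which a gap counts as a bottleneck.
    """
    if len(exchange_rates) != len(temperatures) - 1:
        raise ValueError("need one exchange rate per adjacent temperature pair")
    refined = [temperatures[0]]
    for i, rate in enumerate(exchange_rates):
        low, high = temperatures[i], temperatures[i + 1]
        if rate < threshold:
            refined.append((low + high) / 2.0)  # new replica between the pair
        refined.append(high)
    return refined


def balance(replicas, n_gpus):
    """Assumed round-robin split so new replicas spread over all GPUs."""
    return [replicas[g::n_gpus] for g in range(n_gpus)]


if __name__ == "__main__":
    temps = insert_midpoints([1.0, 2.0, 4.0], [0.5, 0.1], threshold=0.2)
    print(temps, balance(temps, 2))
```

Repeating the measurement and insertion until no gap falls below the threshold gives an adaptive temperature set; distributing replicas by index rather than by insertion position keeps the per-GPU load even.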
SU-E-T-531: Performance Evaluation of Multithreaded Geant4 for Proton Therapy Dose Calculations in a High Performance Computing Facility

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shin, J; Coss, D; McMurry, J

Purpose: To evaluate the efficiency of multithreaded Geant4 (Geant4-MT, version 10.0) for proton Monte Carlo dose calculations using a high performance computing facility. Methods: Geant4-MT was used to calculate 3D dose distributions in 1×1×1 mm³ voxels in a water phantom and patient's head with a 150 MeV proton beam covering approximately 5×5 cm² in the water phantom. Three timestamps were measured on the fly to separately analyze the required time for initialization (which cannot be parallelized), processing time of individual threads, and completion time. Scalability of averaged processing time per thread was calculated as a function of thread number (1, 100, 150, and 200) for both 1 M and 50 M histories. The total memory usage was recorded. Results: Simulations with 50 M histories were fastest with 100 threads, taking approximately 1.3 hours and 6 hours for the water phantom and the CT data, respectively with better than 1.0 % statistical uncertainty. The calculations show 1/N scalability in the event loops for both cases. The gains from parallel calculations started to decrease with 150 threads. The memory usage increases linearly with number of threads. No critical failures were observed during the simulations. Conclusion: Multithreading in Geant4-MT decreased simulation time in proton dose distribution calculations by a factor of 64 and 54 at a near optimal 100 threads for water phantom and patient's data respectively. Further simulations will be done to determine the efficiency at the optimal thread number. Considering the trend of computer architecture development, utilizing Geant4-MT for radiotherapy simulations is an excellent cost-effective alternative for a distributed batch queuing system. However, because the scalability depends highly on simulation details, i.e., the ratio of the processing time of one event versus waiting time to access for the shared event queue, a performance evaluation as described is recommended.
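One way to read the reported speedups is through Amdahl's law, which is an interpretive model assumed here and not an analysis taken from the abstract. If a fraction f of the single-thread run time cannot be parallelized (for example the initialization), the speedup on N threads is 1 / (f + (1 - f) / N). The sketch below inverts that relation for the reported factors of 64 and 54 at 100 threads.

```python
def speedup(n_threads, serial_fraction):
    """Amdahl's-law speedup for a given non-parallelizable fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


def implied_serial_fraction(observed_speedup, n_threads):
    """Serial fraction that would produce the observed speedup under Amdahl's law."""
    return (1.0 / observed_speedup - 1.0 / n_threads) / (1.0 - 1.0 / n_threads)


if __name__ == "__main__":
    for label, factor in (("water phantom", 64.0), ("patient CT", 54.0)):
        f = implied_serial_fraction(factor, 100)
        print(f"{label}: implied serial fraction = {f:.4f}")
```

Under this assumed model the two cases correspond to serial fractions of roughly 0.6% and 0.9%. The model ignores contention on the shared event queue, which the abstract names as a factor in scalability, so these figures are only a rough reading.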
The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service Using GPGPU

NASA Astrophysics Data System (ADS)

Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.

2013-08-01

Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA).
NDL-v2.0: A new version of the numerical differentiation library for parallel architectures

NASA Astrophysics Data System (ADS)

Hadjidoukas, P. E.; Angelikopoulos, P.; Voglis, C.; Papageorgiou, D. G.; Lagaris, I. E.

2014-07-01

We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. Catalog identifier: AEDG_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDG_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 63036 No. of bytes in distributed program, including test data, etc.: 801872 Distribution format: tar.gz Programming language: ANSI Fortran-77, ANSI C, Python. Computer: Distributed systems (clusters), shared memory systems. Operating system: Linux, Unix. Has the code been vectorized or parallelized?: Yes. RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user. Classification: 4.9, 4.14, 6.5. Catalog identifier of previous version: AEDG_v1_0 Journal reference of previous version: Comput. Phys. Comm. 
180(2009)1404 Does the new version supersede the previous version?: Yes Nature of problem: The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, and sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems. Solution method: Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries. Reasons for new version: The updated version was motivated by our endeavors to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemannian Manifold Hamiltonian MC [3]. The function evaluations are simulations with significant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. 
In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: (a) higher execution times if the number of available processors is lower than the total number of tasks, and (b) significant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronizations, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it relies exclusively on the POSIX-Threads and MPI programming interfaces. 
It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems. Summary of revisions: The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support the asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added. Restrictions: The library uses only double precision arithmetic. The MPI implementation assumes the homogeneity of the execution environment provided by the operating system. Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application. Unusual features: The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed. Running time: Running time depends on the function's complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10 ms. References: [1] P. Angelikopoulos, C. Paradimitriou, P. 
Koumoutsakos, Bayesian uncertainty quantification and propagation in molecular dynamics simulations: a high performance computing framework, J. Chem. Phys 137 (14). [2] H.P. Flath, L.C. Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432. [3] M. Girolami, B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214. [4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816. [5] P.E. Hadjidoukas, E. Lappas, V.V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236. [6] C. Voglis, P.E. Hadjidoukas, D.G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492. [7] P.E. Hadjidoukas, C. Voglis, V.V. Dimakopoulos, I. Lagaris, D.G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.
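The solution method in the record above (finite differencing with a step that balances truncation and round-off error, and independent function evaluations run as tasks) can be sketched in Python. The step rule h = eps^(1/3) * max(|x_i|, 1) is the common textbook choice for central differences and is an assumption here; the record does not give NDL's actual formula, and the function names below are hypothetical, not part of the NDL interface.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def central_difference(f, x, i):
    """Central-difference estimate of the partial derivative of f at x along axis i."""
    eps = sys.float_info.epsilon
    # Assumed textbook step: balances O(h^2) truncation against O(eps/h) round-off.
    h = eps ** (1.0 / 3.0) * max(abs(x[i]), 1.0)
    forward = list(x)
    forward[i] += h
    backward = list(x)
    backward[i] -= h
    return (f(forward) - f(backward)) / (2.0 * h)


def gradient(f, x, workers=4):
    """Evaluate each partial derivative as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: central_difference(f, x, i), range(len(x))))


if __name__ == "__main__":
    print(gradient(lambda v: v[0] ** 2 + 3.0 * v[1], [2.0, 5.0]))
```

Treating every function evaluation as its own task, with no barrier between groups of evaluations, mirrors the motivation given for moving from PARALLEL DO regions to OpenMP tasks; the thread pool here is only a stand-in for that tasking runtime.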
Community Detection on the GPU

DOE Office of Scientific and Technical Information (OSTI.GOV)

Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh

We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.
Support of Multidimensional Parallelism in the OpenMP Programming Model

NASA Technical Reports Server (NTRS)

Jin, Hao-Qiang; Jost, Gabriele

2003-01-01

OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test with benchmark codes and a cloud modeling code.
Accelerating the Gillespie Exact Stochastic Simulation Algorithm using hybrid parallel execution on graphics processing units.

PubMed

Komarov, Ivan; D'Souza, Roshan M

2012-01-01

The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
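For orientation, the serial direct method that the paper accelerates can be sketched in a few lines of Python. The reversible reaction A <-> B, the rate constants and the function names are illustrative assumptions; the warp-level parallelization of a single realization is not reproduced here, only the coarse-grained idea of running independent trajectories concurrently.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate(a, b, kf, kr, t_end, seed):
    """Gillespie direct method for A <-> B; returns final (A, B) counts."""
    rng = random.Random(seed)  # one generator per trajectory
    t = 0.0
    while True:
        prop_f = kf * a  # propensity of A -> B
        prop_r = kr * b  # propensity of B -> A
        total = prop_f + prop_r
        if total == 0.0:
            break
        t += rng.expovariate(total)  # exponentially distributed waiting time
        if t > t_end:
            break
        if rng.random() * total < prop_f:  # choose which reaction fires
            a, b = a - 1, b + 1
        else:
            a, b = a + 1, b - 1
    return a, b


def run_ensemble(n_runs, a0, b0, kf, kr, t_end, workers=4):
    """Independent trajectories, one task per seed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: simulate(a0, b0, kf, kr, t_end, s), range(n_runs)))


if __name__ == "__main__":
    print(run_ensemble(4, 1000, 0, 1.0, 1.0, 10.0))
```

Each trajectory needs many sequential steps with little arithmetic per step, which is consistent with the abstract's remark that memory traffic, not computation, is the bottleneck.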
XMOS XC-2 Development Board for Mechanical Control and Data Collection

NASA Technical Reports Server (NTRS)

Jarnot, Robert F.; Bowden, William J.

2011-01-01

The scanning microwave limb sounder (SMLS) will use technological improvements in low-noise mixers to provide precise data on the Earth's atmospheric composition with high spatial resolution. This project focuses on the design and implementation of a real-time control system needed for airborne engineering tests of the SMLS. The system must coordinate the actuation of optical components using four motors with encoder readback, while collecting synchronized telemetric data from a GPS receiver and 3-axis gyrometric system. A graphical user interface for testing the control system was also designed using Python. Although the system could have been implemented with an FPGA (field-programmable gate array)-based setup, a processor development kit manufactured by XMOS was chosen. The XMOS architecture allows parallel execution of multiple tasks on separate threads, making it ideal for this application. It is easily programmed using XC (a subset of C). The necessary communication interfaces were implemented in software, including Ethernet, with significant cost and time reduction compared to an FPGA-based approach. A simple approach to control the chopper, calibration mirror, and gimbal for the airborne SMLS was needed. The XMOS board allows for multiple threads and real-time data acquisition. The XC-2 development kit is an attractive choice for synchronized, real-time, event-driven applications. The XMOS is based on the transputer microprocessor architecture developed for parallel computing, which is being revamped in this new platform. The XMOS device has multiple cores capable of running parallel applications on separate threads. The threads communicate with each other via user-defined channels capable of transmitting data within the device. XMOS provides a C-based development environment using XC, which eliminates the need for custom tool kits associated with FPGA programming. The XC-2 has four cores and necessary hardware for Ethernet I/O.
Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Scherrer, Chad; Halappanavar, Mahantesh; Tewari, Ambuj

2012-07-03

We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases---Thread-Greedy CD and Coloring-Based CD---and give performance measurements for an OpenMP implementation of these.
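As a point of reference for the sequential baseline named in this abstract, the following is a minimal sketch of Cyclic CD for an ℓ1-regularized least-squares problem. The objective form (0.5·||y − Xw||² + lam·||w||₁), the function names, and the plain-list data layout are assumptions made for illustration; the abstract does not give the framework's actual formulation, and the parallel variants (Shotgun, Thread-Greedy CD, Coloring-Based CD) are not shown.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the proximal step for the l1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def cyclic_cd_lasso(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1.

    X is a list of rows, y a list of targets (illustrative layout only).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):  # visit coordinates in a fixed cyclic order
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n))
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w
```

A parallel variant would update several coordinates per step instead of one; how those coordinates are chosen and how conflicting updates are avoided is exactly what distinguishes the algorithms listed in the abstract.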
Locality Aware Concurrent Start for Stencil Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shrestha, Sunil; Gao, Guang R.; Manzano Franco, Joseph B.

Stencil computations are at the heart of many physical simulations used in scientific codes. Thus, there exists a plethora of optimization efforts for this family of computations. Among these techniques, tiling techniques that allow concurrent start have proven to be very efficient in providing better performance for these critical kernels. Nevertheless, with many-core designs being the norm, these optimization techniques might not be able to fully exploit locality (both spatial and temporal) on multiple levels of the memory hierarchy without compromising parallelism. It is no longer true that the machine can be seen as a homogeneous collection of nodes with caches, main memory and an interconnect network. New architectural designs exhibit complex grouping of nodes, cores, threads, caches and memory connected by an ever evolving network-on-chip design. These new designs may benefit greatly from carefully crafted schedules and groupings that encourage parallel actors (i.e. threads, cores or nodes) to be aware of the computational history of other actors in close proximity. In this paper, we provide an efficient tiling technique that allows hierarchical concurrent start for memory hierarchy aware tile groups. Each execution schedule and tile shape exploit the available parallelism, load balance and locality present in the given applications. We demonstrate our technique on the Intel Xeon Phi architecture with selected and representative stencil kernels. We show improvement ranging from 5.58% to 31.17% over existing state-of-the-art techniques.
Optimizing Excited-State Electronic-Structure Codes for Intel Knights Landing: A Case Study on the BerkeleyGW Software

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek

2016-10-06

We profile and optimize calculations performed with the BerkeleyGW code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights-Landing including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.
Adapter plate assembly for adjustable mounting of objects

DOEpatents

Blackburn, R.S.

1986-05-02

An adapter plate and two locking discs are together affixed to an optic table with machine screws or bolts threaded into a fixed array of internally threaded holes provided in the table surface. The adapter plate preferably has two, and preferably parallel, elongated locating slots each freely receiving a portion of one of the locking discs for secure affixation of the adapter plate to the optic table. A plurality of threaded apertures provided in the adapter plate are available to attach optical mounts or other devices onto the adapter plate in an orientation not limited by the disposition of the array of threaded holes in the table surface. An axially aligned but radially offset hole through each locking disc receives a screw that tightens onto the table, such that prior to tightening of the screw the locking disc may rotate and translate within each locating slot of the adapter plate for maximum flexibility of the orientation thereof.
Adapter plate assembly for adjustable mounting of objects

DOEpatents

Blackburn, Robert S.

1987-01-01

An adapter plate and two locking discs are together affixed to an optic table with machine screws or bolts threaded into a fixed array of internally threaded holes provided in the table surface. The adapter plate preferably has two, and preferably parallel, elongated locating slots each freely receiving a portion of one of the locking discs for secure affixation of the adapter plate to the optic table. A plurality of threaded apertures provided in the adapter plate are available to attach optical mounts or other devices onto the adapter plate in an orientation not limited by the disposition of the array of threaded holes in the table surface. An axially aligned but radially offset hole through each locking disc receives a screw that tightens onto the table, such that prior to tightening of the screw the locking disc may rotate and translate within each locating slot of the adapter plate for maximum flexibility of the orientation thereof.
A Massively Parallel Computational Method of Reading Index Files for SOAPsnv.

PubMed

Zhu, Xiaoqian; Peng, Shaoliang; Liu, Shaojie; Cui, Yingbo; Gu, Xiang; Gao, Ming; Fang, Lin; Fang, Xiaodong

2015-12-01

SOAPsnv is the software used for identifying single nucleotide variation in cancer genes. However, its performance is yet to match the massive amount of data to be processed. Experiments reveal that the main performance bottleneck of the SOAPsnv software is the pileup algorithm. The original pileup algorithm's I/O process is time-consuming and inefficient at reading input files. Moreover, the scalability of the pileup algorithm is also poor. Therefore, we designed a new algorithm, named BamPileup, aiming to improve the performance of sequential read, and the new pileup algorithm implemented a parallel read mode based on an index. Using this method, each thread can directly read the data starting from a specific position. The results of experiments on the Tianhe-2 supercomputer show that, when reading data in a multi-threaded parallel I/O way, the processing time of the algorithm is reduced to 3.9 s and the application program can achieve a speedup of up to 100×. Moreover, the scalability of the new algorithm is also satisfying.
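The index-based parallel read described in this abstract can be illustrated roughly as follows. This is only a sketch under stated assumptions: the index here is a hypothetical list of (start, length) byte ranges over an ordinary file, not the BAM index that BamPileup actually uses, and `build_even_index`, `read_chunk`, and `parallel_read` are names invented for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def build_even_index(path, parts):
    """Hypothetical index: split the file into roughly equal byte ranges."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = -(-size // parts)  # ceiling division
    return [(off, min(step, size - off)) for off in range(0, size, step)]


def read_chunk(path, start, length):
    """Each worker opens its own handle, so seeks never interfere."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(length)


def parallel_read(path, index, workers=4):
    """Read every indexed range concurrently; results keep index order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_chunk, path, s, n) for s, n in index]
        return [f.result() for f in futures]
```

Giving each thread a private file handle and a precomputed start position is what removes the need to scan the file sequentially; a real genomic index would additionally align the ranges to record boundaries.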
MrBayes tgMC3++: A High Performance and Resource-Efficient GPU-Oriented Phylogenetic Analysis Method.

PubMed

Ling, Cheng; Hamada, Tsuyoshi; Gao, Jingyang; Zhao, Guoguang; Sun, Donghong; Shi, Weifeng

2016-01-01

MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the computational cost on the likelihood estimation is very expensive, resulting in undesirably long execution time. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved on the analysis of simulated datasets by four Tesla K40 cards. In comparison to the other publicly available GPU-oriented MrBayes, the tgMC3++ method (proposed herein) outperforms the tgMC3 (v1.0), nMC3 (v2.1.1) and oMC3 (v1.00) methods by speedup factors of up to 1.6, 1.9 and 2.9, respectively. Moreover, tgMC3++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods fail to take into analysis.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Shipman, Galen M.

These are the slides for a presentation on programming models in HPC, at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's Taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (Intel Xeon-Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definition of a programming model, programming languages, runtime systems; programming model and environments; MPI (Message Passing Interface); OpenMP; Kokkos (Performance Portable Thread-Parallel Programming Model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; overview of the Legion Programming Model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; workflow, integration of external resources into the programming model.
CCC7-119 Reactive Molecular Dynamics Simulations of Hot Spot Growth in Shocked Energetic Materials

DOE Office of Scientific and Technical Information (OSTI.GOV)

Thompson, Aidan P.

2015-03-01

The purpose of this work is to understand how defects control initiation in energetic materials used in stockpile components. Sequoia gives us the core count to run very large-scale simulations of up to 10 million atoms. Using an OpenMP-threaded implementation of the ReaxFF package in LAMMPS, we have been able to get good parallel efficiency running on 16k nodes of Sequoia, with 1 hardware thread per core.
A Tutorial on Parallel and Concurrent Programming in Haskell

NASA Astrophysics Data System (ADS)

Peyton Jones, Simon; Singh, Satnam

This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Rizk, Yehia M.

1999-01-01

The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is more complicated than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each group is approximately balanced. A proper number of threads is initially allocated to each group, and in subsequent iterations during the run-time, the number of threads is adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in OVERFLOW.
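The grouping step mentioned in this abstract, in which zones are distributed so that each group holds a similar number of grid points, could be approximated by a greedy largest-first heuristic such as the one below. This is an illustrative assumption rather than the strategy used in the _MLP code, which the abstract does not specify; `group_zones` and its arguments are hypothetical names.

```python
import heapq


def group_zones(zone_sizes, n_groups):
    """Assign zones to groups, always giving the next-largest zone
    to the group that currently holds the fewest grid points."""
    heap = [(0, g) for g in range(n_groups)]  # (current load, group id)
    groups = [[] for _ in range(n_groups)]
    by_size = sorted(enumerate(zone_sizes), key=lambda item: -item[1])
    for zone, size in by_size:
        load, g = heapq.heappop(heap)
        groups[g].append(zone)
        heapq.heappush(heap, (load + size, g))
    return groups
```

For zone sizes 8, 7, 6, 5 and 4 split into two groups, this heuristic produces loads of 17 and 13 grid-point units; adjusting the thread count per group at run time, as the abstract describes, would then compensate for the remaining imbalance.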

Software Defined Radio with Parallelized Software Architecture

NASA Technical Reports Server (NTRS)

Heckler, Greg

2013-01-01

This software implements software-defined radio processing over multi-core, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approximately 50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
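The architecture described above, with one thread per processing block and pipes as the buffers between blocks, can be sketched in a few lines. The original software is written in C/C++; the Python version below only illustrates the pattern on a POSIX system, restricted to a linear chain of blocks, and the names `run_block` and `run_flow_graph` are hypothetical.

```python
import os
import threading


def run_block(transform, in_fd, out_fd):
    """One threaded block: read from its input pipe, transform each
    chunk, and write to its output pipe until the input reaches EOF."""
    with os.fdopen(in_fd, "rb") as src, os.fdopen(out_fd, "wb") as dst:
        while True:
            chunk = src.read(4096)
            if not chunk:
                break
            dst.write(transform(chunk))


def _feed(fd, payload):
    """Source: push the payload into the first pipe, then close it."""
    with os.fdopen(fd, "wb") as feed:
        feed.write(payload)


def run_flow_graph(transforms, payload):
    """Chain the transforms linearly, each in its own thread."""
    in_fd, feed_fd = os.pipe()
    threads = [threading.Thread(target=_feed, args=(feed_fd, payload))]
    for transform in transforms:
        next_read, out_fd = os.pipe()
        threads.append(
            threading.Thread(target=run_block, args=(transform, in_fd, out_fd))
        )
        in_fd = next_read
    for t in threads:
        t.start()
    with os.fdopen(in_fd, "rb") as sink:  # drain the last pipe
        result = sink.read()
    for t in threads:
        t.join()
    return result
```

Because each block only sees file descriptors, the same graph runs unchanged on one core or many; the operating system scheduler decides where the threads execute, which is consistent with the abstract's claim of scaling without recompilation.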
Using all of your CPU's in HIPE

NASA Astrophysics Data System (ADS)

Jacobson, J. D.; Fadda, D.

2012-09-01

Modern computer architectures increasingly feature multi-core CPUs. For example, the MacBook Pro features Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and Java, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the unchopped line scan mode has been threaded. This computation-intensive task uses either a one-parameter or a three-parameter exponential function to characterize the transient. The task uses a Java implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the Java concurrency package's completion service, is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
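Amdahl's Law, which the abstract uses to decide whether a task is worth threading, bounds the speedup by the fraction of the run time that can actually execute in parallel. A minimal sketch of that estimate follows; the function name and argument names are illustrative, not taken from HIPE.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when `parallel_fraction` of the run time
    is spread over `workers` threads and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

With the eight hardware threads mentioned in the abstract, a task that is 50% parallelizable cannot exceed a speedup of about 1.78, whereas a fully parallel task could in principle reach 8; this is why the serial share should be measured before any threading effort.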
Cable-type supercapacitors of three-dimensional cotton thread based multi-grade nanostructures for wearable energy storage.

PubMed

Liu, Nishuang; Ma, Wenzhen; Tao, Jiayou; Zhang, Xianghui; Su, Jun; Li, Luying; Yang, Congxing; Gao, Yihua; Golberg, Dmitri; Bando, Yoshio

2013-09-20

A novel cable-type flexible supercapacitor with excellent performance is fabricated using 3D polypyrrole (PPy)-MnO2-CNT-cotton thread multi-grade nanostructure-based electrodes. Multiple supercapacitors with a high areal capacitance of 1.49 F cm⁻² at a scan rate of 1 mV s⁻¹, connected in series and in parallel, can successfully drive an LED segment display. Such an excellent performance is attributed to the cumulative effect of conducting single-walled carbon nanotubes on cotton thread, active mesoporous flower-like MnO2 nanoplates, and a PPy conductive wrapping layer that improves the conductivity and simultaneously acts as a pseudocapacitive material. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Mapping virtual addresses to different physical addresses for value disambiguation for thread memory access requests

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gala, Alan; Ohmacht, Martin

A multiprocessor system includes nodes. Each node includes a data path that includes a core, a TLB, and a first level cache implementing disambiguation. The system also includes at least one second level cache and a main memory. For thread memory access requests, the core uses an address associated with an instruction format of the core. The first level cache uses an address format related to the size of the main memory plus an offset corresponding to hardware thread meta data. The second level cache uses a physical main memory address plus software thread meta data to store the memory access request. The second level cache accesses the main memory using the physical address with neither the offset nor the thread meta data after resolving speculation. In short, this system includes mapping of a virtual address to different physical addresses for value disambiguation for different threads.
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

NASA Astrophysics Data System (ADS)

Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

2018-02-01

We consider a standing wave simulation that is interesting from a computational point of view, obtained by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which exploits both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy-equivalent Intel architectures: 2× Xeon E5-2695 v2 processors (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and a Xeon Phi 7250 processor (code-named "Knights Landing", KNL). The results show 2 times better performance on the KNL processor.
OpenMP performance for benchmark 2D shallow water equations using LBM

NASA Astrophysics Data System (ADS)

Sabri, Khairul; Rabbani, Hasbi; Gunawan, Putu Harry

2018-03-01

Shallow water equations, commonly referred to as Saint-Venant equations, are used to model fluid phenomena. These equations can be solved numerically using several methods, such as the Lattice Boltzmann method (LBM), SIMPLE-like methods, the finite difference method, Godunov-type methods, and the finite volume method. In this paper, the shallow water equations are approximated using LBM, known as LABSWE, and the simulation's performance is evaluated under parallel programming with OpenMP. To compare the performance of the 2-thread and 4-thread parallel algorithms, ten different grid sizes Lx and Ly are examined. The results show that using the OpenMP platform, the computational time for solving LABSWE can be decreased. For instance, using a grid size of 1000 × 500, the speedup of 2 and 4 threads is observed as 93.54 s and 333.243 s, respectively.
High-Performance, Multi-Node File Copies and Checksums for Clustered File Systems

NASA Technical Reports Server (NTRS)

Kolano, Paul Z.; Ciotti, Robert B.

2012-01-01

Modern parallel file systems achieve high performance using a variety of techniques, such as striping files across multiple disks to increase aggregate I/O bandwidth and spreading disks across multiple servers to increase aggregate interconnect bandwidth. To achieve peak performance from such systems, it is typically necessary to utilize multiple concurrent readers/writers from multiple systems to overcome various single-system limitations, such as number of processors and network bandwidth. The standard cp and md5sum tools of GNU coreutils found on every modern Unix/Linux system, however, utilize a single execution thread on a single CPU core of a single system, and hence cannot take full advantage of the increased performance of clustered file systems. Mcp and msum are drop-in replacements for the standard cp and md5sum programs that utilize multiple types of parallelism and other optimizations to achieve maximum copy and checksum performance on clustered file systems. Multi-threading is used to ensure that nodes are kept as busy as possible. Read/write parallelism allows individual operations of a single copy to be overlapped using asynchronous I/O. Multi-node cooperation allows different nodes to take part in the same copy/checksum. Split-file processing allows multiple threads to operate concurrently on the same file. Finally, hash trees allow inherently serial checksums to be performed in parallel. Mcp and msum provide significant performance improvements over standard cp and md5sum using multiple types of parallelism and other optimizations. The total speed-ups from all improvements are significant. Mcp improves cp performance over 27x, msum improves md5sum performance almost 19x, and the combination of mcp and msum improves verified copies via cp and md5sum by almost 22x. These improvements come in the form of drop-in replacements for cp and md5sum, so are easily used and are available for download as open source software at http://mutil.sourceforge.net.
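The hash-tree idea mentioned above can be sketched as follows: the data is cut into fixed-size blocks, the blocks are hashed independently (and therefore in parallel), and the block digests are then combined pairwise up to a single root. The block size, the pairing rule, and the function names are assumptions for illustration; the abstract does not describe msum's actual tree layout, and the resulting value differs from a plain md5sum of the same data whenever more than one block is involved.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def leaf_digest(block):
    """Hash one block independently of all others."""
    return hashlib.md5(block).digest()


def tree_checksum(data, block_size=1 << 20, workers=4):
    """Hash fixed-size blocks in parallel, then fold the digests
    pairwise until a single root digest remains."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        blocks = [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = list(pool.map(leaf_digest, blocks))
    while len(level) > 1:
        level = [
            hashlib.md5(b"".join(level[i:i + 2])).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

The root depends only on the data and the block size, not on the number of workers, so two nodes that hash different halves of a file can still agree on the final checksum.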
PRIMA-X Final Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lorenz, Daniel; Wolf, Felix

2016-02-17

The PRIMA-X (Performance Retargeting of Instrumentation, Measurement, and Analysis Technologies for Exascale Computing) project is the successor of the DOE PRIMA (Performance Refactoring of Instrumentation, Measurement, and Analysis Technologies for Petascale Computing) project, which addressed the challenge of creating a core measurement infrastructure that would serve as a common platform for both integrating leading parallel performance systems (notably TAU and Scalasca) and developing next-generation scalable performance tools. The PRIMA-X project shifts the focus away from refactorization of robust performance tools towards a re-targeting of the parallel performance measurement and analysis architecture for extreme scales. The massive concurrency, asynchronous execution dynamics, hardware heterogeneity, and multi-objective prerequisites (performance, power, resilience) that identify exascale systems introduce fundamental constraints on the ability to carry forward existing performance methodologies. In particular, there must be a deemphasis of per-thread observation techniques to significantly reduce the otherwise unsustainable flood of redundant performance data. Instead, it will be necessary to assimilate multi-level resource observations into macroscopic performance views, from which resilient performance metrics can be attributed to the computational features of the application. This requires a scalable framework for node-level and system-wide monitoring and runtime analyses of dynamic performance information. Also, the interest in optimizing parallelism parameters with respect to performance and energy drives the integration of tool capabilities in the exascale environment further. Initially, PRIMA-X was a collaborative project between the University of Oregon (lead institution) and the German Research School for Simulation Sciences (GRS). Because Prof. Wolf, the PI at GRS, accepted a position as full professor at Technische Universität Darmstadt (TU Darmstadt) starting February 1st, 2015, the project ended at GRS on January 31st, 2015. This report reflects the work accomplished at GRS until then. The work of GRS is expected to be continued at TU Darmstadt. The first main accomplishment of GRS is the design of different thread-level aggregation techniques. We created a prototype capable of aggregating the thread-level information in performance profiles using these techniques. The next step will be the integration of the most promising techniques into the Score-P measurement system and their evaluation. The second main accomplishment is a substantial increase of Score-P’s scalability, achieved by improving the design of the system-tree representation in Score-P’s profile format. We developed a new representation and a distributed algorithm to create the scalable system tree representation. Finally, we developed a lightweight approach to MPI wait-state profiling. Former algorithms either needed piggy-backing, which can cause significant runtime overhead, or tracing, which comes with its own set of scaling challenges. Our approach works with local data only and, thus, is scalable and has very little overhead.
Microstructural homogeneity of support silk spun by Eriophora fuliginea (C.L. Koch) determined by scanning X-ray microdiffraction

NASA Astrophysics Data System (ADS)

Riekel, C.; Craig, C. L.; Burghammer, M.; Müller, M.

2001-01-01

Scanning X-ray microdiffraction (SXD) permits the 'imaging' in-situ of crystalline phases, crystallinity and texture in whole biopolymer samples on the micrometre scale. SXD complements transmission electron microscopy (TEM) techniques, which reach sub-nanometre lateral resolution but require thin sections and a vacuum environment. This is demonstrated using a support thread from a web spun by the orb-weaving spider Eriophora fuliginea (C.L. Koch). Scanning electron microscopy (SEM) shows a central thread composed of two fibres to which thinner fibres are loosely attached. SXD of a piece of support thread approximately 60 µm long shows in addition the presence of nanometre-sized crystallites with the β-poly(L-alanine) structure in all fibres. The crystallinity of the thin fibres appears to be higher than that of the central thread, which probably reflects a higher polyalanine content of the fibroins. The molecular axis of the polymer chains in the central thread is orientated parallel to the macroscopic fibre axis, but in the thin fibres the molecular axis is tilted by about 71° to the macroscopic fibre axis. A helical model is tentatively proposed to describe this morphology. The central thread has a homogeneous distribution of crystallinity along the macroscopic fibre axis.
THE THERMAL INSTABILITY OF SOLAR PROMINENCE THREADS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Soler, R.; Goossens, M.; Ballester, J. L., E-mail: roberto.soler@wis.kuleuven.be

The fine structure of solar prominences and filaments appears as thin and long threads in high-resolution images. In Hα observations of filaments, some threads can be observed for only 5-20 minutes before they seem to fade and eventually disappear, suggesting that these threads may have very short lifetimes. The presence of an instability might be the cause of this quick disappearance. Here, we study the thermal instability of prominence threads as an explanation of their sudden disappearance from Hα observations. We model a prominence thread as a magnetic tube with prominence conditions embedded in a coronal environment. We assume a variation of the physical properties in the transverse direction so that the temperature and density continuously change from internal to external values in an inhomogeneous transitional layer representing the particular prominence-corona transition region (PCTR) of the thread. We use the nonadiabatic and resistive magnetohydrodynamic equations, which include terms due to thermal conduction parallel and perpendicular to the magnetic field, radiative losses, heating, and magnetic diffusion. We combine both analytical and numerical methods to study linear perturbations from the equilibrium state, focusing on unstable thermal solutions. We find that thermal modes are unstable in the PCTR for temperatures higher than 80,000 K, approximately. These modes are related to temperature disturbances that can lead to changes in the equilibrium due to rapid plasma heating or cooling. For typical prominence parameters, the instability timescale is of the order of a few minutes and is independent of the form of the temperature profile within the PCTR of the thread. This result indicates that thermal instability may play an important role for the short lifetimes of threads in the observations.
Dynamics of threading dislocations in porous heteroepitaxial GaN films

NASA Astrophysics Data System (ADS)

Gutkin, M. Yu.; Rzhavtsev, E. A.

2017-12-01

Behavior of threading dislocations in porous heteroepitaxial gallium nitride (GaN) films has been studied using computer simulation by the two-dimensional discrete dislocation dynamics approach. A computational scheme, where pores are modeled as cross sections of cylindrical cavities, elastically interacting with unidirectional parallel edge dislocations, which imitate threading dislocations, is used. Time dependences of coordinates and velocities of each dislocation from dislocation ensembles under investigation are obtained. Visualization of current structure of dislocation ensemble is performed in the form of a location map of dislocations at any time. It has been shown that the density of appearing dislocation structures significantly depends on the ratio of area of a pore cross section to area of the simulation region. In particular, increasing the portion of pores surface on the layer surface up to 2% should lead to about a 1.5-times decrease of the final density of threading dislocations, and increase of this portion up to 15% should lead to approximately a 4.5-times decrease of it.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2014-02-11

Data communications in a parallel active messaging interface ('PAMI') of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution of a compute node, including specification of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications instruction, the instruction characterized by instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
Reader set encoding for directory of shared cache memory in multiprocessor system

DOEpatents

Ahn, Daniel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Xiaotong, Zhuang

2014-06-10

In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
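The bitset idea in the abstract above can be illustrated with a minimal sketch. It assumes one integer per cache line used as a bitset of reader thread IDs; the class and method names (`ReaderSetDirectory`, `record_read`, `conflicting_readers`) are hypothetical, and the patent's dynamic encoding and its actual conflict rules are not modeled here.

```python
class ReaderSetDirectory:
    """Toy directory: maps a cache-line address to a bitset of speculative readers."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.readers = {}  # line address -> int used as a bitset (assumption)

    def record_read(self, line, thread_id):
        # Set the bit that belongs to this thread in the line's reader set.
        self.readers[line] = self.readers.get(line, 0) | (1 << thread_id)

    def conflicting_readers(self, line, writer_id):
        # Every recorded reader other than the writer is a potential conflict.
        bits = self.readers.get(line, 0) & ~(1 << writer_id)
        return [t for t in range(self.num_threads) if (bits >> t) & 1]


directory = ReaderSetDirectory(num_threads=8)
directory.record_read(0x40, 1)
directory.record_read(0x40, 5)
directory.record_read(0x80, 2)
conflicts = directory.conflicting_readers(0x40, writer_id=5)
```

With this layout, a write by thread 5 to line 0x40 only needs one mask operation to find that thread 1 has also read the line.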
Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

PubMed

Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas

2015-01-01

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
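To make the combination of mutual information and permutation testing concrete, here is a small serial sketch. It assumes already-discretized expression values and a plain plug-in estimator, which is a simplification and not necessarily the estimator TINGe uses; the function names and the toy data are illustrative only.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def permutation_p_value(xs, ys, trials=200, seed=0):
    """Estimate how often a shuffled pairing reaches the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Toy data: two hypothetical genes with identical discretized profiles.
gene_a = [0, 0, 1, 1, 2, 2] * 10
gene_b = list(gene_a)
mi = mutual_information(gene_a, gene_b)
p_value = permutation_p_value(gene_a, gene_b)
```

Each gene pair can be scored independently of every other pair, which is what makes this kind of computation a natural fit for many cores and threads.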
Climate Modeling with a Million CPUs

NASA Astrophysics Data System (ADS)

Tobis, M.; Jackson, C. S.

2010-12-01

Michael Tobis, Ph.D., Research Scientist Associate, University of Texas Institute for Geophysics; Charles S. Jackson, Research Scientist, University of Texas Institute for Geophysics. Meteorological, oceanographic, and climatological applications have been at the forefront of scientific computing since its inception. The trend toward ever larger and more capable computing installations is unabated. However, much of the increase in capacity is accompanied by an increase in parallelism and a concomitant increase in complexity. An increase of at least four additional orders of magnitude in the computational power of scientific platforms is anticipated. It is unclear how individual climate simulations can continue to make effective use of the largest platforms. Conversion of existing community codes to higher resolution, or to more complex phenomenology, or both, presents daunting design and validation challenges. Our alternative approach is to use the expected resources to run very large ensembles of simulations of modest size, rather than to await the emergence of very large simulations. We are already doing this in exploring the parameter space of existing models using the Multiple Very Fast Simulated Annealing algorithm, which was developed for seismic imaging. Our experiments have the dual intentions of tuning the model and identifying ranges of parameter uncertainty. Our approach is less strongly constrained by the dimensionality of the parameter space than are competing methods. Nevertheless, scaling up remains costly. Much could be achieved by increasing the dimensionality of the search and adding complexity to the search algorithms. Such ensemble approaches scale naturally to very large platforms. Extensions of the approach are anticipated. For example, structurally different models can be tuned to comparable effectiveness. This can provide an objective test for which there is no realistic precedent with smaller computations.
We find ourselves inventing new code to manage our ensembles. Component computations involve tens to hundreds of CPUs and tens to hundreds of hours. The results of these moderately large parallel jobs influence the scheduling of subsequent jobs, and complex algorithms may be easily contemplated for this. The operating system concept of a "thread" re-emerges at a very coarse level, where each thread manages atomic computations of thousands of CPU-hours. That is, rather than multiple threads operating on a processor, at this level, multiple processors operate within a single thread. In collaboration with the Texas Advanced Computing Center, we are developing a software library at the system level, which should facilitate the development of computations involving complex strategies which invoke large numbers of moderately large multi-processor jobs. While this may have applications in other sciences, our key intent is to better characterize the coupled behavior of a very large set of climate model configurations.
Implementation of the NAS Parallel Benchmarks in Java

NASA Technical Reports Server (NTRS)

Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan (Technical Monitor)

2002-01-01

Several features make Java an attractive choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for CFD applications.
Performance and Scalability of the NAS Parallel Benchmarks in Java

NASA Technical Reports Server (NTRS)

Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan A. (Technical Monitor)

2002-01-01

Several features make Java an attractive choice for scientific applications. In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for scientific applications.
Robust Parallel Motion Estimation and Mapping with Stereo Cameras in Underground Infrastructure

NASA Astrophysics Data System (ADS)

Liu, Chun; Li, Zhengning; Zhou, Yuan

2016-06-01

We have developed a novel, robust motion estimation method for localization and mapping in underground infrastructure using a pre-calibrated rigid stereo camera rig. Localization and mapping in underground infrastructure is important to safety, yet it is nontrivial because most underground infrastructure has poor lighting conditions and featureless structure. To overcome these difficulties, we found that a parallel system is more efficient than the EKF-based SLAM approach, since the parallel system divides the motion estimation and 3D mapping tasks into separate threads, eliminating the data-association problem that is a significant issue in SLAM. Moreover, the motion estimation thread takes advantage of a state-of-the-art robust visual odometry algorithm that remains functional under low illumination and provides accurate pose information. We designed and built an unmanned vehicle and used it to collect a dataset in an underground garage. The parallel system was evaluated on this real dataset. Motion estimation results indicated a relative position error of 0.3%, and 3D mapping results showed a mean position error of 13 cm. Off-line processing reduced the position error to 2 cm. The evaluation showed that our system is capable of robust motion estimation and accurate 3D mapping in poorly illuminated and featureless underground environments.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks.

PubMed

Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-Hoon

2017-05-22

In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. As its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, separation between the regular pheromone and the virtual pheromone in the diffusion process and the exploration of routes, taking into consideration the number of hops in the best routes. In this work, a family of ACO routing protocols based on AntOR is also specified. These protocols are based on protocol successive refinements. In this work, we also present a parallelized version of AntOR that we call PAntOR. Using programming multiprocessor architectures based on the shared memory protocol, PAntOR allows running tasks in parallel using threads. This parallelization is applicable in the route setup phase, route local repair process and link failure notification. In addition, a variant of PAntOR that consists of having more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach parallelizes the sending of broadcast messages by interface through threads.

Development and study of a parallel algorithm of iteratively forming latent functionally-determined structures for classification and analysis of meteorological data

NASA Astrophysics Data System (ADS)

Sorokin, V. A.; Volkov, Yu V.; Sherstneva, A. I.; Botygin, I. A.

2016-11-01

This paper overviews a method of generating climate regions based on an analytic signal theory. When applied to atmospheric surface layer temperature data sets, the method allows forming climatic structures with the corresponding changes in the temperature to make conclusions on the uniformity of climate in an area and to trace the climate changes in time by analyzing the type group shifts. The algorithm is based on the fact that the frequency spectrum of the thermal oscillation process is narrow-banded and has only one mode for most weather stations. This allows using the analytic signal theory, causality conditions and introducing an oscillation phase. The annual component of the phase, being a linear function, was removed by the least squares method. The remaining phase fluctuations allow consistent studying of their coordinated behavior and timing, using the Pearson correlation coefficient for dependence evaluation. This study includes program experiments to evaluate the calculation efficiency in the phase grouping task. The paper also overviews some single-threaded and multi-threaded computing models. It is shown that the phase grouping algorithm for meteorological data can be parallelized and that a multi-threaded implementation leads to a 25-30% increase in the performance.
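The two numerical steps named in the abstract above, removing the linear (annual) phase component by least squares and then comparing the remaining fluctuations with the Pearson coefficient, can be sketched as follows. The station series are synthetic and the function names are hypothetical; this is a serial illustration under those assumptions, not the authors' implementation.

```python
import math


def remove_linear_trend(values):
    """Subtract the least-squares line fitted against the sample index."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    return [v - (v_mean + slope * (t - t_mean)) for t, v in enumerate(values)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_mean, b_mean = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic phases for two hypothetical stations: linear growth plus a shared fluctuation.
fluctuation = [math.sin(0.3 * t) for t in range(200)]
station_a = [0.017 * t + f for t, f in enumerate(fluctuation)]
station_b = [0.017 * t + 2.0 + 0.5 * f for t, f in enumerate(fluctuation)]

r = pearson(remove_linear_trend(station_a), remove_linear_trend(station_b))
```

Because each pair of stations is correlated independently, the pairwise loop is the obvious place to split work across threads.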
Threading dynamics of a polymer through parallel pores: Potential applications to DNA size separation

NASA Astrophysics Data System (ADS)

Åkerman, Björn

1997-04-01

DNA orientation measurements by linear dichroism (LD) spectroscopy and single molecule imaging by fluorescence microscopy are used to investigate the effect of DNA size (71-740 kilo base pairs) and field strength E (1-5.9 V/cm) on the conformation dynamics during the field-driven threading of DNA molecules through a set of parallel pores in agarose gels, with average pore radii between 380 Å and 1400 Å. Locally relaxed but globally oriented DNA molecules are subjected to a perpendicular field, and the observed LD time profile is compared with a recent theory for the threading [D. Long and J.-L. Viovy, Phys. Rev. E 53, 803 (1996)] which assumes the same initial state. As predicted the DNA is driven by the ends into a U-form, leading to an overshoot in the LD. The overshoot-time scales as E^{-(1.2-1.4)} as predicted, but grows more slowly with DNA size than the predicted linear dependence. For long molecules loops form initially in the threading process but are finally consumed by the ends, and the process of transfer of DNA segments, from the loops to the arms of the U, leads to a shoulder in the LD as predicted. The critical size below which loops do not form (as indicated by the LD shoulder being absent) is between 71 and 105 kbp (0.5% agarose, 5.9 V/cm), and considerably larger than predicted because in the initial state the DNA molecules are housed in gel cavities with effective pore sizes about four times larger than the average pore size. From the data, the separation of DNA by exploiting the threading dynamics in pulsed fields [D. Long et al., CR Acad. Sci. Paris, Ser. IIb 321, 239 (1995)] is shown to be feasible in principle in an agarose-based system.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.

PubMed

Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan

2013-01-01

This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU

PubMed Central

Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan

2013-01-01

This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
Argobots: A Lightweight Low-Level Threading and Tasking Framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan

In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures or are not as powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by the user or high-level programming model. We describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.
Toward Enhancing OpenMP's Work-Sharing Directives

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chapman, B M; Huang, L; Jin, H

2006-05-17

OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.
Practical Formal Verification of MPI and Thread Programs

NASA Astrophysics Data System (ADS)

Gopalakrishnan, Ganesh; Kirby, Robert M.

Large-scale simulation codes in science and engineering are written using the Message Passing Interface (MPI). Shared memory threads are widely used directly, or to implement higher level programming abstractions. Traditional debugging methods for MPI or thread programs are incapable of providing useful formal guarantees about coverage. They get bogged down in the sheer number of interleavings (schedules), often missing shallow bugs. In this tutorial we will introduce two practical formal verification tools: ISP (for MPI C programs) and Inspect (for Pthread C programs). Unlike other formal verification tools, ISP and Inspect run directly on user source codes (much like a debugger). They pursue only the relevant set of process interleavings, using our own customized Dynamic Partial Order Reduction algorithms. For a given test harness, DPOR allows these tools to guarantee the absence of deadlocks, instrumented MPI object leaks and communication races (using ISP), and shared memory races (using Inspect). ISP and Inspect have been used to verify large pieces of code: in excess of 10,000 lines of MPI/C for ISP in under 5 seconds, and about 5,000 lines of Pthread/C code in a few hours (and much faster with the use of a cluster or by exploiting special cases such as symmetry) for Inspect. We will also demonstrate the Microsoft Visual Studio and Eclipse Parallel Tools Platform integrations of ISP (these will be available on the LiveCD).
Initial Kernel Timing Using a Simple PIM Performance Model

NASA Technical Reports Server (NTRS)

Katz, Daniel S.; Block, Gary L.; Springer, Paul L.; Sterling, Thomas; Brockman, Jay B.; Callahan, David

2005-01-01

This presentation will describe some initial results of paper-and-pencil studies of 4 or 5 application kernels applied to a processor-in-memory (PIM) system roughly similar to the Cascade Lightweight Processor (LWP). The application kernels are: linked list traversal, sum of leaf nodes on a tree, bitonic sort, vector sum, and Gaussian elimination. The intent of this work is to guide and validate work on the Cascade project in the areas of compilers, simulators, and languages. We will first discuss the generic PIM structure. Then, we will explain the concepts needed to program a parallel PIM system (locality, threads, parcels). Next, we will present a simple PIM performance model that will be used in the remainder of the presentation. For each kernel, we will then present a set of codes, including codes for a single PIM node, and codes for multiple PIM nodes that move data to threads and move threads to data. These codes are written at a fairly low level, between assembly and C, but much closer to C than to assembly. For each code, we will present some hand-drafted timing forecasts, based on the simple PIM performance model. Finally, we will conclude by discussing what we have learned from this work, including what programming styles seem to work best, from the point-of-view of both expressiveness and performance.
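As an illustration of one of the listed kernels, the following is a generic textbook bitonic sort, written serially. It is not the PIM code from the presentation, which the abstract does not reproduce, and it assumes the input length is a power of two.

```python
def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two with a bitonic network."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    # Sort the halves in opposite directions so their concatenation is bitonic.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return _bitonic_merge(first + second, ascending)


def _bitonic_merge(values, ascending):
    """Merge a bitonic sequence into a fully sorted one."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    values = list(values)
    # These compare-exchange steps are mutually independent, so a parallel
    # version could hand each index i to a different thread.
    for i in range(half):
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return _bitonic_merge(values[:half], ascending) + _bitonic_merge(values[half:], ascending)


sorted_up = bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4])
sorted_down = bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4], ascending=False)
```

The fixed, data-independent comparison pattern is what makes this kernel a common choice for studying thread and data placement.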
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF

NASA Astrophysics Data System (ADS)

Mielikainen, J.; Huang, B.; Huang, A.

2011-12-01

The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size distributions, fall speeds, and densities. The Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language that allows GPUs to be programmed directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of the CPU. CUDA takes a bottom-up view of parallelism in which a thread is the atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for a 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground.
Each grid point has multiple levels, which correspond to various vertical heights in the atmosphere. The size of the CONUS 12 km domain is 433 x 308 horizontal grid points with 35 vertical levels. First, the entire SBU-YLIN Fortran code was rewritten in C in preparation for a GPU-accelerated version. After that, the C code was verified against the Fortran code for identical outputs. Default compiler options from WRF were used for the gfortran and gcc compilers. The processing time is 12274 ms for the original Fortran code and 12893 ms for the C version. The processing times for the GPU implementation of the SBU-YLIN microphysics scheme with I/O are 57.7 ms and 37.2 ms for 1 and 2 GPUs, respectively. The corresponding speedups are 213x and 330x compared to the Fortran implementation. Without I/O the speedup is 896x on 1 GPU. Ignoring I/O time, the speedup scales linearly with the number of GPUs; thus, 2 GPUs have a speedup of 1788x without I/O. Microphysics computation is just a small part of the whole WRF model. After having completely implemented WRF on GPU, the inputs for SBU-YLIN do not have to be transferred from the CPU. Instead they are results of previous WRF modules. Therefore, the role of I/O is greatly diminished once all of WRF has been converted to run on GPUs. In the near future, we expect to have WRF running completely on GPUs for superior performance.
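The reported with-I/O speedups follow directly from the timings in the abstract, as the short check below shows. The variable names are illustrative, and the kernel-only time is only the value implied by the reported 896x figure, not a measurement quoted in the abstract.

```python
# Timings quoted in the abstract, in milliseconds.
fortran_ms = 12274.0
gpu_with_io_ms = {1: 57.7, 2: 37.2}  # number of GPUs -> time including I/O

# Speedup = CPU (Fortran) time divided by GPU time.
speedups = {gpus: fortran_ms / ms for gpus, ms in gpu_with_io_ms.items()}
rounded = {gpus: round(s) for gpus, s in speedups.items()}

# Kernel-only time on one GPU implied by the reported 896x speedup without I/O.
implied_kernel_only_ms = fortran_ms / 896
```

Comparing 57.7 ms with I/O against the roughly 14 ms implied without it suggests that data transfer, rather than the kernel, accounts for most of the single-GPU time.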
Implementation of NAS Parallel Benchmarks in Java

NASA Technical Reports Server (NTRS)

Frumkin, Michael; Schultz, Matthew; Jin, Hao-Qiang; Yan, Jerry

2000-01-01

A number of features make Java an attractive but debatable choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would move Java closer to Fortran in the competition for CFD applications.
Hierarchical Parallelization of Gene Differential Association Analysis

PubMed Central

2011-01-01

Background: Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results: Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. Conclusions: The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels. PMID:21936916
Hierarchical parallelization of gene differential association analysis.

PubMed

Needham, Mark; Hu, Rui; Dwarkadas, Sandhya; Qiu, Xing

2011-09-21

Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.
Manufactured Textile Fibers

NASA Astrophysics Data System (ADS)

Gupta, Bhupender S.

The first conversion of naturally occurring fibers into threads strong enough to be looped into snares, knit to form nets, or woven into fabrics is lost in prehistory. Unlike stone weapons, such threads, cords, and fabrics, being organic in nature, have for the most part disappeared, although in some dry caves traces remain. There is ample evidence to indicate that spindles used to assist in the twisting of fibers together had been developed long before the dawn of recorded history. In that spinning process, fibers such as wool were drawn out of a loose mass, perhaps held in a distaff, and made parallel by human fingers. (A maidservant so spins in Giotto's The Annunciation to Anne, ca. A.D. 1306, Arena Chapel, Padua, Italy.) A rod (spindle), hooked to the lengthening thread, was rotated so that the fibers while so held were twisted together to form additional thread. The finished length then was wound by hand around the spindle, which, in becoming the core on which the finished product was accumulated, served the dual role of twisting and storing, and, in so doing, established a principle still in use today.
Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce.

PubMed

Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan

2017-01-01

Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
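The MapReduce pattern that this abstract refers to can be sketched in a few lines. This is a minimal illustration only, not Halvade-RNA's code: keying reads by a fixed-size genomic region, the `region_size` parameter, and the `reduce_region` stand-in (which merely counts reads where a real pipeline would invoke a tool on each group) are all assumptions made for the example.

```python
from collections import defaultdict

def map_read(read, region_size):
    """Map step: emit (region key, read). Keying by region is an assumption."""
    name, position = read
    return position // region_size, read

def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_region(region, reads):
    """Reduce step: hypothetical stand-in for running a tool on one region's reads."""
    return region, len(reads)

def run(reads, region_size):
    groups = shuffle(map_read(r, region_size) for r in reads)
    # Each group is independent, so the reduce calls could run concurrently.
    return dict(reduce_region(k, v) for k, v in sorted(groups.items()))

if __name__ == "__main__":
    reads = [("r1", 5), ("r2", 120), ("r3", 130), ("r4", 99)]
    print(run(reads, region_size=100))
```

Because each key's group is processed independently, the reduce phase is where several tool instances can operate at the same time.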
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azad, Ariful; Buluc, Aydin; Pothen, Alex

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

DOE PAGES

Azad, Ariful; Buluc, Aydin; Pothen, Alex

2016-03-24

It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandy Bridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.
Data Acquisition System for Multi-Frequency Radar Flight Operations Preparation

NASA Technical Reports Server (NTRS)

Leachman, Jonathan

2010-01-01

A three-channel data acquisition system was developed for the NASA Multi-Frequency Radar (MFR) system. The system is based on a commercial-off-the-shelf (COTS) industrial PC (personal computer) and two dual-channel 14-bit digital receiver cards. The decimated complex envelope representations of the three radar signals are passed to the host PC via the PCI bus, and then processed in parallel by multiple cores of the PC CPU (central processing unit). The innovation is this parallelization of the radar data processing using multiple cores of a standard COTS multi-core CPU. The data processing portion of the data acquisition software was built using autonomous program modules or threads, which can run simultaneously on different cores. A master program module calculates the optimal number of processing threads, launches them, and continually supplies each with data. The benefit of this new parallel software architecture is that COTS PCs can be used to implement increasingly complex processing algorithms on an increasing number of radar range gates and data rates. As new PCs become available with higher numbers of CPU cores, the software will automatically utilize the additional computational capacity.
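The master/worker arrangement described in this abstract (a master module that picks a thread count, launches the threads, and keeps supplying them with data) might be sketched as follows. This is only an illustration under stated assumptions: `process_block` is a hypothetical stand-in for the radar processing, and taking the core count as the "optimal" number of threads is an assumption, since the abstract does not say how that number is calculated. In standard CPython, threads also share an interpreter lock, so real speedup would depend on the processing releasing that lock (for example, inside native numerical code).

```python
import os
import queue
import threading

def process_block(samples):
    """Hypothetical stand-in for per-channel processing: mean power of a block."""
    return sum(x * x for x in samples) / len(samples)

def worker(tasks, results):
    """Processing thread: pull blocks until the master sends a None sentinel."""
    while True:
        item = tasks.get()
        if item is None:
            break
        index, samples = item
        results.put((index, process_block(samples)))

def run_master(blocks, n_threads=None):
    """Master: choose a thread count, launch workers, and feed them data."""
    # Assumption: use the number of CPU cores as the thread count.
    n_threads = n_threads or os.cpu_count() or 1
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for i, block in enumerate(blocks):
        tasks.put((i, block))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    out = [None] * len(blocks)
    while not results.empty():
        i, value = results.get()
        out[i] = value
    return out

if __name__ == "__main__":
    print(run_master([[1.0, 1.0], [2.0, 2.0], [3.0]]))
```

Because the thread count is derived from the machine at run time, the same code would use more workers on a machine with more cores, which is the property the abstract highlights.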
Concurrent Breakpoints

DTIC Science & Technology

2011-12-18

Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, pages 48–59, 1998. [8] A. Dinning and E. Schonberg. Detecting access...multi-threaded programs. ACM Trans. Comput. Syst., 15(4):391–411, 1997. [38] E. Schonberg. On-the-fly detection of access anomalies. In Proceedings
Argobots: A Lightweight Low-Level Threading and Tasking Framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan

In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, either are too specific to applications or architectures or are not as powerful or flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by end users or high-level programming models. We describe the design, implementation, and performance characterization of Argobots and present integrations with three high-level models: OpenMP, MPI, and colocated I/O services. Evaluations show that (1) Argobots, while providing richer capabilities, is competitive with existing simpler generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) I/O services with Argobots reduce interference with colocated applications while achieving performance competitive with that of a Pthreads approach.

Argobots: A Lightweight Low-Level Threading and Tasking Framework

DOE PAGES

Seo, Sangmin; Amer, Abdelhalim; Balaji, Pavan; ...

2017-10-24

In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to applications or architectures or are not as powerful or flexible. In this article, we present Argobots, a lightweight, low-level threading and tasking framework that is designed as a portable and performant substrate for high-level programming models or runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with providing a rich set of controls to allow specialization by the user or high-level programming model. Here, we describe the design, implementation, and optimization of Argobots and present integrations with three example high-level models: OpenMP, MPI, and co-located I/O service. Evaluations show that (1) Argobots outperforms existing generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency hiding capabilities; and (4) I/O service with Argobots reduces interference with co-located applications, achieving performance competitive with that of the Pthreads version.
Implementing Shared Memory Parallelism in MCBEND

NASA Astrophysics Data System (ADS)

Bird, Adam; Long, David; Dobson, Geoff

2017-09-01

MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To utilise parallel hardware more effectively, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND, and assesses the performance of the parallel method implemented in MCBEND.
Cache Locality Optimization for Recursive Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lifflander, Jonathan; Krishnamoorthy, Sriram

We present an approach to optimize the cache locality for recursive programs by dynamically splicing--recursively interleaving--the execution of distinct function invocations. By utilizing data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work in the context of a random work stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks

PubMed Central

Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-hoon

2017-01-01

In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. Like its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, the separation between the regular pheromone and the virtual pheromone in the diffusion process, and the exploration of routes that takes into consideration the number of hops in the best routes. In this work, a family of ACO routing protocols based on AntOR is also specified. These protocols are obtained through successive refinements of the protocol. We also present a parallelized version of AntOR that we call PAntOR. Targeting shared-memory multiprocessor architectures, PAntOR runs tasks in parallel using threads. This parallelization is applicable in the route setup phase, the route local repair process, and link failure notification. In addition, a variant of PAntOR that uses more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach uses threads to parallelize the sending of broadcast messages across interfaces. PMID:28531159
Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.

In this paper, we study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver on performance. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread count. We also developed a bounds-and-bottlenecks performance model of the solver which we used to guide us through the optimization effort, and also carried out performance tuning in the solver’s large parameter space. Finally, as a result, significant speedups were obtained on both machines.
Challenges in scaling NLO generators to leadership computers

NASA Astrophysics Data System (ADS)

Benjamin, D.; Childers, JT; Hoeche, S.; LeCompte, T.; Uram, T.

2017-10-01

Exascale computing resources are roughly a decade away and will be capable of 100 times more computing than current supercomputers. In the last year, Energy Frontier experiments crossed a milestone of 100 million core-hours used at the Argonne Leadership Computing Facility, Oak Ridge Leadership Computing Facility, and NERSC. The Fortran-based leading-order parton generator called Alpgen was successfully scaled to millions of threads to achieve this level of usage on Mira. Sherpa and MadGraph are next-to-leading order generators used heavily by LHC experiments for simulation. Integration times for high-multiplicity or rare processes can take a week or more on standard Grid machines, even using all 16 cores. We will describe our ongoing work to scale the Sherpa generator to thousands of threads on leadership-class machines and reduce run-times to less than a day. This work allows the experiments to leverage large-scale parallel supercomputers for event generation today, freeing tens of millions of grid hours for other work, and paving the way for future applications (simulation, reconstruction) on these and future supercomputers.
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

NASA Astrophysics Data System (ADS)

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

2014-06-01

This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, which usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10⁶ spatial cells and 1 × 10¹² DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
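The sustained-performance figures in this abstract imply a node peak that can be recovered with one division. The sketch below is a reader's back-of-the-envelope check, not a calculation from the paper; it reads the reported fraction of peak as 40.74%, and the function name is made up for the example.

```python
def implied_peak_gflops(sustained_gflops, fraction_of_peak):
    """Peak performance implied by a sustained rate and its fraction of peak."""
    return sustained_gflops / fraction_of_peak

if __name__ == "__main__":
    peak = implied_peak_gflops(235.0, 0.4074)   # figures quoted in the abstract
    print(f"implied node peak: {peak:.1f} GFlops")
    print(f"implied per-core peak (32 cores): {peak / 32:.1f} GFlops")
```

Under that reading, the 32-core node would have a peak of roughly 577 GFlops, or about 18 GFlops per core.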
OpenMP parallelization of a gridded SWAT (SWATG)

NASA Astrophysics Data System (ADS)

Zhang, Ying; Hou, Jinliang; Cao, Yongpan; Gu, Juan; Huang, Chunlin

2017-12-01

Large-scale, long-term and high spatial resolution simulation is a common issue in environmental modeling. A Gridded Hydrologic Response Unit (HRU)-based Soil and Water Assessment Tool (SWATG) that integrates a grid modeling scheme with different spatial representations also presents such problems. The long run times limit applications of very high resolution, large-scale watershed modeling. The OpenMP (Open Multi-Processing) parallel application interface is integrated with SWATG (called SWATGP) to accelerate grid modeling based on the HRU level. Such a parallel implementation takes better advantage of the computational power of a shared memory computer system. We conducted two experiments at multiple temporal and spatial scales of hydrological modeling using SWATG and SWATGP on a high-end server. At 500-m resolution, SWATGP was found to be up to nine times faster than SWATG in modeling a roughly 2000 km² watershed with 1 CPU and a 15-thread configuration. The study results demonstrate that parallel models save considerable time relative to traditional sequential simulation runs. Parallel computations of environmental models are beneficial for model applications, especially at large spatial and temporal scales and at high resolutions. The proposed SWATGP model is thus a promising tool for large-scale and high-resolution water resources research and management in addition to offering data fusion and model coupling ability.
Evaluation of the peri-implant bone around parallel-walled dental implants with a condensing thread macrodesign and a self-tapping apex: a 10-year retrospective histological analysis.

PubMed

Degidi, Marco; Perrotti, Vittoria; Shibli, Jamil A; Mortellaro, Carmen; Piattelli, Adriano; Iezzi, Giovanna

2014-05-01

The long-term high percentages of survival and success of dental implants reported in the literature are related mainly to new, innovative implant and thread designs, and to new implant surfaces that provide very good primary and secondary stability in most anatomical and clinical situations, even in low quality and quantity of bone, promoting a more rapid osseointegration. The aim of this retrospective study was a histological and histomorphometrical evaluation of the bone response around implants with a parallel-wall configuration, condensing thread macrodesign, and self-tapping apex, retrieved from humans for various reasons. A total of 10 implants were reported in the present study, and these implants had been retrieved after a loading period ranging from a few weeks to about 8 years. Mineralized newly formed bone was found at the interface of all the implants, in direct contact with the implant surface, with no gaps or connective fibrous tissue. This bone adapted very well to the microirregularities of the implant surface. Areas of bone remodeling were present in some regions of the interface, with many reversal lines. High bone-implant contact percentages were found. In conclusion, both the macrostructure and the microstructure of this specific type of implant could contribute to the high long-term implant survival and success percentages.
A hybrid algorithm for parallel molecular dynamics simulations

NASA Astrophysics Data System (ADS)

Mangiardi, Chris M.; Meyer, R.

2017-10-01

This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
[Design and study of parallel computing environment of Monte Carlo simulation for particle therapy planning using a public cloud-computing infrastructure].

PubMed

Yokohama, Noriya

2013-07-01

This report was aimed at structuring the design of architectures and studying performance measurement of a parallel computing environment using a Monte Carlo simulation for particle therapy using a high performance computing (HPC) instance within a public cloud-computing infrastructure. Performance measurements showed an approximately 28 times faster speed than seen with single-thread architecture, combined with improved stability. A study of methods of optimizing the system operations also indicated lower cost.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2013-10-29

Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
Comparison of in-vivo failure of single-thread and dual-thread temporary anchorage devices over 18 months: A split-mouth randomized controlled trial.

PubMed

Durrani, Owais Khalid; Shaheed, Sohrab; Khan, Arsalan; Bashir, Ulfat

2017-10-01

The purpose of this study was to compare the in-vivo failure rates of single-thread and dual-thread temporary anchorage device (TAD) designs over 18 months. Thirty patients with skeletal Class II Division 1 malocclusion requiring anchorage from TADs for retraction of maxillary incisors into the extracted premolar space were recruited in this parallel group, split-mouth, randomized controlled trial. A block randomization sequence was generated with Random Allocation Software (version 2.0; Isfahan, Iran) with the allocations concealed in sequentially numbered, opaque, sealed envelopes. A total of 60 TADs (diameter, 2 mm; length, 10 mm) were placed in the maxillary arches of these patients with random allocation of the 2 types to the left and the right sides in a 1:1 ratio. All TADs were placed between the roots of the second premolar and the first molar and were immediately loaded. Patients were followed for a minimum of 12 months and a maximum of 18 months for the failure of the TADs. Data were analyzed blindly on an intention-to-treat basis. Four TADs (13.3%) failed in the single-thread group, and 6 TADs (20%) failed in the dual-thread group. The McNemar test showed an insignificant difference (P = 0.72) between the 2 groups. An odds ratio of 1.6 (95% confidence interval, 0.39-6.97) showed no significant associations among the variables. Most TADs failed in the first month after insertion (50%). The failure rate of dual-thread TADs compared with single-thread TADs is statistically insignificant when placed in the maxilla for retraction of the anterior segment. Registration: The trial was not registered before commencement. The protocol was not published before the trial. Copyright © 2016 American Association of Orthodontists. Published by Elsevier Inc. All rights reserved.
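The failure percentages and the odds ratio quoted in this abstract can be reproduced from the reported counts (4 of 30 single-thread and 6 of 30 dual-thread TADs failed). The sketch below computes a simple unpaired odds ratio from those marginal counts; whether the authors derived their value this way is an assumption, and the confidence interval and the McNemar test are not reproduced here.

```python
def failure_rate(failures, total):
    """Proportion of devices that failed."""
    return failures / total

def odds_ratio(events_a, n_a, events_b, n_b):
    """Unpaired odds ratio of group A relative to group B."""
    odds_a = events_a / (n_a - events_a)
    odds_b = events_b / (n_b - events_b)
    return odds_a / odds_b

if __name__ == "__main__":
    print(f"single-thread failure rate: {failure_rate(4, 30):.1%}")
    print(f"dual-thread failure rate:   {failure_rate(6, 30):.1%}")
    print(f"odds ratio (dual vs single): {odds_ratio(6, 30, 4, 30):.3f}")
```

The odds of failure are 6/24 for dual-thread and 4/26 for single-thread devices, so the ratio comes to 1.625, consistent with the reported 1.6.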
Development of a Distributed Parallel Computing Framework to Facilitate Regional/Global Gridded Crop Modeling with Various Scenarios

NASA Astrophysics Data System (ADS)

Jang, W.; Engda, T. A.; Neff, J. C.; Herrick, J.

2017-12-01

Many crop models are increasingly used to evaluate crop yields at regional and global scales. However, implementation of these models across large areas using fine-scale grids is limited by computational time requirements. In order to facilitate global gridded crop modeling with various scenarios (i.e., different crop, management schedule, fertilizer, and irrigation) using the Environmental Policy Integrated Climate (EPIC) model, we developed a distributed parallel computing framework in Python. Our local desktop with 14 cores (28 threads) was used to test the distributed parallel computing framework in Iringa, Tanzania, which has 406,839 grid cells. High-resolution soil data, SoilGrids (250 x 250 m), and climate data, AgMERRA (0.25 x 0.25 deg), were also used as input data for the gridded EPIC model. The framework includes a master file for parallel computing, input database, input data formatters, EPIC model execution, and output analyzers. Through the master file for parallel computing, the EPIC simulation is divided into jobs according to the user-defined number of CPU threads. Then, using the EPIC input data formatters, the raw database is formatted into EPIC input data, and the formatted data are passed to the EPIC simulation jobs. Then, 28 EPIC jobs run simultaneously, and only the result files of interest are parsed and moved into the output analyzers. We applied various scenarios with seven different slopes and twenty-four fertilizer ranges. Parallelized input generators create different scenarios as a list for distributed parallel computing. After all simulations are completed, parallelized output analyzers are used to analyze all outputs according to the different scenarios. This saves significant computing time and resources, making it possible to conduct gridded modeling at regional to global scales with high-resolution data. For example, serial processing for the Iringa test case would require 113 hours, while using the framework developed in this study requires only approximately 6 hours, a nearly 95% reduction in computing time.
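The job-splitting step and the quoted time saving can be illustrated with a short sketch. It is not the framework's code: `split_into_jobs`, `run_job`, and `run_all` are hypothetical names, `run_job` merely counts cells where the real framework would format inputs and run EPIC, and the use of a thread pool is an assumption made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_jobs(n_cells, n_jobs):
    """Return (start, stop) index ranges covering n_cells as evenly as possible."""
    base, extra = divmod(n_cells, n_jobs)
    ranges, start = [], 0
    for j in range(n_jobs):
        stop = start + base + (1 if j < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def run_job(cell_range):
    """Hypothetical stand-in for one job; returns the number of cells handled."""
    start, stop = cell_range
    return stop - start

def run_all(n_cells, n_jobs):
    """Run every job concurrently and collect the per-job results in order."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(run_job, split_into_jobs(n_cells, n_jobs)))

def time_reduction(serial_hours, parallel_hours):
    """Fraction of computing time saved relative to the serial run."""
    return 1.0 - parallel_hours / serial_hours

if __name__ == "__main__":
    counts = run_all(406_839, 28)               # grid cells and jobs from the abstract
    print(min(counts), max(counts), sum(counts))
    print(f"time reduction: {time_reduction(113, 6):.1%}")
    print(f"speedup: {113 / 6:.1f}x")
```

With 406,839 cells and 28 jobs, each job receives about 14,530 cells, and going from 113 hours to roughly 6 hours is a reduction of about 94.7% (a speedup near 19x), in line with the "nearly 95%" stated in the abstract.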
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2013-11-12

Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call site statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.
Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2014-02-11

Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

2014-08-12

Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.
Meandering instability of a viscous thread

NASA Astrophysics Data System (ADS)

Morris, Stephen W.; Dawes, Jonathan H. P.; Ribe, Neil M.; Lister, John R.

2008-06-01

A viscous thread falling from a nozzle onto a surface exhibits the famous rope-coiling effect, in which the thread buckles to form loops. If the surface is replaced by a belt moving with speed U , the rotational symmetry of the buckling instability is broken and a wealth of interesting states are observed [see S. Chiu-Webster and J. R. Lister, J. Fluid Mech. 569, 89 (2006)]. We experimentally studied this “fluid-mechanical sewing machine” in a more precise apparatus. As U is reduced, the steady catenary thread bifurcates into a meandering state in which the thread displacements are only transverse to the motion of the belt. We measured the amplitude and frequency ω of the meandering close to the bifurcation. For smaller U , single-frequency meandering bifurcates to a two-frequency “figure-8” state, which contains a significant 2ω component and parallel as well as transverse displacements. This eventually reverts to single-frequency coiling at still smaller U . More complex, highly hysteretic states with additional frequencies are observed for larger nozzle heights. We propose to understand this zoology in terms of the generic amplitude equations appropriate for resonant interactions between two oscillatory modes with frequencies ω and 2ω . The form of the amplitude equations captures both the axisymmetry of the U=0 coiling state and the symmetry-breaking effects induced by the moving belt.
Thread amplitudes and frequencies in a fluid mechanical `sewing machine'

NASA Astrophysics Data System (ADS)

Morris, Stephen W.; Dawes, J. H. P.; Lister, John; Dalziel, Stuart

2006-11-01

A viscous thread falling on a surface exhibits the famous rope-coiling effect, in which the thread buckles to form loops. If the surface is replaced by a belt moving at speed U, the rotational symmetry of the buckling instability is broken and a wealth of interesting states are observed (1). We experimentally studied this fluid mechanical `sewing machine' in a new, more precise apparatus. As U is reduced, the stretched thread bifurcates into a meandering state in which the thread displacements are only transverse to the motion of the belt. We measured the amplitudes A and frequency φ of the meandering close to the bifurcation. For small U, single-frequency meandering bifurcates to a two-frequency `figure 8' state, which contains a significant 2φ component and parallel as well as transverse displacements. This eventually reverts to single-frequency coiling at smaller U. More complex, highly hysteretic states with additional harmonics are observed for larger nozzle heights. We propose to understand this zoology in terms of the generic amplitude equations appropriate for resonant interactions between three oscillatory modes with frequencies φ, 2φ and 3φ. The form of the amplitude equations captures both the axisymmetry of the U=0 coiling state and the symmetry-breaking effects induced by the moving belt. (1) Chiu-Webster and Lister, J. Fluid Mech., in press.

Meandering instability of a viscous thread.

PubMed

Morris, Stephen W; Dawes, Jonathan H P; Ribe, Neil M; Lister, John R

2008-06-01

A viscous thread falling from a nozzle onto a surface exhibits the famous rope-coiling effect, in which the thread buckles to form loops. If the surface is replaced by a belt moving with speed U, the rotational symmetry of the buckling instability is broken and a wealth of interesting states are observed [see S. Chiu-Webster and J. R. Lister, J. Fluid Mech. 569, 89 (2006)]. We experimentally studied this "fluid-mechanical sewing machine" in a more precise apparatus. As U is reduced, the steady catenary thread bifurcates into a meandering state in which the thread displacements are only transverse to the motion of the belt. We measured the amplitude and frequency ω of the meandering close to the bifurcation. For smaller U, single-frequency meandering bifurcates to a two-frequency "figure-8" state, which contains a significant 2ω component and parallel as well as transverse displacements. This eventually reverts to single-frequency coiling at still smaller U. More complex, highly hysteretic states with additional frequencies are observed for larger nozzle heights. We propose to understand this zoology in terms of the generic amplitude equations appropriate for resonant interactions between two oscillatory modes with frequencies ω and 2ω. The form of the amplitude equations captures both the axisymmetry of the U=0 coiling state and the symmetry-breaking effects induced by the moving belt.
Implementing and analyzing the multi-threaded LP-inference

NASA Astrophysics Data System (ADS)

Bolotova, S. Yu; Trofimenko, E. V.; Leschinskaya, M. V.

2018-03-01

The logical production equations provide new possibilities for backward inference optimization in intelligent production-type systems. The strategy of relevant backward inference aims to minimize the number of queries to an external information source (either a database or an interactive user). The method is based on computing the set of initial preimages and searching for the true preimage. Each stage can be executed independently and in parallel, and the actual work at a given stage can also be distributed between parallel computers. This paper is devoted to parallel algorithms for relevant inference based on an advanced "pipeline" scheme of parallel computation, which makes it possible to increase the degree of parallelism. The authors also provide some details of the LP-structures implementation.
A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual-core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets

DOE PAGES

Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...

2017-01-28

Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition techniques at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as a mouse brain, may require hours of computation time with a medium-size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Trace provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De

Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition techniques at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as a mouse brain, may require hours of computation time with a medium-size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Trace provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.
Synchrotron X-ray topographic study on nature of threading mixed dislocations in 4H–SiC crystals grown by PVT method

DOE Office of Scientific and Technical Information (OSTI.GOV)

Guo, Jianqiu; Yang, Yu; Wu, Fangzhen

Synchrotron X-ray Topography is a powerful technique to study defect structures, particularly dislocation configurations, in single crystals. Complementing this technique with geometrical and contrast analysis can enhance the efficiency of quantitatively characterizing defects. In this study, the use of Synchrotron White Beam X-ray Topography (SWBXT) to determine the line directions of threading dislocations in 4H–SiC axial slices (sample cut parallel to the growth axis from the boule) is demonstrated. This technique is based on the fact that the projected line directions of dislocations on different reflections are different. Another technique also discussed is the determination of the absolute Burgers vectors of threading mixed dislocations (TMDs) using Synchrotron Monochromatic Beam X-ray Topography (SMBXT). This technique utilizes the fact that the contrast from TMDs varies on SMBXT images as their Burgers vectors change. By comparing observed contrast with the contrast from threading dislocations provided by Ray Tracing Simulations, the Burgers vectors can be determined. Thereafter the distribution of TMDs with different Burgers vectors across the wafer is mapped and investigated.
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

PubMed Central

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
FastGCN: a GPU accelerated tool for fast gene co-expression networks.

PubMed

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments

DTIC Science & Technology

2013-05-01

consolidated at the sender side. At the receiver side, the messages are deconsolidated and delivered to the appropriate thread. This approach bears some...Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. Panda . Performance comparison of mpi implementations over infiniband, myrinet and quadrics
Experiments and Analyses of Data Transfers Over Wide-Area Dedicated Connections

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rao, Nageswara S.; Liu, Qiang; Sen, Satyabrata

Dedicated wide-area network connections are increasingly employed in high-performance computing and big data scenarios. One might expect the performance and dynamics of data transfers over such connections to be easy to analyze due to the lack of competing traffic. However, non-linear transport dynamics and end-system complexities (e.g., multi-core hosts and distributed filesystems) can in fact make analysis surprisingly challenging. We present extensive measurements of memory-to-memory and disk-to-disk file transfers over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory-to-memory transfers, profiles of both TCP and UDT throughput as a function of RTT show concave and convex regions; large buffer sizes and more parallel flows lead to wider concave regions, which are highly desirable. TCP and UDT both also display complex throughput dynamics, as indicated by their Poincare maps and Lyapunov exponents. For disk-to-disk transfers, we determine that high throughput can be achieved via a combination of parallel I/O threads, parallel network threads, and direct I/O mode. Our measurements also show that Lustre filesystems can be mounted over long-haul connections using LNet routers, although challenges remain in jointly optimizing file I/O and transport method parameters to achieve peak throughput.
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.

PubMed

Zhang, Jing; Wang, Hao; Feng, Wu-Chun

2017-01-01

BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
Parallel-wire grid assembly with method and apparatus for construction thereof

DOEpatents

Lewandowski, E.F.; Vrabec, J.

1981-10-26

Disclosed is a parallel wire grid and an apparatus and method for making the same. The grid consists of a generally coplanar array of parallel spaced-apart wires secured between metallic frame members by an electrically conductive epoxy. The method consists of continuously winding a wire about a novel winding apparatus comprising a plurality of spaced-apart generally parallel spindles. Each spindle is threaded with a number of predeterminedly spaced-apart grooves which receive and accurately position the wire at predetermined positions along the spindle. Overlying frame members coated with electrically conductive epoxy are then placed on either side of the wire array and are drawn together. After the epoxy hardens, portions of the wire array lying outside the frame members are trimmed away.
Parallel-wire grid assembly with method and apparatus for construction thereof

DOEpatents

Lewandowski, Edward F.; Vrabec, John

1984-01-01

Disclosed is a parallel wire grid and an apparatus and method for making the same. The grid consists of a generally coplanar array of parallel spaced-apart wires secured between metallic frame members by an electrically conductive epoxy. The method consists of continuously winding a wire about a novel winding apparatus comprising a plurality of spaced-apart generally parallel spindles. Each spindle is threaded with a number of predeterminedly spaced-apart grooves which receive and accurately position the wire at predetermined positions along the spindle. Overlying frame members coated with electrically conductive epoxy are then placed on either side of the wire array and are drawn together. After the epoxy hardens, portions of the wire array lying outside the frame members are trimmed away.
Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Su, Chun-Yi

By 2004, microprocessor design focused on multicore scaling, increasing the number of cores per die in each generation, as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. 
I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrent throttling and a novel thread mapping algorithm to manage thread resources and improve energy efficient execution in multi-core, NUMA systems.
Opportunities and choice in a new vector era

NASA Astrophysics Data System (ADS)

Nowak, A.

2014-06-01

This work discusses the significant changes in computing landscape related to the progression of Moore's Law, and the implications on scientific computing. Particular attention is devoted to the High Energy Physics domain (HEP), which has always made good use of threading, but levels of parallelism closer to the hardware were often left underutilized. Findings of the CERN openlab Platform Competence Center are reported in the context of expanding "performance dimensions", and especially the resurgence of vectors. These suggest that data oriented designs are feasible in HEP and have considerable potential for performance improvements on multiple levels, but will rarely trump algorithmic enhancements. Finally, an analysis of upcoming hardware and software technologies identifies heterogeneity as a major challenge for software, which will require more emphasis on scalable, efficient design.
Fixture for holding testing transducer

DOEpatents

Wagner, T.A.; Engel, H.P.

A fixture for mounting an ultrasonic transducer against the end of a threaded bolt or stud to test the same for flaws. A base means threadedly secured to the side of the bolt has a rotating ring thereon. A post rising up from the ring (parallel to the axis of the workpiece) pivotally mounts a variable length cross arm, on the inner end of which is mounted the transducer. A spring means acts between the cross arm and the base to apply the testing transducer against the workpiece at a constant pressure. The device maintains constant for successive tests the radial and circumferential positions of the testing transducer and its contact pressure against the end of the workpiece.
Fixture for holding testing transducer

DOEpatents

Wagner, Thomas A.; Engel, Herbert P.

1984-01-01

A fixture for mounting an ultrasonic transducer against the end of a threaded bolt or stud to test the same for flaws. A base means threadedly secured to the side of the bolt has a rotating ring thereon. A post rising up from the ring (parallel to the axis of the workpiece) pivotally mounts a variable length cross arm, on the inner end of which is mounted the transducer. A spring means acts between the cross arm and the base to apply the testing transducer against the workpiece at a constant pressure. The device maintains constant for successive tests the radial and circumferential positions of the testing transducer and its contact pressure against the end of the workpiece.
Reduction of threading dislocation density in SiGe epilayer on Si (0 0 1) by lateral growth liquid-phase epitaxy

NASA Astrophysics Data System (ADS)

O'Reilly, Andrew J.; Quitoriano, Nathaniel J.

2018-02-01

Si0.973Ge0.027 epilayers were grown on a Si (0 0 1) substrate by a lateral liquid-phase epitaxy (LLPE) technique. The lateral growth mechanism favoured the glide of misfit dislocations and inhibited the nucleation of new dislocations by maintaining the thickness less than the critical thicknesses for dislocation nucleation and greater than the critical thickness for glide. This promoted the formation of an array of long misfit dislocations parallel to the [1 1 0] growth direction and reduced the threading dislocation density to 10^3 cm^-2, two orders of magnitude lower than the seed area with an isotropic misfit dislocation network.
PHoToNs–A parallel heterogeneous and threads oriented code for cosmological N-body simulation

NASA Astrophysics Data System (ADS)

Wang, Qiao; Cao, Zong-Yan; Gao, Liang; Chi, Xue-Bin; Meng, Chen; Wang, Jie; Wang, Long

2018-06-01

We introduce a new code for cosmological simulations, PHoToNs, which incorporates features for performing massive cosmological simulations on heterogeneous high performance computer (HPC) systems and threads oriented programming. PHoToNs adopts a hybrid scheme to compute gravitational force, with the conventional Particle-Mesh (PM) algorithm to compute the long-range force, the Tree algorithm to compute the short-range force and the direct summation Particle-Particle (PP) algorithm to compute gravity from very close particles. A self-similar space-filling Peano-Hilbert curve is used to decompose the computing domain. Threads programming is advantageously used to more flexibly manage the domain communication, PM calculation and synchronization, as well as Dual Tree Traversal on the CPU+MIC platform. PHoToNs scales well, and the efficiency of the PP kernel reaches 68.6% of peak performance on MIC and 74.4% on CPU platforms. We also test the accuracy of the code against Gadget-2, which is widely used in the community, and find excellent agreement.
Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R

Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.

Parallel, Distributed Scripting with Python

DOE Office of Scientific and Technical Information (OSTI.GOV)

Miller, P J

2002-05-24

Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI library gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.
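The dictionary check mentioned at the end of this abstract can be decomposed as in the following sketch. It is an illustration under explicit assumptions rather than the report's code: unsalted SHA-256 stands in for whatever password hashing scheme is actually used, the names `digest`, `check_chunk`, and `find_password` are hypothetical, and a thread pool replaces the MPI-style distribution the abstract has in mind.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(word):
    """Stand-in for the password hashing scheme (assumption: unsalted SHA-256)."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


def check_chunk(args):
    """Return the first word in the chunk whose digest matches the target, else None."""
    target, words = args
    for word in words:
        if digest(word) == target:
            return word
    return None


def find_password(target, dictionary, n_workers=4):
    """Distribute the dictionary over workers and coordinate the result."""
    # Strided split: worker i checks words i, i + n_workers, i + 2 * n_workers, ...
    chunks = [(target, dictionary[i::n_workers]) for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for hit in pool.map(check_chunk, chunks):
            if hit is not None:
                return hit
    return None
```

The thread pool only shows how the work is distributed and coordinated. Because hashing short strings holds the interpreter lock in CPython, an actual speedup would likely require separate processes or MPI ranks, with the same chunking logic.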
Parallel Computer System for 3D Visualization Stereo on GPU

NASA Astrophysics Data System (ADS)

Al-Oraiqat, Anas M.; Zori, Sergii A.

2018-03-01

This paper proposes the organization of a parallel computer system based on a Graphics Processing Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of the intersections of tracing rays with scene objects. The system allows a significant increase in productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Architecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
Parallel sort with a ranged, partitioned key-value store in a high performance computing environment

DOEpatents

Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron; Poole, Stephen W.

2016-01-26

Improved sorting techniques are provided that perform a parallel sort using a ranged, partitioned key-value store in a high performance computing (HPC) environment. A plurality of input data files comprising unsorted key-value data in a partitioned key-value store are sorted. The partitioned key-value store comprises a range server for each of a plurality of ranges. Each input data file has an associated reader thread. Each reader thread reads the unsorted key-value data in the corresponding input data file and performs a local sort of the unsorted key-value data to generate sorted key-value data. A plurality of sorted, ranged subsets of each of the sorted key-value data are generated based on the plurality of ranges. Each sorted, ranged subset corresponds to a given one of the ranges and is provided to one of the range servers corresponding to the range of the sorted, ranged subset. Each range server sorts the received sorted, ranged subsets and provides a sorted range. A plurality of the sorted ranges are concatenated to obtain a globally sorted result.
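The flow described above (local sort per reader thread, split by range, merge per range server, concatenate) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the patented implementation: input files are modelled as in-memory lists of `(key, value)` pairs, range servers as pool tasks, and the function names and the `boundaries` list are hypothetical.

```python
import bisect
import heapq
from concurrent.futures import ThreadPoolExecutor


def reader_sort(records):
    # Reader thread: local sort of one input file's key-value pairs.
    return sorted(records)


def split_by_range(sorted_records, boundaries):
    # Cut one sorted run into sorted, ranged subsets.
    # boundaries[i] is the first key that belongs to range i + 1.
    subsets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in sorted_records:
        subsets[bisect.bisect_right(boundaries, key)].append((key, value))
    return subsets


def range_server_sort(subsets):
    # Range server: merge the already-sorted subsets it received.
    return list(heapq.merge(*subsets))


def ranged_parallel_sort(input_files, boundaries):
    n_ranges = len(boundaries) + 1
    with ThreadPoolExecutor() as pool:
        local_runs = list(pool.map(reader_sort, input_files))
        split_runs = [split_by_range(run, boundaries) for run in local_runs]
        # Regroup so that each range server sees one subset per reader.
        per_range = [[run[r] for run in split_runs] for r in range(n_ranges)]
        sorted_ranges = list(pool.map(range_server_sort, per_range))
    # Concatenating the sorted ranges in range order gives the global result.
    return [pair for rng in sorted_ranges for pair in rng]


files = [[(17, "q"), (3, "a"), (25, "z")], [(12, "m"), (1, "b"), (30, "y")]]
print(ranged_parallel_sort(files, boundaries=[10, 20]))
```

Because every key in range i sorts before every key in range i + 1, the final step is a plain concatenation rather than another merge.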
The Wang Landau parallel algorithm for the simple grids. Optimizing OpenMPI parallel implementation

NASA Astrophysics Data System (ADS)

Kussainov, A. S.

2017-12-01

The Wang Landau Monte Carlo algorithm to calculate the density of states for different simple spin lattices was implemented. The energy space was split between the individual threads and balanced according to the expected runtime for the individual processes. A custom spin clustering mechanism, necessary for overcoming the critical slowdown in certain energy subspaces, was devised. Stable reconstruction of the density of states was of primary importance. Some data post-processing techniques were involved to produce the expected smooth density of states.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)

2002-01-01

In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: The OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads and the Paraver graphical user interface for inspection and analyses of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

DOE PAGES

Blazewicz, Marek; Hinder, Ian; Koppelman, David M.; ...

2013-01-01

Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
Parallelization of the Flow Field Dependent Variation Scheme for Solving the Triple Shock/Boundary Layer Interaction Problem

NASA Technical Reports Server (NTRS)

Schunk, Richard Gregory; Chung, T. J.

2001-01-01

A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme, based upon multi-threading, as implemented on multiple processor supercomputers and workstations are presented.
On some methods for improving time of reachability sets computation for the dynamic system control problem

NASA Astrophysics Data System (ADS)

Zimovets, Artem; Matviychuk, Alexander; Ushakov, Vladimir

2016-12-01

The paper presents two different approaches to reducing the computation time of reachability sets. The first approach uses different data structures for storing the reachability sets in computer memory for calculation in single-threaded mode. The second approach is based on parallel algorithms that operate on the data structures from the first approach. Within the framework of this paper, a parallel algorithm for approximate reachability set calculation on a computer with SMP architecture is proposed. The results of numerical modelling are presented in the form of tables which demonstrate the high efficiency of parallel computing technology and also show how computing time depends on the data structure used.
Electromagnetic Physics Models for Parallel Computing Architectures

NASA Astrophysics Data System (ADS)

Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2016-10-01

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
Fano effect dominance over Coulomb blockade in transport properties of parallel coupled quantum dot system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brogi, Bharat Bhushan, E-mail: brogi-221179@yahoo.in; Ahluwalia, P. K.; Chand, Shyam

2015-06-24

The Coulomb blockade effect on transport properties (transmission probability and I-V characteristics) for varied configurations of a coupled quantum dot system has been studied theoretically using the Non-Equilibrium Green Function (NEGF) formalism and the Equation of Motion (EOM) method in the presence of magnetic flux. The self-consistent approach and intra-dot Coulomb interaction are taken into account. As the key parameters of the coupled quantum dot system, such as dot-lead coupling, inter-dot tunneling and magnetic flux threading through the system, can be tuned, the effect of the asymmetry parameter and magnetic flux on this tuning is explored in the Coulomb blockade regime. The presence of the Coulomb blockade due to on-dot Coulomb interaction decreases the width of the transmission peak at energy level ε + U, and by adjusting the magnetic flux the swapping effect in the Fano peaks in asymmetric and symmetric parallel configurations persists despite a strong Coulomb blockade effect.
Pythran: enabling static optimization of scientific Python programs

NASA Astrophysics Data System (ADS)

Guelton, Serge; Brunet, Pierrick; Amini, Mehdi; Merlini, Adrien; Corbillon, Xavier; Raynaud, Alan

2015-01-01

Pythran is an open source static compiler that turns modules written in a subset of Python language into native ones. Assuming that scientific modules do not rely much on the dynamic features of the language, it trades them for powerful, possibly inter-procedural, optimizations. These optimizations include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical idioms such as expression templates. Unlike the Cython approach, Pythran input code remains compatible with the Python interpreter. Output code is generally as efficient as the annotated Cython equivalent, if not more, but without the backward compatibility loss.
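The compatibility claim above can be illustrated with a small module. This is a hedged sketch: the function `dot` is hypothetical, and the two annotations follow the comment-based export and OpenMP conventions documented by Pythran. Both are ordinary comments to CPython, so the file runs unchanged under the interpreter, while a Pythran build (typically by running `pythran` on the file) would treat them as directives.

```python
#pythran export dot(float list, float list)
def dot(xs, ys):
    total = 0.0
    #omp parallel for reduction(+:total)
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


# Under plain CPython both annotations above are ignored as comments.
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # prints 32.0
```

Keeping the annotations in comments is what lets the same source serve as both the interpreted reference and the compiler input.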
RCrawler: An R package for parallel web crawling and scraping

NASA Astrophysics Data System (ADS)

Khalil, Salim; Fakir, Mohamed

RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.
Distinct spinning patterns gain differentiated loading tolerance of silk thread anchorages in spiders with different ecology.

PubMed

Wolff, Jonas O; van der Meijden, Arie; Herberstein, Marie E

2017-07-26

Building behaviour in animals extends biological functions beyond bodies. Many studies have emphasized the role of behavioural programmes, physiology and extrinsic factors for the structure and function of buildings. Structure attachments associated with animal constructions offer yet unrealized research opportunities. Spiders build a variety of one- to three-dimensional structures from silk fibres. The evolution of economic web shapes as a key for ecological success in spiders has been related to the emergence of high performance silks and thread coating glues. However, the role of thread anchorages has been widely neglected in those models. Here, we show that orb-web (Araneidae) and hunting spiders (Sparassidae) use different silk application patterns that determine the structure and robustness of the joint in silk thread anchorages. Silk anchorages of orb-web spiders show a greater robustness against different loading situations, whereas the silk anchorages of hunting spiders have their highest pull-off resistance when loaded parallel to the substrate along the direction of dragline spinning. This suggests that the behavioural 'printing' of silk into attachment discs along with spinneret morphology was a prerequisite for the evolution of extended silk use in a three-dimensional space. This highlights the ecological role of attachments in the evolution of animal architectures. © 2017 The Author(s).
Illustrating Thermodynamic Concepts Using a Hero's Engine

NASA Astrophysics Data System (ADS)

Muiño, Pedro L.; Hodgson, James R.

2000-05-01

A modified Hero's engine is used to illustrate concepts of thermodynamics and engineering design suitable for introductory chemistry courses and more advanced physical chemistry courses. The engine is a boiler made of Pyrex with two off-center nozzles. Upon boiling, the vapor exits the nozzles, creating two opposite, off-center forces that result in a circular motion by the engine around the vertical axis. The engine is suspended from a horizontal bar by means of two parallel threads. The rotation of the engine results in the twisting of the threads, with two important effects: the engine is raised vertically, and potential energy is stored in the coiling of the threads. When the engine is raised, it is removed from the heating source. This stops the boiling. The stored potential energy is then released into kinetic energy; that is, the threads uncoil, and the engine rotates in the opposite direction. This lowers the engine into the flame, so the water resumes boiling and the engine can be raised again. This cycle continues until all the liquid water is vaporized. This demonstration is suitable to illustrate concepts like gas expansion, gas cooling through expansion (Joule-Thomson experiment), conversion of heat to work, interconversion between kinetic energy and potential energy, and feedback mechanisms.
Data-dependence Profiling to Enable Safe Thread Level Speculation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bhattacharyya, Arnamoy; Amaral, José Nelson; Finkel, Hal

Data-dependence profiling is a technique that enables a compiler to judiciously decide when the execution of a loop (which the compiler could not prove to be dependence free) should be speculated through the use of Thread Level Speculation (TLS). The data collected by a data-dependence profiler can be used to predict if may dependencies reported by a compiler static analysis are likely to materialize at runtime. A cost analysis can then be used to decide that some loops with a lower probability of dependence should be speculatively parallelized. This paper addresses the question as to whether a loop's dependence behaviour changes when the input to the program changes; a study of 57 different benchmarks indicates that it usually does not change. Then the paper describes SpecEval, an automatic speculative parallelization framework that uses single-input data-dependence profiles to find speculation candidates in the SPEC2006 and PolyBench/C benchmarks. This paper also presents a performance evaluation of TLS implementation in IBM's BlueGene/Q supercomputer and shows that the performance of TLS is affected by several factors, including the number of speculated loops, the execution-time coverage of speculated loops, the miss-speculation overhead, the L1 cache miss rate and the effect on dynamic instruction path length.
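The idea of a data-dependence profile can be made concrete with a toy model. This sketch is not SpecEval: it assumes each loop iteration is summarized as a pair of sets (locations read, locations written), counts only cross-iteration flow (read-after-write) dependences, and the name `dependence_rate` is hypothetical.

```python
def dependence_rate(trace):
    """trace: one (reads, writes) pair of location sets per loop iteration.

    Returns the fraction of iterations that read a location written by an
    earlier iteration, i.e. iterations that would conflict under speculation.
    """
    written = set()   # locations written by earlier iterations
    dependent = 0
    for reads, writes in trace:
        if any(loc in written for loc in reads):
            dependent += 1
        written.update(writes)
    return dependent / len(trace) if trace else 0.0


# a[i] = b[i] * 2: no iteration reads what another iteration wrote.
independent = [({("b", i)}, {("a", i)}) for i in range(8)]
# a[i] = a[i - 1] + 1: every iteration after the first reads the previous write.
carried = [({("a", i - 1)}, {("a", i)}) for i in range(1, 9)]

print(dependence_rate(independent), dependence_rate(carried))  # prints 0.0 0.875
```

A cost model would then compare this rate against the expected mis-speculation overhead. The first loop is a clear speculation candidate, the second is not.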
Supporting Graduate Student Writers with VoiceThread

ERIC Educational Resources Information Center

Gonzalez, Michelle; Moore, Noreen S.

2018-01-01

This qualitative case study examined the influence of the use of VoiceThread technology on the feedback process for thesis writing in two online asynchronous graduate courses. The influence on instructor feedback process and graduate student writers' perceptions of the use of VoiceThread were the foci of the study. Master's-level students (n = 18)…
During Threaded Discussions Are Non-Native English Speakers Always at a Disadvantage?

ERIC Educational Resources Information Center

Shafer Willner, Lynn

2014-01-01

When participating in threaded discussions, under what conditions might non-native speakers of English (NNSE) be at a comparative disadvantage to their classmates who are native speakers of English (NSE)? This study compares the threaded discussion perspectives of closely-matched NNSE and NSE adult students having different levels of threaded…
An overview of the Opus language and runtime system

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush; Haines, Matthew

1994-01-01

We have recently introduced a new language, called Opus, which provides a set of Fortran language extensions that allow for integrated support of task and data parallelism. It also provides shared data abstractions (SDA's) as a method for communication and synchronization among these tasks. In this paper, we first provide a brief description of the language features and then focus on both the language-dependent and language-independent parts of the runtime system that support the language. The language-independent portion of the runtime system supports lightweight threads across multiple address spaces, and is built upon existing lightweight thread and communication systems. The language-dependent portion of the runtime system supports conditional invocation of SDA methods and distributed SDA argument handling.
Iran's Implicit Philosophy of Education

ERIC Educational Resources Information Center

Bagheri Noaparast, Khosrow

2018-01-01

This paper aims to extract Iran's philosophy of education from two sources of the constitution and the course of practice in educational institutions. Regarding the first source, it is argued that parallel to the two main threads of the constitution, Iran's main elements of philosophy of education are expected to be derived from; (1) Islam and (2)…
Accelerating next generation sequencing data analysis with system level optimizations.

PubMed

Kathiresan, Nagarajan; Temanni, Ramzi; Almabrazi, Hakeem; Syed, Najeeb; Jithesh, Puthen V; Al-Ali, Rashid

2017-08-22

Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are some of the hardware features in modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller, which is part of common NGS workflows and consumes more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked and the execution time of HaplotypeCaller was optimized by various system level parameters, which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing, (ii) architecture-specific tuning in the PairHMM library for vectorization, (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer, and (iv) over-clocking the default 'on-demand' mode of CPU frequency by using 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7 respectively.

Parallel algorithm of VLBI software correlator under multiprocessor environment

NASA Astrophysics Data System (ADS)

Zheng, Weimin; Zhang, Dong

2007-11-01

The correlator is the key signal processing equipment of a Very Long Baseline Interferometry (VLBI) synthetic aperture telescope. It receives the mass data collected by the VLBI observatories and produces the visibility function of the target, which can be used for spacecraft positioning, baseline length measurement, synthesis imaging, and other scientific applications. VLBI data correlation is both data intensive and computation intensive. This paper presents the algorithms of two parallel software correlators under multiprocessor environments. A near real-time correlator for spacecraft tracking adopts pipelining and thread-parallel technology, and runs on SMP (Symmetric Multiprocessor) servers. Another high speed prototype correlator using a mixed Pthreads and MPI (Message Passing Interface) parallel algorithm is realized on a small Beowulf cluster platform. Both correlators feature a flexible structure, scalability, and 10-station data correlating abilities.
A Parallel Processing Algorithm for Remote Sensing Classification

NASA Technical Reports Server (NTRS)

Gualtieri, J. Anthony

2005-01-01

A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example on the Medusa cluster at NASA/GSFC, this provides for super computing performance, 130 Gflops (Linpack Benchmark) at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
Electromagnetic physics models for parallel computing architectures

DOE PAGES

Amadio, G.; Ananya, A.; Apostolakis, J.; ...

2016-11-21

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.
A Moiré Pattern-Based Thread Counter

NASA Astrophysics Data System (ADS)

Reich, Gary

2017-10-01

Thread count is a term used in the textile industry as a measure of how closely woven a fabric is. It is usually defined as the sum of the number of warp threads per inch (or cm) and the number of weft threads per inch. (It is sometimes confusingly described as the number of threads per square inch.) In recent years it has also become a subject of considerable interest and some controversy among consumers. Many consumers consider thread count to be a key measure of the quality or fineness of a fabric, especially bed sheets, and they seek out fabrics that advertise high counts. Manufacturers in turn have responded to this interest by offering fabrics with ever higher claimed thread counts (sold at ever higher prices), sometimes achieving the higher counts by distorting the definition of the term with some "creative math." In 2005 the Federal Trade Commission noted the growing use of thread count in advertising at the retail level and warned of the potential for consumers to be misled by distortions of the definition.
Real-time implementation of optimized maximum noise fraction transform for feature extraction of hyperspectral images

NASA Astrophysics Data System (ADS)

Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Zhao, Haina; Li, Jun

2014-01-01

We present a parallel implementation of the optimized maximum noise fraction (G-OMNF) transform algorithm for feature extraction of hyperspectral images on commodity graphics processing units (GPUs). The proposed approach explored the algorithm data-level concurrency and optimized the computing flow. We first defined a three-dimensional grid, in which each thread calculates a sub-block data to easily facilitate the spatial and spectral neighborhood data searches in noise estimation, which is one of the most important steps involved in OMNF. Then, we optimized the processing flow and computed the noise covariance matrix before computing the image covariance matrix to reduce the original hyperspectral image data transmission. These optimization strategies can greatly improve the computing efficiency and can be applied to other feature extraction algorithms. The proposed parallel feature extraction algorithm was implemented on an Nvidia Tesla GPU using the compute unified device architecture and basic linear algebra subroutines library. Through the experiments on several real hyperspectral images, our GPU parallel implementation provides a significant speedup of the algorithm compared with the CPU implementation, especially for highly data parallelizable and arithmetically intensive algorithm parts, such as noise estimation. In order to further evaluate the effectiveness of G-OMNF, we used two different applications: spectral unmixing and classification for evaluation. Considering the sensor scanning rate and the data acquisition time, the proposed parallel implementation met the on-board real-time feature extraction.
National Centers for Environmental Prediction

Science.gov Websites

the number of threads used. HWRF group cannot access Zeus and Jet for real-time data transfers from nodes used.). All single jobs will be run on one rack and will not share with parallel jobs. No official change the group when using tag_rstprod (-g option). autotag_rstprod is a script that tags all files. It
Web 2.0, Pedagogical Support for Reflexive and Emotional Social Interaction among Swedish Students

ERIC Educational Resources Information Center

Augustsson, Gunnar

2010-01-01

Collaborative social interaction when using Web 2.0 in terms of VoiceThread is investigated in a case study of a Swedish university course in social psychology. The case study method was chosen because of the desire not to manipulate the students' behaviour, and data was collected in parallel with course implementation. Two particular…
Reducing False Positives in Runtime Analysis of Deadlocks

NASA Technical Reports Server (NTRS)

Bensalem, Saddek; Havelund, Klaus; Clancy, Daniel (Technical Monitor)

2002-01-01

This paper presents an improvement of a standard algorithm for detecting deadlock potentials in multi-threaded programs, in that it reduces the number of false positives. The standard algorithm works as follows. The multi-threaded program under observation is executed, while lock and unlock events are observed. A graph of locks is built, with edges between locks symbolizing locking orders. Any cycle in the graph signifies a potential for a deadlock. The typical standard example is the group of dining philosophers sharing forks. The algorithm is interesting because it can catch deadlock potentials even though no deadlocks occur in the examined trace, and at the same time it scales very well in contrast to more formal approaches to deadlock detection. The algorithm, however, can yield false positives (as well as false negatives). The extension of the algorithm described in this paper reduces the number of false positives for three particular cases: when a gate lock protects a cycle, when a single thread introduces a cycle, and when the code segments in different threads that cause the cycle can actually not execute in parallel. The paper formalizes a theory for dynamic deadlock detection and compares it to model checking and static analysis techniques. It furthermore describes an implementation for analyzing Java programs and its application to two case studies: a planetary rover and a spacecraft altitude control system.
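The standard algorithm summarized above (observe lock and unlock events, build a lock-order graph, report cycles) can be sketched as follows. This covers only the baseline, not the paper's false-positive reductions. The trace format and the names `lock_order_graph` and `has_cycle` are assumptions made for illustration, and re-entrant acquisition of a held lock is not handled.

```python
from collections import defaultdict


def lock_order_graph(trace):
    """trace: (thread, 'lock' or 'unlock', lock_name) events in observed order."""
    held = defaultdict(list)   # thread -> locks it currently holds
    edges = defaultdict(set)   # lock -> locks acquired while it was held
    for thread, action, lock in trace:
        if action == "lock":
            for outer in held[thread]:
                edges[outer].add(lock)
            held[thread].append(lock)
        else:
            held[thread].remove(lock)
    return edges


def has_cycle(edges):
    # Depth-first search; reaching a node still on the stack closes a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(edges))


# Two philosophers taking forks A and B in opposite orders, one after the
# other: the run itself never deadlocks, yet the graph has the cycle A -> B -> A.
trace = [
    ("T1", "lock", "A"), ("T1", "lock", "B"), ("T1", "unlock", "B"), ("T1", "unlock", "A"),
    ("T2", "lock", "B"), ("T2", "lock", "A"), ("T2", "unlock", "A"), ("T2", "unlock", "B"),
]
print(has_cycle(lock_order_graph(trace)))  # prints True
```

The example shows why the approach catches potentials that the observed run never triggers. It also shows where false positives come from, because the graph discards which thread created each edge and whether the two segments could overlap in time.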
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness

NASA Astrophysics Data System (ADS)

Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.

2014-09-01

Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). 
Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Ultrastructural characteristics of tau filaments in tauopathies: immuno-electron microscopic demonstration of tau filaments in tauopathies.

PubMed

Arima, Kunimasa

2006-10-01

The microtubule-associated protein tau aggregates into filaments in the form of neurofibrillary tangles, neuropil threads and argyrophilic grains in neurons, in the form of variable astrocytic tangles in astrocytes and in the form of coiled bodies and argyrophilic threads in oligodendrocytes. These tau filaments may be classified into two types, straight filaments or tubules with 9-18 nm diameters and "twisted ribbons" composed of two parallel aligned components. In the same disease, the fine structure of tau filaments in glial cells roughly resembles that in neurons. In sporadic tauopathies, individual tau filaments show characteristic sizes, shapes and arrangements, and therefore contribute to neuropathologic differential diagnosis. In frontotemporal dementias caused by tau gene mutations, variable filamentous profiles were observed in association with mutation sites and insoluble tau isoforms, including straight filaments or tubules, paired helical filament-like filaments, and twisted ribbons. Pre-embedding immunoelectron microscopic studies were carried out using anti-3-repeat tau and anti-4-repeat tau specific antibodies, RD3 and RD4. Straight tubules in neuronal and astrocytic Pick bodies were immunolabeled by the anti-3-repeat tau antibody. The anti-4-repeat tau antibody recognized abnormal tubules comprising neurofibrillary tangles, coiled bodies and argyrophilic threads in progressive supranuclear palsy (PSP) and corticobasal degeneration. In the pre-embedding immunoelectron microscopic study using the phosphorylated tau AT8 antibody, tuft-shaped astrocytes of PSP were found to be composed of bundles of abnormal tubules in processes and perikarya of protoplasmic astrocytes. In this study, the 3-repeat tau or 4-repeat tau epitope was detected in situ at the ultrastructural level in abnormal tubules in representative pathological lesions in Pick's disease, PSP and corticobasal degeneration.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Langer, Steven H.; Karlin, Ian; Marinak, Marty M.

HYDRA is used to simulate a variety of experiments carried out at the National Ignition Facility (NIF) [4] and other high energy density physics facilities. HYDRA has packages to simulate radiation transfer, atomic physics, hydrodynamics, laser propagation, and a number of other physics effects. HYDRA has over one million lines of code and includes both MPI and thread-level (OpenMP and pthreads) parallelism. This paper measures the performance characteristics of HYDRA using hardware counters on an IBM Blue Gene/Q system. We report key ratios such as bytes/instruction and memory bandwidth for several different physics packages. The total number of bytes read and written per time step is also reported. We show that none of the packages which use significant time are memory bandwidth limited on a Blue Gene/Q. HYDRA currently issues very few SIMD instructions. The pressure on memory bandwidth will increase if high levels of SIMD instructions can be achieved.
Massively parallel multicanonical simulations

NASA Astrophysics Data System (ADS)

Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard

2018-03-01

Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 10^4 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.
Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages

NASA Astrophysics Data System (ADS)

Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel

2018-01-01

This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
Fencing data transfers in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Mamidala, Amith R.

2015-06-02

Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Mamidala, Amith R.

2015-06-09

Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task; the compute nodes coupled for data communications through the PAMI and through data communications resources including at least one segment of shared random access memory; including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers through a segment of shared memory; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Mamidala, Amith R.

2015-08-11

Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Fencing data transfers in a parallel active messaging interface of a parallel computer

DOEpatents

Blocksome, Michael A.; Mamidala, Amith R.

2015-06-30

Fencing data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint comprising a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI and through data communications resources including a deterministic data communications network, including initiating execution through the PAMI of an ordered sequence of active SEND instructions for SEND data transfers between two endpoints, effecting deterministic SEND data transfers; and executing through the PAMI, with no FENCE accounting for SEND data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all SEND instructions initiated prior to execution of the FENCE instruction for SEND data transfers between the two endpoints.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2015-02-03

Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2014-11-18

Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (`RTS`) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data.
GPU-accelerated adjoint algorithmic differentiation

NASA Astrophysics Data System (ADS)

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2016-03-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.

GPU-Accelerated Adjoint Algorithmic Differentiation.

PubMed

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2016-03-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
GPU-Accelerated Adjoint Algorithmic Differentiation

PubMed Central

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2015-01-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography. 
PMID:26941443
Development and application of a direct method to observe the implant/bone interface using simulated bone.

PubMed

Yamaguchi, Yoko; Shiota, Makoto; FuJii, Masaki; Sekiya, Michi; Ozeki, Masahiko

2016-01-01

Primary stability after implant placement is essential for osseointegration. It is important to understand the bone/implant interface for analyzing the influence of implant design on primary stability. In this study rigid polyurethane foam is used as artificial bone to evaluate the bone-implant interface and to identify where the torque is being generated during placement. Five implant systems were used for this experiment: Straumann-Standard (ST), Straumann-Bone Level (BL), Straumann-Tapered Effect (TE), Nobel Biocare-Brånemark MKIII (MK3), and Nobel Biocare-Brånemark MKIV (MK4). Artificial bone blocks were prepared and the implant was installed. After placement, a metal jig and one side artificial bone block were removed and then the implant embedded in the artificial bone was exposed for observing the bone-implant interface. A digital micro-analyzer was used for observing the contact interface. The insertion torque values were 39.35, 23.78, 12.53, 26.35, and 17.79 N cm for MK4, BL, ST, TE, and MK3, respectively. In ST, MK3, TE, MK4, and BL the white layer areas were 61 × 10^3 μm^2, 37 × 10^3 μm^2, 103 × 10^3 μm^2 in the tapered portion and 84 × 10^3 μm^2 in the parallel portion, 134 × 10^3 μm^2, and 98 × 10^3 μm^2 in the tapered portion and 87 × 10^3 μm^2 in the parallel portion, respectively. The direct observation method of the implant/artificial bone interface is a simple and useful method that enables the identification of the area where implant retention occurs. A white layer at the site of stress concentration during implant placement was identified and the magnitude of the stress was quantitatively estimated. The site where the highest torque occurred was the area from the thread crest to the thread root and the under and lateral aspect of the platform. The artificial bone debris created by the self-tapping blade accumulated in both the cutting chamber and in the space between the threads and artificial bone.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.

PubMed

Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T

2017-01-01

Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations

PubMed Central

Ho, ThienLuan; Oh, Seung-Rohk

2017-01-01

Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes a memory-access-efficient algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using warp-shuffle operations instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results for real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over a sequential algorithm on a CPU and a previous parallel approximate string matching algorithm on GPUs, respectively. PMID:29016700
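For orientation, the k-differences problem named in this record can be sketched as a sequential dynamic program. This is a minimal CPU illustration of the standard column-wise Levenshtein recurrence, not the warp-shuffle GPU algorithm of the paper; the function name `k_difference_matches` and its return convention (0-based end positions in the text) are assumptions made for the example.

```python
def k_difference_matches(pattern, text, k):
    """Return 0-based end positions in `text` where some substring
    matches `pattern` with at most k edits (Levenshtein distance)."""
    m = len(pattern)
    # col[i] holds the distance between pattern[:i] and the best
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]   # value for (i-1, j-1)
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]     # value for (i, j-1)
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # missing character in text
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

Each text position depends only on the previous column, which is the dependency that a GPU implementation has to exchange between threads; the paper reports doing that exchange with warp-shuffle operations rather than shared memory.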
Spherical Harmonic Solutions to the 3D Kobayashi Benchmark Suite

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, P.N.; Chang, B.; Hanebutte, U.R.

1999-12-29

Spherical harmonic solutions of order 5, 9 and 21 on spatial grids containing up to 3.3 million cells are presented for the Kobayashi benchmark suite. This suite of three problems with simple geometry of pure absorber with large void region was proposed by Professor Kobayashi at an OECD/NEA meeting in 1996. Each of the three problems contains a source, a void and a shield region. Problem 1 can best be described as a box in a box problem, where a source region is surrounded by a square void region which itself is embedded in a square shield region. Problems 2 and 3 represent a shield with a void duct: Problem 2 has a straight duct and Problem 3 a dog-leg-shaped duct. A pure absorber and a 50% scattering case are considered for each of the three problems. The solutions have been obtained with Ardra, a scalable, parallel neutron transport code developed at Lawrence Livermore National Laboratory (LLNL). The Ardra code takes advantage of a two-level parallelization strategy, which combines message passing between processing nodes and thread-based parallelism amongst processors on each node. All calculations were performed on the IBM ASCI Blue-Pacific computer at LLNL.
Runtime Detection of C-Style Errors in UPC Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pirkelbauer, P; Liao, C; Panas, T

2011-09-29

Unified Parallel C (UPC) extends the C programming language (ISO C 99) with explicit parallel programming support for the partitioned global address space (PGAS), which provides a global memory space with localized partitions to each thread. Like its ancestor C, UPC is a low-level language that emphasizes code efficiency over safety. The absence of dynamic (and static) safety checks allows programmer oversights and software flaws that can be hard to spot. In this paper, we present an extension of a dynamic analysis tool, ROSE-Code Instrumentation and Runtime Monitor (ROSE-CIRM), for UPC to help programmers find C-style errors involving the global address space. Built on top of the ROSE source-to-source compiler infrastructure, the tool instruments source files with code that monitors operations and keeps track of changes to the system state. The resulting code is linked to a runtime monitor that observes the program execution and finds software defects. We describe the extensions to ROSE-CIRM that were necessary to support UPC. We discuss complications that arise from parallel code and our solutions. We test ROSE-CIRM against a runtime error detection test suite, and present performance results obtained from running error-free codes. ROSE-CIRM is released as part of the ROSE compiler under a BSD-style open source license.
Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Li, Xiaoye; Husbands, Parry; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

2002-01-01

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0) preconditioned CG (PCG) using different programming paradigms and architectures. Results show that for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems, that cache reuse may be more important than reducing communication, that it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution, and that a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gain. An implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
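As a reminder of what the CG iteration in this record computes, here is a minimal unpreconditioned sketch for a small dense symmetric positive definite system. It is a textbook serial version using plain lists, not the parallel or ILU(0)-preconditioned codes studied in the paper; the names `conjugate_gradient`, `tol` and `max_iter` are assumptions chosen for the example.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A
    (list of rows) with the unpreconditioned CG iteration."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the zero initial guess
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)                      # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

The matrix-vector product and the dot products dominate each iteration; in a parallel setting these are the operations whose data layout and communication the ordering and partitioning strategies affect.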
Influence of fine threads and platform-switching on crestal bone stress around implant-a three-dimensional finite element analysis.

PubMed

Khurana, Pardeep; Sharma, Arun; Sodhi, Kiranmeet Kaur

2013-12-01

The aims of this study were to investigate the effect of implant fine threads on crestal bone stress compared to a standard smooth implant collar and to analyze how different abutment diameters influenced the crestal bone stress level. Three-dimensional finite element imaging was used to create a cross-sectional model in SolidWorks 2007 software of an implant (5-mm platform and 10 mm in length) placed in the premolar region of the mandible. The implant model was created to resemble a commercially available fine thread implant. Abutments of different diameters (5.0 mm: standard, 4.5 mm, 4.0 mm, and 3.5 mm) were loaded with a force of 100 N at 90° vertical and 40° oblique angles. Finite element analysis was done in COSMOSWorks software, which was used to analyze the stress patterns in bone, especially in the crestal region. Upon loading, the fine thread implant model had greater stress at the crestal bone adjacent to the implant than the smooth neck implant in both vertical and oblique loading. When the abutment diameter decreased progressively from 5.0 mm to 4.5 mm to 4 mm and to 3.5 mm the thread model showed a reduction of stress at the crestal bone level from 23.2 MPa to 15.02 MPa for fine thread and from 22.7 to 13.5 MPa for smooth collar implant group after vertical loading and from 43.7 MPa to 33.1 MPa in fine thread model and from 36.9 to 20.5 MPa in smooth collar implant model after oblique loading. Fine threads increase crestal stress upon loading. Reduced abutment diameter that is platform switching resulted in less stress translated to the crestal bone in the fine thread and smooth neck.
Parallelization of elliptic solver for solving 1D Boussinesq model

NASA Astrophysics Data System (ADS)

Tarwidi, D.; Adytia, D.

2018-03-01

In this paper, a parallel implementation of an elliptic solver for the 1D Boussinesq model is presented. The numerical solution of the Boussinesq model is obtained by applying a staggered grid scheme to the continuity, momentum, and elliptic equations of the model. The tridiagonal system emerging from the numerical scheme of the elliptic equation is solved by the cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of the parallel program, the number of grid points is varied from 2^8 to 2^14. Two test cases of numerical experiment, i.e. propagation of solitary and standing waves, are proposed to evaluate the parallel program. The numerical results are verified with analytical solutions of solitary and standing waves. The best speedup of the solitary and standing wave test cases is about 2.07 with 2^14 grid points and 1.86 with 2^13 grid points, respectively, both obtained using 8 threads. Moreover, the best efficiency of the parallel program is 76.2% and 73.5% for the solitary and standing wave test cases, respectively.
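The cyclic reduction step mentioned in this record can be illustrated with a short recursive sketch. This is a serial Python illustration under stated assumptions, not the authors' OpenMP code: the function name `cyclic_reduction` and the coefficient layout (sub-diagonal `a` with `a[0] == 0`, diagonal `b`, super-diagonal `c` with `c[-1] == 0`, right-hand side `d`) are chosen for the example, and nonzero pivots are assumed, as for a diagonally dominant system.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic
    reduction. Assumes a[0] == 0, c[-1] == 0 and nonzero pivots."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    ra, rb, rc, rd = [], [], [], []
    # Forward reduction: eliminate even-indexed unknowns from each
    # odd-indexed equation. The iterations are mutually independent.
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        na = -alpha * a[i - 1]
        nb = b[i] - alpha * c[i - 1]
        nc = 0.0
        nd = d[i] - alpha * d[i - 1]
        if i + 1 < n:
            beta = c[i] / b[i + 1]
            nb -= beta * a[i + 1]
            nc = -beta * c[i + 1]
            nd -= beta * d[i + 1]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)
    x = [0.0] * n
    # The odd-indexed unknowns form a tridiagonal system of half the size.
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)
    # Back substitution: each even-indexed unknown follows from its
    # already known odd neighbours; again the iterations are independent.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x
```

Both loops have independent iterations at every level, which is what makes the method suitable for a shared-memory loop parallelization; the amount of work halves at each level, which is one plausible reason speedups stay modest for small grids.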
Symbolic Analysis of Concurrent Programs with Polymorphism

NASA Technical Reports Server (NTRS)

Rungta, Neha Shyam

2010-01-01

The current trend of multi-core and multi-processor computing is causing a paradigm shift from inherently sequential to highly concurrent and parallel applications. Certain thread interleavings, data input values, or combinations of both often cause errors in the system. Systematic verification techniques such as explicit state model checking and symbolic execution are extensively used to detect errors in such systems [7, 9]. Explicit state model checking enumerates possible thread schedules and input data values of a program in order to check for errors [3, 9]. To partially mitigate the state space explosion from data input values, symbolic execution techniques substitute data input values with symbolic values [5, 7, 6]. Explicit state model checking and symbolic execution techniques used in conjunction with exhaustive search techniques such as depth-first search are unable to detect errors in medium to large-sized concurrent programs because the number of behaviors caused by data and thread non-determinism is extremely large. We present an overview of abstraction-guided symbolic execution for concurrent programs that detects errors manifested by a combination of thread schedules and data values [8]. The technique generates a set of key program locations relevant in testing the reachability of the target locations. The symbolic execution is then guided along these locations in an attempt to generate a feasible execution path to the error state. This allows the execution to focus on the parts of the behavior space more likely to contain an error.
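To make the idea of enumerating thread schedules concrete, the sketch below exhaustively explores every interleaving of two threads that each perform a non-atomic increment (read, then write) on a shared counter. It is only a toy illustration of explicit schedule enumeration, not the abstraction-guided technique of the paper; the names `interleavings` and `run` and the two-step thread model are assumptions introduced for the example.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every schedule of threads A and B as a tuple of thread ids.

    Each thread's own steps keep their program order; only the
    relative ordering between the two threads varies.
    """
    total = n_a + n_b
    for positions in combinations(range(total), n_a):
        chosen = set(positions)
        yield tuple("A" if i in chosen else "B" for i in range(total))


def run(schedule):
    """Execute one schedule; each thread reads the counter, then writes back +1."""
    shared = 0
    local = {"A": 0, "B": 0}
    step = {"A": 0, "B": 0}
    for t in schedule:
        if step[t] == 0:
            local[t] = shared          # step 0: read shared counter
        else:
            shared = local[t] + 1      # step 1: write incremented value
        step[t] += 1
    return shared


# Explore all schedules and collect those that lose an update.
results = {s: run(s) for s in interleavings(2, 2)}
bad = [s for s, value in results.items() if value != 2]
```

Even this four-step program has six schedules, and only the two in which one thread finishes before the other starts leave the counter at 2. The number of schedules grows combinatorially with the number of steps and threads, which is the state space explosion the abstract refers to.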
AthenaMT: upgrading the ATLAS software framework for the many-core world with multi-threading

NASA Astrophysics Data System (ADS)

Leggett, Charles; Baines, John; Bold, Tomasz; Calafiura, Paolo; Farrell, Steven; van Gemmeren, Peter; Malon, David; Ritsch, Elmar; Stewart, Graeme; Snyder, Scott; Tsulaia, Vakhtang; Wynne, Benjamin; ATLAS Collaboration

2017-10-01

ATLAS’s current software framework, Gaudi/Athena, has been very successful for the experiment in LHC Runs 1 and 2. However, its single threaded design has been recognized for some time to be increasingly problematic as CPUs have increased core counts and decreased available memory per core. Even the multi-process version of Athena, AthenaMP, will not scale to the range of architectures we expect to use beyond Run2. After concluding a rigorous requirements phase, where many design components were examined in detail, ATLAS has begun the migration to a new data-flow driven, multi-threaded framework, which enables the simultaneous processing of singleton, thread unsafe legacy Algorithms, cloned Algorithms that execute concurrently in their own threads with different Event contexts, and fully re-entrant, thread safe Algorithms. In this paper we report on the process of modifying the framework to safely process multiple concurrent events in different threads, which entails significant changes in the underlying handling of features such as event and time dependent data, asynchronous callbacks, metadata, integration with the online High Level Trigger for partial processing in certain regions of interest, concurrent I/O, as well as ensuring thread safety of core services. We also report on upgrading the framework to handle Algorithms that are fully re-entrant.
Carbon Nanotube Thread Electrochemical Cell: Detection of Heavy Metals.

PubMed

Zhao, Daoli; Siebold, David; Alvarez, Noe T; Shanov, Vesselin N; Heineman, William R

2017-09-19

In this work, all three electrodes in an electrochemical cell were fabricated based on carbon nanotube (CNT) thread. CNT thread partially insulated with a thin polystyrene coating to define the microelectrode area was used as the working electrode; bare CNT thread was used as the auxiliary electrode; and a micro quasi-reference electrode was fabricated by electroplating CNT thread with Ag and then anodizing it in chloride solution to form a layer of AgCl. The Ag|AgCl coated CNT thread electrode provided a stable potential comparable to the conventional liquid-junction type Ag|AgCl reference electrode. The CNT thread auxiliary electrode provided a stable current, which is comparable to a Pt wire auxiliary electrode. This all-CNT thread three electrode cell has been evaluated as a microsensor for the simultaneous determination of trace levels of heavy metal ions by anodic stripping voltammetry (ASV). Hg 2+ , Cu 2+ , and Pb 2+ were used as a representative system for this study. The calculated detection limits (based on the 3σ method) with a 120 s deposition time are 1.05, 0.53, and 0.57 nM for Hg 2+ , Cu 2+ , and Pb 2+ , respectively. These electrodes significantly reduce the dimensions of the conventional three electrode electrochemical cell to the microscale.
Random Number Generation for High Performance Computing

DTIC Science & Technology

2015-01-01

number streams, a quality metric for the parallel random number streams. * * * * * Atty. Dkt . No.: 5660-14400 Customer No. 35690 Eric B. Meyertons...responsibility to ensure timely payment of maintenance fees when due. Pagel of3 PTOL-85 (Rev. 02/11) Atty. Dkt . No.: 5660-14400 Page 1 Meyertons...with each subtask executed by a separate thread or process (henceforth, process). Each process has Atty. Dkt . No.: 5660-14400 Page 2 Meyertons
Implementation of BT, SP, LU, and FT of NAS Parallel Benchmarks in Java

NASA Technical Reports Server (NTRS)

Schultz, Matthew; Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry

2000-01-01

A number of Java features make it an attractive but debatable choice for High Performance Computing. We have implemented the benchmarks working on a single structured grid (BT, SP, LU, and FT) in Java. The performance and scalability of the Java code show that significant improvements in Java compiler technology and in the Java thread implementation are necessary for Java to compete with Fortran in HPC applications.
Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model

NASA Astrophysics Data System (ADS)

Huang, Melin; Huang, Bormin; Huang, Allen H.

2014-10-01

The Weather Research and Forecasting (WRF) model provides operational services worldwide in many areas and is linked to our daily activities, in particular during severe weather events. The Yonsei University (YSU) scheme is one of the planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provides atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation of the YSU scheme, we employ the Intel Many Integrated Core (MIC) Architecture, a multiprocessor computer structure that supports efficient parallelization and vectorization. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
Optimizing Approximate Weighted Matching on Nvidia Kepler K40

DOE Office of Scientific and Technical Information (OSTI.GOV)

Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh

Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms on the other hand generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on Nvidia Kepler K-40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set of 269 inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by 10 to 100x for over 100 instances, and from 100 to 1000x for 15 instances. We also demonstrate up to 20x speedup relative to 2 threads, and up to 5x relative to 16 threads on Intel Xeon platform with 16 cores for the same algorithm. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.
Frequent Statement and Dereference Elimination for Imperative and Object-Oriented Distributed Programs

PubMed Central

El-Zawawy, Mohamed A.

2014-01-01

This paper introduces new approaches for the analysis of frequent statement and dereference elimination for imperative and object-oriented distributed programs running on parallel machines equipped with hierarchical memories. The paper uses languages whose address spaces are globally partitioned. Distributed programs allow defining data layout and threads writing to and reading from other thread memories. Three type systems (for imperative distributed programs) are the tools of the proposed techniques. The first type system defines for every program point a set of calculated (ready) statements and memory accesses. The second type system uses an enriched version of types of the first type system and determines which of the ready statements and memory accesses are used later in the program. The third type system uses the information gathered so far to eliminate unnecessary statement computations and memory accesses (the analysis of frequent statement and dereference elimination). Extensions to these type systems are also presented to cover object-oriented distributed programs. Two advantages of our work over related work are the following. The hierarchical style of concurrent parallel computers is similar to the memory model used in this paper. In our approach, each analysis result is assigned a type derivation (which serves as a correctness proof). PMID:24892098
A comparison of parallel and diverging screw angles in the stability of locked plate constructs.

PubMed

Wähnert, D; Windolf, M; Brianza, S; Rothstock, S; Radtke, R; Brighenti, V; Schwieger, K

2011-09-01

We investigated the static and cyclical strength of parallel and angulated locking plate screws using rigid polyurethane foam (0.32 g/cm^3) and bovine cancellous bone blocks. Custom-made stainless steel plates with two conically threaded screw holes with different angulations (parallel, 10° and 20° divergent) and 5 mm self-tapping locking screws underwent pull-out and cyclical pull and bending tests. The bovine cancellous blocks were only subjected to static pull-out testing. We also performed finite element analysis for the static pull-out test of the parallel and 20° configurations. In both the foam model and the bovine cancellous bone we found significantly higher pull-out forces for the parallel constructs. In the finite element analysis there was 47% more damage in the 20° divergent constructs than in the parallel configuration. Under cyclical loading, the mean number of cycles to failure was significantly higher for the parallel group, followed by the 10° and 20° divergent configurations. In our laboratory setting we clearly showed the biomechanical disadvantage of a diverging locking screw angle under static and cyclical loading.
Heterogeneous computing architecture for fast detection of SNP-SNP interactions.

PubMed

Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros

2014-06-25

The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems.
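A short calculation shows why exhaustive SNP-SNP evaluation is so expensive: the number of unordered pairs grows quadratically with the number of SNPs. The sketch below assumes pairwise (two-SNP) interactions only. The helper names `pair_count` and `days_needed` are hypothetical, and the tests-per-second rate passed in is purely illustrative, not a figure from the paper.

```python
from math import comb


def pair_count(n_snps):
    """Number of unordered SNP-SNP pairs: n * (n - 1) / 2."""
    return comb(n_snps, 2)


def days_needed(n_snps, tests_per_second):
    """Wall-clock days to test every pair at an assumed constant rate."""
    seconds = pair_count(n_snps) / tests_per_second
    return seconds / 86_400


# "Hundreds of thousands to millions" of SNPs, as in the abstract.
for n in (100_000, 1_000_000):
    print(n, pair_count(n))

# Illustrative only: a hypothetical rate of one million pair tests per second.
estimate = days_needed(1_000_000, 1e6)
```

Because every pair can be scored independently of every other pair, the workload maps naturally onto the many threads of a GPU or MIC coprocessor.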

Heterogeneous computing architecture for fast detection of SNP-SNP interactions

PubMed Central

2014-01-01

Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems. PMID:24964802
A rapid parallelization of cone-beam projection and back-projection operator based on texture fetching interpolation

NASA Astrophysics Data System (ADS)

Xie, Lizhe; Hu, Yining; Chen, Yang; Shi, Luyao

2015-03-01

Projection and back-projection are the most computationally expensive parts of Computed Tomography (CT) reconstruction, and parallelization strategies using GPU computing techniques have been introduced to accelerate them. In this paper we present a new parallelization scheme for both projection and back-projection. The proposed method is based on CUDA technology developed by NVIDIA Corporation. Instead of building a complex model, we aimed at optimizing the existing algorithm and making it suitable for CUDA implementation so as to gain fast computation speed. Besides making use of the texture fetching operation, which provides faster interpolation, we fixed the number of samples in the computation of projection to ensure the synchronization of blocks and threads, thus preventing the latency caused by inconsistent computational complexity. Experimental results demonstrate the computational efficiency and imaging quality of the proposed method.
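The two ideas in this abstract, interpolated sampling and a fixed sample count per ray, can be illustrated with a small CPU stand-in. The sketch below is not the authors' CUDA kernel: `bilinear` imitates in software the interpolation that a GPU texture fetch performs in hardware, and `project_ray` integrates along one ray using the same number of samples regardless of ray length, so every ray costs the same amount of work. Both function names, the 2D setting, and the zero value returned outside the image are assumptions made for the example.

```python
import math


def bilinear(img, x, y):
    """Bilinearly interpolate img (list of rows) at continuous (x, y).

    Returns 0.0 outside the image, a simple boundary convention
    chosen for this sketch.
    """
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def project_ray(img, start, end, n_samples):
    """Approximate the line integral of img from start to end.

    Uses exactly n_samples midpoint samples for every ray, so all
    rays perform the same number of interpolated fetches.
    """
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    step = length / n_samples
    total = 0.0
    for k in range(n_samples):
        t = (k + 0.5) / n_samples
        total += bilinear(img, x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return total * step


# A uniform 4x4 image: the integral along a ray equals the ray length.
image = [[1.0] * 4 for _ in range(4)]
value = project_ray(image, (0.0, 1.5), (3.0, 1.5), 8)
```

On a GPU, one thread would typically evaluate one ray; giving every ray the same sample count keeps threads in a block from waiting on a few long-running neighbours.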
Electronic and optical properties of GaN/AlN quantum dots with adjacent threading dislocations

NASA Astrophysics Data System (ADS)

Ye, Han; Lu, Peng-Fei; Yu, Zhong-Yuan; Yao, Wen-Jie; Chen, Zhi-Hui; Jia, Bo-Yong; Liu, Yu-Min

2010-04-01

We present a theory to simulate a coherent GaN QD with an adjacent pure edge threading dislocation by using a finite element method. The piezoelectric effects and the strain modified band edges are investigated in the framework of multi-band k · p theory to calculate the electron and the heavy hole energy levels. The linear optical absorption coefficients corresponding to the interband ground state transition are obtained via the density matrix approach and perturbation expansion method. The results indicate that the strain distribution of the threading dislocation affects the electronic structure. Moreover, the ground state transition behaviour is also influenced by the position of the adjacent threading dislocation.
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In a heterogeneous multi-core architecture, CPU and GPU processors are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently and both access the last-level cache (LLC). CPUs and GPUs have different memory access characteristics, so they differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. In contrast, GPU applications can tolerate an increase in memory access latency when there is sufficient thread-level parallelism. Taking into account this latency tolerance of GPU programs, this paper presents a method that lets GPU applications access memory directly, leaving much of the LLC space to CPU applications, which improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is cache insensitive, the overall performance of the system is improved significantly.
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

Real-time processing is becoming more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis, and as a consequence many different approaches to image segmentation have been proposed. The watershed transform is a well-known image segmentation tool, and it is a very data-intensive task. To accelerate watershed algorithms and obtain real-time processing, parallel architectures and programming models for multicore computing have been developed. This paper surveys approaches to the parallel implementation of sequential watershed algorithms on multicore general-purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we compare various parallelizations of sequential watershed algorithms on shared memory multicore architectures. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multi-processing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
A pervasive parallel framework for visualization: final report for FWP 10-014707

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.

2014-01-01

We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.
Modeling of outgassing and matrix decomposition in carbon-phenolic composites

NASA Technical Reports Server (NTRS)

Mcmanus, Hugh L.

1994-01-01

Work done in the period Jan. - June 1994 is summarized. Two threads of research have been followed. First, the thermodynamics approach was used to model the chemical and mechanical responses of composites exposed to high temperatures. The thermodynamics approach lends itself easily to the usage of variational principles. This thermodynamic-variational approach has been applied to the transpiration cooling problem. The second thread is the development of a better algorithm to solve the governing equations resulting from the modeling. Explicit finite difference method is explored for solving the governing nonlinear, partial differential equations. The method allows detailed material models to be included and solution on massively parallel supercomputers. To demonstrate the feasibility of the explicit scheme in solving nonlinear partial differential equations, a transpiration cooling problem was solved. Some interesting transient behaviors were captured such as stress waves and small spatial oscillations of transient pressure distribution.
Characterizing Task-Based OpenMP Programs

PubMed Central

Muddukrishna, Ananya; Jonsson, Peter A.; Brorsson, Mats

2015-01-01

Programmers struggle to understand performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. Programmers can tune performance faster and understand performance tradeoffs more effectively than existing tools by using our method to characterize task-based performance. PMID:25860023
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

NASA Astrophysics Data System (ADS)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.; Papka, M. E.; Benjamin, D. P.

2017-01-01

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms

NASA Astrophysics Data System (ADS)

Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.

2011-12-01

Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further, the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 and 16 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to its size, places huge burdens on the memory infrastructure of today's processors.
Distributed parallel computing in stochastic modeling of groundwater systems.

PubMed

Dong, Yanhui; Li, Guomin; Xu, Haizhen

2013-03-01

Stochastic modeling is a rapidly evolving, popular approach to the study of the uncertainty and heterogeneity of groundwater systems. However, the use of Monte Carlo-type simulations to solve practical groundwater problems often encounters computational bottlenecks that hinder the acquisition of meaningful results. To improve the computational efficiency, a system that combines stochastic model generation with MODFLOW-related programs and distributed parallel processing is investigated. The distributed computing framework, called the Java Parallel Processing Framework, is integrated into the system to allow the batch processing of stochastic models in distributed and parallel systems. As an example, the system is applied to the stochastic delineation of well capture zones in the Pinggu Basin in Beijing. Through the use of 50 processing threads on a cluster with 10 multicore nodes, the execution times of 500 realizations are reduced to 3% compared with those of a serial execution. Through this application, the system demonstrates its potential in solving difficult computational problems in practical stochastic modeling. © 2012, The Author(s). Groundwater © 2012, National Ground Water Association.
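The reported figures can be turned into the usual speedup and parallel-efficiency numbers with a two-line calculation. The sketch below reads "reduced to 3%" as meaning that the parallel run takes 3% of the serial run time, which is an interpretation rather than something the abstract states explicitly; the helper names `speedup` and `efficiency` are introduced here for illustration.

```python
def speedup(serial_time, parallel_time):
    """Classical speedup: how many times faster the parallel run is."""
    return serial_time / parallel_time


def efficiency(speedup_value, n_workers):
    """Parallel efficiency: speedup per worker, between 0 and 1 ideally."""
    return speedup_value / n_workers


# Serial time normalized to 1.0; parallel time assumed to be 3% of it.
s = speedup(1.0, 0.03)      # roughly 33x
e = efficiency(s, 50)       # roughly two thirds on 50 processing threads

print(f"speedup ~ {s:.1f}x, efficiency ~ {e:.0%}")
```

Under that reading, the 50 threads deliver about a 33-fold speedup, so roughly a third of the ideal linear gain is lost, plausibly to scheduling, communication, and I/O in the distributed framework.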
Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications

DTIC Science & Technology

1995-01-01

possible to determine communication points. For this version, a C program spawning Posix threads and using semaphores to synchronize would have to...performance such as the time required for network communication and synchronization as well as issues of asynchrony and memory hierarchy. For example...enhances reusability. Process (or task) parallel computations can also be succinctly expressed with a small set of process creation and synchronization
THREAD: A programming environment for interactive planning-level robotics applications

NASA Technical Reports Server (NTRS)

Beahan, John J., Jr.

1989-01-01

THREAD programming language, which was developed to meet the needs of researchers in developing robotics applications that perform such tasks as grasp, trajectory design, sensor data analysis, and interfacing with external subsystems in order to perform servo-level control of manipulators and real time sensing is discussed. The philosophy behind THREAD, the issues which entered into its design, and the features of the language are discussed from the viewpoint of researchers who want to develop algorithms in a simulation environment, and from those who want to implement physical robotics systems. The detailed functions of the many complex robotics algorithms and tools which are part of the language are not explained, but an overall impression of their capability is given.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

PubMed

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

PubMed Central

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
Optics Program Modified for Multithreaded Parallel Computing

NASA Technical Reports Server (NTRS)

Lou, John; Bedding, Dave; Basinger, Scott

2006-01-01

A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application programming interface that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads (POSIX Threads, where "POSIX" stands for Portable Operating System Interface). In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Multi-threaded ATLAS simulation on Intel Knights Landing processors

NASA Astrophysics Data System (ADS)

Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

2017-10-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Event Reconstruction for Many-core Architectures using Java

DOE Office of Scientific and Technical Information (OSTI.GOV)

Graf, Norman A.; /SLAC

Although Moore's Law remains technically valid, the performance enhancements in computing which traditionally resulted from increased CPU speeds ended years ago. Chip manufacturers have chosen to increase the number of core CPUs per chip instead of increasing clock speed. Unfortunately, these extra CPUs do not automatically result in improvements in simulation or reconstruction times. To take advantage of this extra computing power requires changing how software is written. Event reconstruction is globally serial, in the sense that raw data has to be unpacked first, channels have to be clustered to produce hits before those hits are identified as belonging to a track or shower, tracks have to be found and fit before they are vertexed, etc. However, many of the individual procedures along the reconstruction chain are intrinsically independent and are perfect candidates for optimization using multi-core architecture. Threading is perhaps the simplest approach to parallelizing a program and Java includes a powerful threading facility built into the language. We have developed a fast and flexible reconstruction package (org.lcsim) written in Java that has been used for numerous physics and detector optimization studies. In this paper we present the results of our studies on optimizing the performance of this toolkit using multiple threads on many-core architectures.
Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

PubMed

Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

2016-04-01

Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. 
The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time was reduced to 2.9 min on the GPU version, compared to 12.8 min on the twelve-threaded CPU version and 112.5 min on the single-threaded CPU version. Furthermore, the GPU implementation discussed in this work can be adapted for use with other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
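The whole-registration speedups implied by the runtimes above can be worked out directly. The short sketch below uses only the three total times quoted in the abstract; the dictionary keys and the `speedup` helper are illustrative names, not anything from the paper.

```python
# Total registration times reported in the abstract, in minutes.
runtimes_min = {
    "single-threaded CPU": 112.5,
    "twelve-threaded CPU": 12.8,
    "GPU": 2.9,
}


def speedup(baseline, target):
    """Return how many times faster `target` is than `baseline`."""
    return runtimes_min[baseline] / runtimes_min[target]


gpu_vs_single = speedup("single-threaded CPU", "GPU")
gpu_vs_twelve = speedup("twelve-threaded CPU", "GPU")

print(f"GPU vs single-threaded CPU: {gpu_vs_single:.1f}x")  # 38.8x
print(f"GPU vs twelve-threaded CPU: {gpu_vs_twelve:.1f}x")  # 4.4x
```

These end-to-end ratios (about 38.8 and 4.4) are smaller than the 112-fold and 11-fold figures quoted for the SSTVD cost and cost gradient computations, which the abstract reports as average time per iteration at the highest resolution rather than for the full registration.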
Automation of Data Traffic Control on DSM Architecture

NASA Technical Reports Server (NTRS)

Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry

2001-01-01

The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, to achieve a good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestions. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition and page size control and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestions in Fortran array oriented codes and advises the user on code transformations for improving data traffic in the application.

Parallel Eclipse Project Checkout

NASA Technical Reports Server (NTRS)

Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Powell, Mark W.; Bachmann, Andrew G.

2011-01-01

Parallel Eclipse Project Checkout (PEPC) is a program written to leverage parallelism and to automate the checkout process of plug-ins created in Eclipse RCP (Rich Client Platform). Eclipse plug-ins can be aggregated in a feature project. This innovation digests a feature description (XML file) and automatically checks out all of the plug-ins listed in the feature. This resolves the issue of manually checking out each plug-in required to work on the project. To minimize the amount of time necessary to check out the plug-ins, this program makes the plug-in checkouts parallel. After parsing the feature, a checkout request is inserted for each plug-in in the feature. These requests are handled by a thread pool with a configurable number of threads. By checking out the plug-ins in parallel, the checkout process is streamlined before getting started on the project. For instance, projects that took 30 minutes to check out now take less than 5 minutes. The effect is especially clear on a Mac, which has a network monitor displaying the bandwidth use. When running the client from a developer's home, the checkout process now saturates the bandwidth in order to get all the plug-ins checked out as fast as possible. For comparison, a checkout process that ranged from 8-200 Kbps from a developer's home is now able to saturate a pipe of 1.3 Mbps, resulting in significantly faster checkouts. Eclipse IDE (integrated development environment) tries to build a project as soon as it is downloaded. As part of another optimization, this innovation programmatically tells Eclipse to stop building while checkouts are happening, which dramatically reduces lock contention and enables plug-ins to continue downloading until all of them finish. Furthermore, the software re-enables automatic building, and forces Eclipse to do a clean build once it finishes checking out all of the plug-ins. This software is fully generic and does not contain any NASA-specific code. 
It can be applied to any Eclipse-based repository with a similar structure. It also can apply build parameters and preferences automatically at the end of the checkout.
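The pattern described in this record (one request per plug-in, handled by a thread pool with a configurable number of threads) can be sketched roughly as follows. This is a minimal illustration and not the PEPC code: `checkout_plugin` is a hypothetical stand-in that only sleeps to mimic a network-bound checkout, and the plug-in names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def checkout_plugin(name):
    """Hypothetical stand-in for one repository checkout (network-bound)."""
    time.sleep(0.05)  # simulate waiting on the network
    return f"{name}: checked out"


def checkout_all(plugins, num_threads=4):
    """Queue one checkout request per plug-in on a fixed-size thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as the input list.
        return list(pool.map(checkout_plugin, plugins))


plugins = [f"plugin.{i}" for i in range(8)]  # invented plug-in names
results = checkout_all(plugins, num_threads=4)
for line in results:
    print(line)
```

Because the simulated work is waiting rather than computing, threads overlap well here even in Python; the pool size plays the role of the configurable thread count mentioned in the abstract.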
A cooperative strategy for parameter estimation in large scale systems biology models.

PubMed

Villaverde, Alejandro F; Egea, Jose A; Banga, Julio R

2012-06-22

Mathematical models play a key role in systems biology: they summarize the currently available knowledge in a way that allows to make experimentally verifiable predictions. Model calibration consists of finding the parameters that give the best fit to a set of experimental data, which entails minimizing a cost function that measures the goodness of this fit. Most mathematical models in systems biology present three characteristics which make this problem very difficult to solve: they are highly non-linear, they have a large number of parameters to be estimated, and the information content of the available experimental data is frequently scarce. Hence, there is a need for global optimization methods capable of solving this problem efficiently. A new approach for parameter estimation of large scale models, called Cooperative Enhanced Scatter Search (CeSS), is presented. Its key feature is the cooperation between different programs ("threads") that run in parallel in different processors. Each thread implements a state of the art metaheuristic, the enhanced Scatter Search algorithm (eSS). Cooperation, meaning information sharing between threads, modifies the systemic properties of the algorithm and allows to speed up performance. Two parameter estimation problems involving models related with the central carbon metabolism of E. coli which include different regulatory levels (metabolic and transcriptional) are used as case studies. The performance and capabilities of the method are also evaluated using benchmark problems of large-scale global optimization, with excellent results. The cooperative CeSS strategy is a general purpose technique that can be applied to any model calibration problem. Its capability has been demonstrated by calibrating two large-scale models of different characteristics, improving the performance of previously existing methods in both cases. 
The cooperative metaheuristic presented here can be easily extended to incorporate other global and local search solvers and specific structural information for particular classes of problems.
Parallel satellite orbital situational problems solver for space missions design and control

NASA Astrophysics Data System (ADS)

Atanassov, Atanas Marinov

2016-11-01

Solving different scientific problems for space applications demands implementation of observations, measurements or realization of active experiments during time intervals in which specific geometric and physical conditions are fulfilled. The solving of situational problems for determination of these time intervals when the satellite instruments work optimally is a very important part of all activities on every stage of preparation and realization of space missions. The elaboration of universal, flexible and robust approach for situation analysis, which is easily portable toward new satellite missions, is significant for reduction of missions' preparation times and costs. Every situation problem could be based on one or more situation conditions. Simultaneously solving different kinds of situation problems based on different number and types of situational conditions, each one of them satisfied on different segments of satellite orbit requires irregular calculations. Three formal approaches are presented. First one is related to situation problems description that allows achieving flexibility in situation problem assembling and presentation in computer memory. The second formal approach is connected with developing of situation problem solver organized as processor that executes specific code for every particular situational condition. The third formal approach is related to solver parallelization utilizing threads and dynamic scheduling based on "pool of threads" abstraction and ensures a good load balance. The developed situation problems solver is intended for incorporation in the frames of multi-physics multi-satellite space mission's design and simulation tools.
GPU-based Branchless Distance-Driven Projection and Backprojection

PubMed Central

Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong

2017-01-01

Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm. PMID:29333480
GPU-based Branchless Distance-Driven Projection and Backprojection.

PubMed

Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong

2017-12-01

Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm.
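The three branchless steps named in the abstract (integration, linear interpolation, differentiation) can be illustrated on a one-dimensional resampling problem. The sketch below is an assumption-laden toy, not the authors' GPU implementation: it assumes a uniform source grid, uses plain Python in place of texture-memory hardware interpolation, and the function and parameter names are invented for illustration.

```python
from itertools import accumulate


def dd_project_1d(values, x0, dx, det_edges):
    """Resample a piecewise-constant signal onto detector cells.

    values    : value in each source cell; cell i spans x0+i*dx .. x0+(i+1)*dx
    det_edges : sorted detector-cell boundaries on the same axis
    Returns the average of the source signal over each detector cell.
    """
    n = len(values)
    # Step 1, integration: running integral of the signal at each source boundary.
    cum = [0.0] + list(accumulate(v * dx for v in values))

    def integral_at(x):
        # Step 2, linear interpolation of the running integral at position x.
        t = min(max((x - x0) / dx, 0.0), float(n))  # clamp to the source grid
        i = min(int(t), n - 1)
        return cum[i] + (t - i) * (cum[i + 1] - cum[i])

    f = [integral_at(x) for x in det_edges]
    # Step 3, differentiation: neighbour difference, normalised by cell width.
    return [
        (f[j + 1] - f[j]) / (det_edges[j + 1] - det_edges[j])
        for j in range(len(det_edges) - 1)
    ]


out = dd_project_1d([1.0, 3.0, 5.0, 7.0], x0=0.0, dx=1.0, det_edges=[0.0, 2.0, 4.0])
print(out)  # [2.0, 6.0]
```

Each detector cell is handled by the same arithmetic regardless of how voxel and detector boundaries interleave, which is the property that removes the data-dependent inner-loop branch described in the abstract.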
CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU.

PubMed

Jiang, Hanyu; Ganesan, Narayan

2016-02-27

The HMMER software suite is widely used for analysis of homologous protein and nucleotide sequences with high sensitivity. The latest version of hmmsearch in HMMER 3.x utilizes a heuristic pipeline which consists of the MSV/SSV (Multiple/Single ungapped Segment Viterbi) stage, the P7Viterbi stage and the Forward scoring stage to accelerate homology detection. Since the latest version is highly optimized for performance on modern multi-core CPUs with SSE capabilities, only a few acceleration attempts report speedup. However, the most compute intensive tasks within the pipeline (viz., MSV/SSV and P7Viterbi stages) still stand to benefit from the computational capabilities of massively parallel processors. A Multi-Tiered Parallel Framework (CUDAMPF) implemented on CUDA-enabled GPUs, presented here, offers a finer-grained parallelism for MSV/SSV and Viterbi algorithms. We couple the SIMT (Single Instruction Multiple Threads) mechanism with SIMD (Single Instruction Multiple Data) video instructions with warp-synchronism to achieve high-throughput processing and eliminate thread idling. We also propose a hardware-aware optimal allocation scheme of scarce resources like on-chip memory and caches in order to boost performance and scalability of CUDAMPF. In addition, runtime compilation via NVRTC available with CUDA 7.0 is incorporated into the presented framework, which not only helps unroll the innermost loop to yield up to a 2- to 3-fold speedup over static compilation but also enables dynamic loading and switching of kernels depending on the query model size, in order to achieve optimal performance. CUDAMPF is designed as a hardware-aware parallel framework for accelerating computational hotspots within the hmmsearch pipeline as well as other sequence alignment applications. It achieves significant speedup by exploiting hierarchical parallelism on single GPU and takes full advantage of limited resources based on their own performance features. 
In addition to exceeding performance of other acceleration attempts, comprehensive evaluations against high-end CPUs (Intel i5, i7 and Xeon) show that CUDAMPF yields up to 440 GCUPS for SSV, 277 GCUPS for MSV and 14.3 GCUPS for P7Viterbi all with 100% accuracy, which translates to a maximum speedup of 37.5, 23.1 and 11.6-fold for MSV, SSV and P7Viterbi respectively. The source code is available at https://github.com/Super-Hippo/CUDAMPF.
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

DOE Office of Scientific and Technical Information (OSTI.GOV)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

DOE PAGES

Childers, J. T.; Uram, T. D.; LeCompte, T. J.; ...

2016-09-29

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

DOE Office of Scientific and Technical Information (OSTI.GOV)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Privacy-Preserving Location-Based Query Using Location Indexes and Parallel Searching in Distributed Networks

PubMed Central

Liu, Lei; Zhao, Jing

2014-01-01

An efficient location-based query algorithm for protecting user privacy in distributed networks is given. This algorithm uses the users' location indexes and multiple parallel threads to quickly search for and select all candidate anonymous sets that contain more users and whose location information is more uniformly distributed, which accelerates the temporal-spatial anonymity operations; it also allows users to configure custom privacy-preserving location query requests. The simulated experiment results show that the proposed algorithm can simultaneously serve location queries for more users, improve the performance of the anonymous server, and satisfy the users' anonymous location requests. PMID:24790579
2nd-Order CESE Results For C1.4: Vortex Transport by Uniform Flow

NASA Technical Reports Server (NTRS)

Friedlander, David J.

2015-01-01

The Conservation Element and Solution Element (CESE) method was used as implemented in the NASA research code ez4d. The CESE method is a time-accurate formulation with flux conservation in both space and time. The method treats the discretized derivatives of space and time identically, and while high-order versions exist, the 2nd-order accurate version was used. The ez4d code is an unstructured Navier-Stokes solver coded in C++ with serial and parallel versions available. As part of its architecture, ez4d can use multithreading and the Message Passing Interface (MPI) for parallel runs.
Privacy-preserving location-based query using location indexes and parallel searching in distributed networks.

PubMed

Zhong, Cheng; Liu, Lei; Zhao, Jing

2014-01-01

An efficient location-based query algorithm for protecting user privacy in distributed networks is given. This algorithm uses the users' location indexes and multiple parallel threads to quickly search for and select all candidate anonymous sets that contain more users and whose location information is more uniformly distributed, which accelerates the temporal-spatial anonymity operations; it also allows users to configure custom privacy-preserving location query requests. The simulated experiment results show that the proposed algorithm can simultaneously serve location queries for more users, improve the performance of the anonymous server, and satisfy the users' anonymous location requests.
Metabolic studies of mammalian cells by 31P-NMR using a continuous perfusion technique.

PubMed

Knop, R H; Chen, C W; Mitchell, J B; Russo, A; McPherson, S; Cohen, J S

1984-07-20

Levels of ATP and Pi in metabolically active Chinese hamster lung fibroblasts were monitored noninvasively by 31P-NMR over many hours and under a variety of conditions. The cells were embedded in a matrix of agarose gel in the form of fine threads which were continuously perfused in a standard NMR tube. The small diameter of the thread allows rapid diffusion of metabolites and drugs into the cells. The changes in ATP and Pi levels were followed as a function of time in response to perfusion with a glucose-containing medium, with isotonic saline and with a medium containing 2,4-dinitrophenol, an uncoupler of oxidative phosphorylation. This gel-thread perfusion method should enable routine NMR studies of cellular metabolism, and may have other potential biological applications.
The path toward HEP High Performance Computing

NASA Astrophysics Data System (ADS)

Apostolakis, John; Brun, René; Carminati, Federico; Gheata, Andrei; Wenzel, Sandro

2014-06-01

High Energy Physics code has been known for making poor use of high performance computing architectures. Efforts in optimising HEP code on vector and RISC architectures have yielded limited results and recent studies have shown that, on modern architectures, it achieves a performance between 10% and 50% of the peak one. Although several successful attempts have been made to port selected codes to GPUs, no major HEP code suite has a "High Performance" implementation. With LHC undergoing a major upgrade and a number of challenging experiments on the drawing board, HEP can no longer neglect the less-than-optimal performance of its code and it has to try making the best usage of the hardware. This activity is one of the foci of the SFT group at CERN, which hosts, among others, the Root and Geant4 projects. The activity of the experiments is shared and coordinated via a Concurrency Forum, where the experience in optimising HEP code is presented and discussed. Another activity is the Geant-V project, centred on the development of a high-performance prototype for particle transport. Achieving a good concurrency level on the emerging parallel architectures without a complete redesign of the framework can only be done by parallelizing at event level, or with a much larger effort at track level. Apart from the shareable data structures, this typically implies a multiplication factor in terms of memory consumption compared to the single threaded version, together with sub-optimal handling of event processing tails. Besides this, the low level instruction pipelining of modern processors cannot be used efficiently to speed up the program. We have implemented a framework that allows scheduling vectors of particles to an arbitrary number of computing resources in a fine grain parallel approach. 
The talk will review the current optimisation activities within the SFT group with a particular emphasis on the development perspectives towards a simulation framework able to profit best from the recent technology evolution in computing.
High Performance Computing Based Parallel HIearchical Modal Association Clustering (HPAR HMAC)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Patlolla, Dilip R; Surendran Nair, Sujithkumar; Graves, Daniel A.

For many applications, clustering is a crucial step in order to gain insight into the makeup of a dataset. The best approach to a given problem often depends on a variety of factors, such as the size of the dataset, time restrictions, and soft clustering requirements. The HMAC algorithm seeks to combine the strengths of 2 particular clustering approaches: model-based and linkage-based clustering. One particular weakness of HMAC is its computational complexity. HMAC is not practical for mega-scale data clustering. For high-definition imagery, a user would have to wait months or years for a result; for a 16-megapixel image, the estimated runtime skyrockets to over a decade! To improve the execution time of HMAC, it is reasonable to consider a multi-core implementation that utilizes available system resources. An existing implementation (Ray and Cheng 2014) divides the dataset into N partitions, one for each thread, prior to executing the HMAC algorithm. This implementation benefits from 2 types of optimization: parallelization and divide-and-conquer. By running each partition in parallel, the program is able to accelerate computation by utilizing more system resources. Although the parallel implementation provides considerable improvement over the serial HMAC, it still suffers from poor computational complexity, O(N^2). Once the maximum number of cores on a system is exhausted, the program exhibits slower behavior. We now consider a modification to HMAC that involves a recursive partitioning scheme. Our modification aims to exploit divide-and-conquer benefits seen by the parallel HMAC implementation. At each level in the recursion tree, partitions are divided into 2 sub-partitions until a threshold size is reached. When the partition can no longer be divided without falling below threshold size, the base HMAC algorithm is applied. This results in a significant speedup over the parallel HMAC.
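The recursive partitioning scheme described above can be sketched in a few lines. This is only an illustration of the control flow: the abstract does not say how partitions are chosen or how leaf results are merged, so the sketch splits a list by index and does no merging, and `mean_of` is a placeholder for the base step, not HMAC. As general arithmetic, if a base step costs on the order of n^2 for n points, then k equal partitions cost about k·(n/k)^2 = n^2/k in total, which is where a divide-and-conquer gain would come from.

```python
def recursive_partition(points, threshold, base_cluster):
    """Halve `points` until a further split would fall below `threshold`,
    then apply `base_cluster` to each leaf partition."""
    half = len(points) // 2
    if half < threshold:
        # Cannot divide without dropping below the threshold: run the base step.
        return [base_cluster(points)]
    left, right = points[:half], points[half:]
    return (recursive_partition(left, threshold, base_cluster)
            + recursive_partition(right, threshold, base_cluster))


def mean_of(points):
    """Placeholder for the base clustering step (NOT HMAC): one summary value."""
    return sum(points) / len(points)


data = list(range(16))
leaves = recursive_partition(data, threshold=4, base_cluster=mean_of)
print(leaves)  # [1.5, 5.5, 9.5, 13.5]
```

With 16 points and a threshold of 4, the recursion stops at four leaves of four points each, because halving a four-point partition would produce partitions of size two.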
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

DOE PAGES

Wang, Bei; Ethier, Stephane; Tang, William; ...

2017-06-29

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, Bei; Ethier, Stephane; Tang, William

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
Comprehensive Synchronization Elimination for Java (PREPRINT)

DTIC Science & Technology

2003-01-01

[Figure residue: legend entries thread-local, reentrant, enclosed; benchmarks cassowary, javac, javacup, javadoc, jgl, jlex, pizza, array, instantdb, jlogo, plasma, slice; axis ticks from 0.2 to 1.6] Figure 6...1998. [DR98] P. Diniz and M. Rinard. Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-based Programs. In Journal
Parallel heuristics for scalable community detection

DOE PAGES

Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth

2015-08-14

Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing real speedups of up to 16x using 32 threads.
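The modularity score that the Louvain method optimizes can be illustrated with a short sketch. This is not the authors' implementation; it only evaluates the standard Newman modularity, Q = sum over communities of (L_c/m - (d_c/2m)^2), for an undirected, unweighted graph. The function name, the example graph, and the two partitions are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    intra = defaultdict(int)    # edges with both endpoints in community c
    degree = defaultdict(int)   # sum of node degrees in community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical input: two triangles joined by one bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TWO_GROUPS = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_GROUP = {n: "a" for n in range(6)}
```

Splitting the example graph into its two triangles gives Q = 5/14, while lumping every node into one community gives Q = 0, which is the kind of difference a modularity-maximizing heuristic searches for.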
21. View of endlift slide on pedestal and threaded shaft ...

Library of Congress Historic Buildings Survey, Historic Engineering Record, Historic Landscapes Survey

21. View of end-lift slide on pedestal and threaded shaft with level gears. Curved deck joint at underside of roadway deck is seen, as well as submarine electrical cables resting on the masonry pier. (Nov. 30, 1988) - University Heights Bridge, Spanning Harlem River at 207th Street & West Harlem Road, New York County, NY

Magnetoresistance devices based on single-walled carbon nanotubes

NASA Astrophysics Data System (ADS)

Hod, Oded; Rabani, Eran; Baer, Roi

2005-08-01

We demonstrate the physical principles for the construction of a nanometer-sized magnetoresistance device based on the Aharonov-Bohm effect [Phys. Rev. 115, 485 (1959)]. The proposed device is made of a short single-walled carbon nanotube (SWCNT) placed on a substrate and coupled to a tip/contacts. We consider conductance due to the motion of electrons along the circumference of the tube (as opposed to the motion parallel to its axis). We find that the circumference conductance is sensitive to magnetic fields threading the SWCNT due to the Aharonov-Bohm effect, and show that by retracting the tip/contacts, so that the coupling to the SWCNT is reduced, very high sensitivity to the threading magnetic field develops. This is due to the formation of a narrow resonance through which the tunneling current flows. Using a bias potential the resonance can be shifted to low magnetic fields, allowing the control of conductance with magnetic fields of the order of 1 T.
Effective Vectorization with OpenMP 4.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor's potential performance that, if SIMD is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.
Acceleration of Semiempirical QM/MM Methods through Message Passage Interface (MPI), Hybrid MPI/Open Multiprocessing, and Self-Consistent Field Accelerator Implementations.

PubMed

Ojeda-May, Pedro; Nam, Kwangho

2017-08-08

The strategy and implementation of scalable and efficient semiempirical (SE) QM/MM methods in CHARMM are described. The serial version of the code was first profiled to identify routines that required parallelization. Afterward, the code was parallelized and accelerated with three approaches. The first approach was the parallelization of the entire QM/MM routines, including the Fock matrix diagonalization routines, using the CHARMM message passage interface (MPI) machinery. In the second approach, two different self-consistent field (SCF) energy convergence accelerators were implemented using density and Fock matrices as targets for their extrapolations in the SCF procedure. In the third approach, the entire QM/MM and MM energy routines were accelerated by implementing the hybrid MPI/open multiprocessing (OpenMP) model in which both the task- and loop-level parallelization strategies were adopted to balance loads between different OpenMP threads. The present implementation was tested on two solvated enzyme systems (including <100 QM atoms) and an SN2 symmetric reaction in water. The MPI version exceeded existing SE QM methods in CHARMM, which include the SCC-DFTB and SQUANTUM methods, by at least 4-fold. The use of SCF convergence accelerators further accelerated the code by ∼12-35% depending on the size of the QM region and the number of CPU cores used. Although the MPI version displayed good scalability, the performance was diminished for large numbers of MPI processes due to the overhead associated with MPI communications between nodes. This issue was partially overcome by the hybrid MPI/OpenMP approach which displayed a better scalability for a larger number of CPU cores (up to 64 CPUs in the tested systems).
P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.

PubMed

Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang

2017-03-14

An increasing number of studies use whole-genome DNA methylation detection, one of the most important parts of epigenetics research, to find significant relationships between DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping the bisulfite-treated sequence to the whole genome has been the main method to study DNA cytosine methylation. However, most of today's related tools suffer from inaccuracy and long running times. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. Using an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt can analyze and predict the DNA methylation status. But when Hint-Hunt predicts DNA methylation status on large-scale datasets, it still suffers from slow speed and low temporal-spatial efficiency. To address the cost of Smith-Waterman dynamic programming and the low temporal-spatial efficiency, we further designed a deeply parallelized whole-genome DNA methylation detection tool ("P-Hint-Hunt") on the Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up for processing large-scale datasets, and it can run on both CPUs and Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on the TH-2 supercomputer at different scales. The experimental results show that our tools eliminate the deviation caused by bisulfite treatment in the mapping procedure and that the multi-level parallel program yields a 48-times speed-up with 64 threads. P-Hint-Hunt achieves a deep acceleration on the CPU and Intel Xeon Phi heterogeneous platform, taking full advantage of both multi-core (CPU) and many-core (Phi) processors.
Implementation of 5-layer thermal diffusion scheme in weather research and forecasting model with Intel Many Integrated Cores

NASA Astrophysics Data System (ADS)

Huang, Melin; Huang, Bormin; Huang, Allen H.

2014-10-01

For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several Land-Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme all together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation as there are no interactions among horizontal grid points. The features, efficient parallelization and vectorization essentials, of Intel Many Integrated Core (MIC) architecture allow us to optimize this WRF 5-layer thermal diffusion scheme. In this work, we present the results of the computing performance on this scheme with Intel MIC architecture. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.1x. Accordingly, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
Parallel mutual information estimation for inferring gene regulatory networks on GPUs

PubMed Central

2011-01-01

Background: Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced B-spline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity. Results: We present a new approach to accelerate the B-spline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multi-threaded implementation on a quad-core CPU for large microarray datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time. Conclusions: CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multi-threaded implementation by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:21672264
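For orientation, the simple histogram-based estimator that the abstract uses as its point of comparison can be sketched in a few lines. This is the plain plug-in baseline, not the B-spline estimator and not CUDA-MI; the function name and the equal-width binning are assumptions made for illustration.

```python
import math
from collections import Counter

def histogram_mi(xs, ys, bins=4):
    """Plug-in mutual information estimate (in bits) from equal-width bins."""
    def binned(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0   # avoid zero width for constant data
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = binned(xs), binned(ys)
    n = len(xs)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # Sum of p(i,j) * log2(p(i,j) / (p(i) * p(j))) over the occupied cells.
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )
```

Each sample falls in exactly one bin, which is what makes this estimator coarse; a B-spline estimator instead spreads each sample over several neighboring bins with fractional weights.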
Jali - Unstructured Mesh Infrastructure for Multi-Physics Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Garimella, Rao V; Berndt, Markus; Coon, Ethan

2017-04-13

Jali is a parallel unstructured mesh infrastructure library designed for use by multi-physics simulations. It supports 2D and 3D arbitrary polyhedral meshes distributed over hundreds to thousands of nodes. Jali can read and write Exodus II meshes along with fields and sets on the mesh and support for other formats is partially implemented or is (https://github.com/MeshToolkit/MSTK), an open source general purpose unstructured mesh infrastructure library from Los Alamos National Laboratory. While it has been made to work with other mesh frameworks such as MOAB and STKmesh in the past, support for maintaining the interface to these frameworks has been suspended for now. Jali supports distributed as well as on-node parallelism. Support of on-node parallelism is through direct use of the mesh in multi-threaded constructs or through the use of "tiles" which are submeshes or sub-partitions of a partition destined for a compute node.
Performance enhancement of various real-time image processing techniques via speculative execution

NASA Astrophysics Data System (ADS)

Younis, Mohamed F.; Sinha, Purnendu; Marlowe, Thomas J.; Stoyenko, Alexander D.

1996-03-01

In real-time image processing, an application must satisfy a set of timing constraints while ensuring the semantic correctness of the system. Because of the natural structure of digital data, pure data and task parallelism have been used extensively in real-time image processing to accelerate the handling time of image data. These types of parallelism are based on splitting the execution load performed by a single processor across multiple nodes. However, execution of all parallel threads is mandatory for correctness of the algorithm. On the other hand, speculative execution is an optimistic execution of part(s) of the program based on assumptions on program control flow or variable values. Rollback may be required if the assumptions turn out to be invalid. Speculative execution can enhance average, and sometimes worst-case, execution time. In this paper, we target various image processing techniques to investigate applicability of speculative execution. We identify opportunities for safe and profitable speculative execution in image compression, edge detection, morphological filters, and blob recognition.
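The idea of optimistic execution with rollback can be sketched as follows. This is a generic illustration, not the scheme from the paper: the helper name `speculate` and its arguments are hypothetical, and the speculative work is assumed to be free of side effects so that discarding its result is a sufficient rollback.

```python
from concurrent.futures import ThreadPoolExecutor

def speculate(condition, likely, fallback, arg):
    """Run `likely(arg)` optimistically while `condition(arg)` is evaluated."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        guess = pool.submit(likely, arg)       # speculative work
        check = pool.submit(condition, arg)    # the test being predicted
        if check.result():
            return guess.result()              # assumption held: keep result
    return fallback(arg)                       # rollback: discard and redo
```

For instance, `speculate(lambda x: x > 0, lambda x: x * 2, lambda x: -x, 3)` keeps the speculative result, while the same call with `-3` discards it and runs the fallback. The average case improves when the assumption usually holds; the worst case pays for the wasted speculative work.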
The Software Correlator of the Chinese VLBI Network

NASA Technical Reports Server (NTRS)

Zheng, Weimin; Quan, Ying; Shu, Fengchun; Chen, Zhong; Chen, Shanshan; Wang, Weihua; Wang, Guangli

2010-01-01

The software correlator of the Chinese VLBI Network (CVN) has played an irreplaceable role in the CVN routine data processing, e.g., in the Chinese lunar exploration project. This correlator will be upgraded to process geodetic and astronomical observation data. In the future, with several new stations joining the network, CVN will carry out crustal movement observations, quick UT1 measurements, astrophysical observations, and deep space exploration activities. For the geodetic or astronomical observations, we need a wide-band 10-station correlator. For spacecraft tracking, a realtime and highly reliable correlator is essential. To meet the scientific and navigation requirements of CVN, two parallel software correlators in the multiprocessor environments are under development. A high speed, 10-station prototype correlator using the mixed Pthreads and MPI (Message Passing Interface) parallel algorithm on a computer cluster platform is being developed. Another real-time software correlator for spacecraft tracking adopts the thread-parallel technology, and it runs on the SMP (Symmetric Multiple Processor) servers. Both correlators have the characteristic of flexible structure and scalability.
Screw-Thread Standards for Federal Services, 1957. Handbook H28 (1957), Part 3

DTIC Science & Technology

1957-09-01

MOUNTING THREADS PHOTOGRAPHIC EQUIPMENT THREADS ISO METRIC THREADS; MISCELLANEOUS THREADS CLASS 5 INTERFERENCE-FIT THREADS, TRIAL STANDARD WRENCH...Bibliography on measurement of pitch diameter by means of wires 60 Appendix 14. Metric screw-thread standards 61 1. ISO thread profiles...61 2. Standard series for ISO metric threads 62 3. Designations for ISO metric threads 62 Tables Page Table XII. 1.—Basic
: A Scalable and Transparent System for Simulating MPI Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perumalla, Kalyan S

2010-01-01

is a scalable, transparent system for experimenting with the execution of parallel programs on simulated computing platforms. The level of simulated detail can be varied for application behavior as well as for machine characteristics. Unique features of are repeatability of execution, scalability to millions of simulated (virtual) MPI ranks, scalability to hundreds of thousands of host (real) MPI ranks, portability of the system to a variety of host supercomputing platforms, and the ability to experiment with scientific applications whose source-code is available. The set of source-code interfaces supported by is being expanded to support a wider set of applications, and MPI-based scientific computing benchmarks are being ported. In proof-of-concept experiments, has been successfully exercised to spawn and sustain very large-scale executions of an MPI test program given in source code form. Low slowdowns are observed, due to its use of purely discrete event style of execution, and due to the scalability and efficiency of the underlying parallel discrete event simulation engine, sik. In the largest runs, has been executed on up to 216,000 cores of a Cray XT5 supercomputer, successfully simulating over 27 million virtual MPI ranks, each virtual rank containing its own thread context, and all ranks fully synchronized by virtual time.
Scalable Metropolis Monte Carlo for simulation of hard shapes

NASA Astrophysics Data System (ADS)

Anderson, Joshua A.; Eric Irrgang, M.; Glotzer, Sharon C.

2016-07-01

We design and implement a scalable hard particle Monte Carlo simulation toolkit (HPMC), and release it open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. We employ BVH trees instead of cell lists on the CPU for fast performance, especially with large particle size disparity, and optimize inner loops with SIMD vector intrinsics on the CPU. Our GPU kernel proposes many trial moves in parallel on a checkerboard and uses a block-level queue to redistribute work among threads and avoid divergence. HPMC supports a wide variety of shape classes, including spheres/disks, unions of spheres, convex polygons, convex spheropolygons, concave polygons, ellipsoids/ellipses, convex polyhedra, convex spheropolyhedra, spheres cut by planes, and concave polyhedra. NVT and NPT ensembles can be run in 2D or 3D triclinic boxes. Additional integration schemes permit Frenkel-Ladd free energy computations and implicit depletant simulations. In a benchmark system of a fluid of 4096 pentagons, HPMC performs 10 million sweeps in 10 min on 96 CPU cores on XSEDE Comet. The same simulation would take 7.6 h in serial. HPMC also scales to large system sizes, and the same benchmark with 16.8 million particles runs in 1.4 h on 2048 GPUs on OLCF Titan.
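Why a checkerboard permits many trial moves at once can be shown with a small sketch. It is a generic 2D four-color checkerboard on a periodic grid with even dimensions, not HPMC's actual GPU kernel; both function names are hypothetical.

```python
def checkerboard_sets(nx, ny):
    """Group the cells of an nx-by-ny grid into four checkerboard colors."""
    sets = {c: [] for c in range(4)}
    for i in range(nx):
        for j in range(ny):
            sets[(i % 2) + 2 * (j % 2)].append((i, j))
    return sets

def are_neighbors(a, b, nx, ny):
    """True if two distinct cells touch (including diagonals, periodic box)."""
    dx = min(abs(a[0] - b[0]), nx - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), ny - abs(a[1] - b[1]))
    return a != b and max(dx, dy) <= 1
```

Cells that share a color are never adjacent, so moves confined to cells of one color cannot interact with each other through a shared cell boundary and can be processed by different threads; the colors are then visited in turn.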
High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

PubMed

Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

2008-08-01

The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because of the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 × 10^6 to 14.2 × 10^6 pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. 
Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
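The TPMI figures can be cross-checked with simple arithmetic. The sketch below assumes that the shortest reported GPU time (1.8 s) belongs to the smallest image (2.0 × 10^6 pixels) and the longest (13.5 s) to the largest (14.2 × 10^6 pixels); the abstract gives only the ranges, so that pairing is an assumption.

```python
def tpmi(seconds, pixels, iterations):
    """Time per megapixel per iteration, in seconds (spmi)."""
    return seconds / (pixels / 1e6) / iterations

small = tpmi(1.8, 2.0e6, 100)     # smallest image, 100 iterations
large = tpmi(13.5, 14.2e6, 100)   # largest image, 100 iterations

# Ratios of the reported mean TPMI values (CPU over GPU).
single_thread_speedup = 0.527 / 0.00916
multi_thread_speedup = 0.335 / 0.00916
```

Under that pairing both endpoints come out near 0.009 spmi, close to the reported GPU mean of 0.00916 spmi, and the ratios of the mean TPMI values fall inside the quoted 55-61 and 34-39 speedup ranges.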
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ma, Kwan-Liu

Most of today’s visualization libraries and applications are based off of what is known today as the visualization pipeline. In the visualization pipeline model, algorithms are encapsulated as “filtering” components with inputs and outputs. These components can be combined by connecting the outputs of one filter to the inputs of another filter. The visualization pipeline model is popular because it provides a convenient abstraction that allows users to combine algorithms in powerful ways. Unfortunately, the visualization pipeline cannot run effectively on exascale computers. Experts agree that the exascale machine will comprise processors that contain many cores. Furthermore, physical limitations will prevent data movement in and out of the chip (that is, between main memory and the processing cores) from keeping pace with improvements in overall compute performance. To use these processors to their fullest capability, it is essential to carefully consider memory access. This is where the visualization pipeline fails. Each filtering component in the visualization library is expected to take a data set in its entirety, perform some computation across all of the elements, and output the complete results. The process of iterating over all elements must be repeated in each filter, which is one of the worst possible ways to traverse memory when trying to maximize the number of executions per memory access. This project investigates a new type of visualization framework that exhibits a pervasive parallelism necessary to run on exascale machines. Our framework achieves this by defining algorithms in terms of functors, which are localized, stateless operations. Functors can be composited in much the same way as filters in the visualization pipeline. But, functors’ design allows them to be concurrently running on massive amounts of lightweight threads. 
Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale computer. This project concludes with a functional prototype containing pervasively parallel algorithms that perform demonstrably well on many-core processors. These algorithms are fundamental for performing data analysis and visualization at extreme scale.
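The contrast between whole-data-set filters and composable stateless functors can be sketched briefly. This is an illustration of the idea only, not the project's framework or API; `compose`, `scale`, and `offset` are hypothetical names.

```python
from functools import reduce

def compose(*functors):
    """Fuse stateless per-element functors into a single functor."""
    return lambda x: reduce(lambda acc, f: f(acc), functors, x)

def scale(x):       # hypothetical functor: multiply by a constant
    return 2.0 * x

def offset(x):      # hypothetical functor: shift by a constant
    return x + 1.0

data = [0.0, 1.0, 2.0, 3.0]

# Pipeline style: each filter walks the whole data set and materializes it.
pipeline_result = [offset(v) for v in [scale(v) for v in data]]

# Functor style: one traversal; each element is independent, so the map
# could be handed to any number of lightweight threads.
fused = compose(scale, offset)
fused_result = [fused(v) for v in data]
```

Both styles produce the same values, but the fused version touches each element once, which is the memory-access pattern the abstract argues for.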
Dedicated memory structure holding data for detecting available worker thread(s) and informing available worker thread(s) of task(s) to execute

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chiu, George L.; Eichenberger, Alexandre E.; O'Brien, John K. P.

The present disclosure relates generally to a dedicated memory structure (that is, hardware device) holding data for detecting available worker thread(s) and informing available worker thread(s) of task(s) to execute.
Java Concurrency Guidelines

DTIC Science & Technology

2010-05-01

allows task submission to be decoupled from low-level scheduling and thread management details. It provides the thread pool mechanism that allows a...resource management. In this compliant solution, the client’s doSomething() method provides only the required functionality by implementing the...doSomethingWithFile() method of the LockAction interface, without having to manage the acquisition and release of locks or the open and close opera
OS friendly microprocessor architecture: Hardware level computer security

NASA Astrophysics Data System (ADS)

Jungwirth, Patrick; La Fratta, Patrick

2016-05-01

We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
Rhizobium sp. Degradation of Legume Root Hair Cell Wall at the Site of Infection Thread Origin

PubMed Central

Ridge, Robert W.; Rolfe, Barry G.

1985-01-01

Using a new microinoculation technique, we demonstrated that penetration of Rhizobium sp. into the host root hair cell occurs at 20 to 22 h after inoculation. It did this by dissolving the cell wall matrix, leaving a layer of depolymerized wall microfibrils. Colony growth pressure “stretched” the weakened wall, forming a bulge into an interfacial zone between the wall and plasmalemma. At the same time vesicular bodies, similar to plasmalemmasomes, accumulated at the penetration site in a manner which parallels host-pathogen systems. PMID:16346892
Synchronizing compute node time bases in a parallel computer

DOEpatents

Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip

2015-01-27

Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
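The sequence of steps in the claim can be mimicked with ordinary threads. The sketch below is a toy simulation, not the patented mechanism: one Python thread stands in for each compute node, a `threading.Barrier` for the global barrier, and a `threading.Event` for the pulse signal; the latency values are hypothetical inputs, and the local barrier and wakeup unit are folded into the thread itself.

```python
import threading

def synchronize(latencies):
    """Simulate the described steps with one thread per compute node.

    latencies: hypothetical root-to-node transmission latencies.
    Returns the time base each node ends up with.
    """
    n = len(latencies)
    pulse = threading.Event()                  # stands in for the pulse signal
    global_barrier = threading.Barrier(n + 1)  # all nodes plus the pulse sender
    time_base = [None] * n

    def node(rank):
        global_barrier.wait()                  # enter the global barrier
        pulse.wait()                           # pulse waiter sleeps until woken
        time_base[rank] = latencies[rank]      # time base := latency from root

    threads = [threading.Thread(target=node, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    global_barrier.wait()                      # every node has entered
    pulse.set()                                # send the pulse to all nodes
    for t in threads:
        t.join()
    return time_base
```

The reasoning behind the last step: a node with latency L receives the pulse L time units after it was sent, so setting its clock to L at that moment makes every node's clock read "time since the pulse was sent".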
Synchronizing compute node time bases in a parallel computer

DOEpatents

Chen, Dong; Faraj, Daniel A; Gooding, Thomas M; Heidelberger, Philip

2014-12-30

Synchronizing time bases in a parallel computer that includes compute nodes organized for data communications in a tree network, where one compute node is designated as a root, and, for each compute node: calculating data transmission latency from the root to the compute node; configuring a thread as a pulse waiter; initializing a wakeup unit; and performing a local barrier operation; upon each node completing the local barrier operation, entering, by all compute nodes, a global barrier operation; upon all nodes entering the global barrier operation, sending, to all the compute nodes, a pulse signal; and for each compute node upon receiving the pulse signal: waking, by the wakeup unit, the pulse waiter; setting a time base for the compute node equal to the data transmission latency between the root node and the compute node; and exiting the global barrier operation.
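The sequence in the abstract above can be mimicked on one machine, with threads standing in for compute nodes. What follows is only a minimal sketch under stated assumptions: the tree `TREE`, its link latencies, and all function names are hypothetical, and Python's `threading.Barrier` and `threading.Event` stand in for the hardware barrier network, the pulse signal, and the wakeup unit. The per-node local barrier is omitted because each simulated node has a single thread.

```python
import threading

# Hypothetical tree: child -> (parent, link latency in arbitrary ticks).
TREE = {"n1": ("root", 3), "n2": ("root", 5), "n3": ("n1", 2), "n4": ("n1", 4)}
NODES = ["root"] + sorted(TREE)


def latency_from_root(node):
    """Sum the link latencies on the path from the root down to `node`."""
    total = 0
    while node != "root":
        parent, link = TREE[node]
        total += link
        node = parent
    return total


def synchronize_time_bases():
    time_base = {}
    lock = threading.Lock()
    pulse = threading.Event()                            # stands in for the pulse signal
    global_barrier = threading.Barrier(len(NODES) + 1)   # all nodes plus the pulse sender

    def node_main(name):
        latency = latency_from_root(name)   # data transmission latency from the root
        global_barrier.wait()               # enter the global barrier
        pulse.wait()                        # the pulse waiter sleeps until the pulse
        with lock:
            time_base[name] = latency       # time base := latency from the root

    threads = [threading.Thread(target=node_main, args=(n,)) for n in NODES]
    for t in threads:
        t.start()
    global_barrier.wait()                   # returns once every node has entered
    pulse.set()                             # send the pulse to all nodes
    for t in threads:
        t.join()
    return time_base
```

Calling `synchronize_time_bases()` returns one time base per node, each equal to that node's accumulated latency from the root.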

A task-based parallelism and vectorized approach to 3D Method of Characteristics (MOC) reactor simulation for high performance computing architectures

NASA Astrophysics Data System (ADS)

Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.

2016-05-01

In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
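The dynamic assignment of independent tracks to threads can be pictured with a shared work queue. The sketch below is illustrative only and is not the authors' proxy application: the cross sections in `SIGMA`, the attenuation formula used as a stand-in for the per-segment work, and all function names are assumptions. In CPython the threads show the scheduling pattern rather than a real speedup.

```python
import math
import queue
import threading

SIGMA = [0.1, 0.5, 1.0, 2.0]   # hypothetical cross sections, one per energy group


def sweep_track(segment_lengths):
    """Attenuate a unit flux along one track, for every energy group."""
    psi = [1.0] * len(SIGMA)
    for length in segment_lengths:
        # Inner loop over energy groups: the wide loop a compiler could vectorize.
        psi = [p * math.exp(-s * length) for p, s in zip(psi, SIGMA)]
    return psi


def sweep_all(tracks, n_threads=4):
    """Hand tracks to threads dynamically through a shared work queue."""
    work = queue.Queue()
    for item in enumerate(tracks):
        work.put(item)
    results = [None] * len(tracks)

    def worker():
        while True:
            try:
                index, segments = work.get_nowait()
            except queue.Empty:
                return                      # no tracks left for this thread
            results[index] = sweep_track(segments)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each thread takes the next track only when it finishes its current one, long tracks do not stall a fixed partition, which is the load-balancing effect the abstract describes.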
Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Heber, Gerd; Biswas, Rupak

2000-01-01

The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
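To make the role of the SPMV concrete, here is a minimal serial sketch of CG over a matrix in compressed sparse row (CSR) form. It is a textbook formulation, not the code evaluated in the paper; the 4x4 tridiagonal test matrix and all names are assumptions chosen for illustration.

```python
def spmv(indptr, indices, data, x):
    """y = A @ x for a matrix stored in compressed sparse row (CSR) form."""
    y = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        y.append(sum(data[k] * x[indices[k]] for k in range(start, end)))
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the zero initial guess
    p = list(r)                 # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = spmv(indptr, indices, data, p)   # the single SPMV of the iteration
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical example: the 4x4 tridiagonal matrix with 2 on the diagonal
# and -1 on the off-diagonals, in CSR form.
INDPTR = [0, 2, 5, 8, 10]
INDICES = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
DATA = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
B = [0.0, 0.0, 0.0, 5.0]        # equals A @ [1, 2, 3, 4]
solution = conjugate_gradient(INDPTR, INDICES, DATA, B)
```

The order in which rows and columns are numbered changes which entries of `x` each row of `spmv` touches, which is why the ordering strategies studied in the paper affect cache reuse.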
The Opportunity Cost of Smaller Classes: A State-By-State Spending Analysis. Schools in Crisis: Making Ends Meet

ERIC Educational Resources Information Center

Roza, Marguerite; Ouijdani, Monica

2012-01-01

Two seemingly different threads are in play on the issue of class size. The first is manifested in media reports that tell readers that class sizes are rising to concerning levels. The second thread appears in the work of some researchers and education leaders and suggests that repurposing class-size reduction funds to pay for other reforms may…
Shadow-Bitcoin: Scalable Simulation via Direct Execution of Multi-Threaded Applications

DTIC Science & Technology

2015-08-10

Shadow-Bitcoin: Scalable Simulation via Direct Execution of Multi-threaded Applications Andrew Miller University of Maryland amiller@cs.umd.edu Rob...Shadow plug-in that directly executes the Bitcoin reference client software. To demonstrate the usefulness of this tool, we present novel denial-of...service attacks against the Bitcoin software that exploit low-level implementation artifacts in the Bitcoin reference client; our deterministic
Pre-Assembly of Near-Infrared Fluorescent Multivalent Molecular Probes for Biological Imaging.

PubMed

Peck, Evan M; Battles, Paul M; Rice, Douglas R; Roland, Felicia M; Norquest, Kathryn A; Smith, Bradley D

2016-05-18

A programmable pre-assembly method is described and shown to produce near-infrared fluorescent molecular probes with tunable multivalent binding properties. The modular assembly process threads one or two copies of a tetralactam macrocycle onto a fluorescent PEGylated squaraine scaffold containing a complementary number of docking stations. Appended to the macrocycle periphery are multiple copies of a ligand that is known to target a biomarker. The structure and high purity of each threaded complex was determined by independent spectrometric methods and also by gel electrophoresis. Especially helpful were diagnostic red-shift and energy transfer features in the absorption and fluorescence spectra. The threaded complexes were found to be effective multivalent molecular probes for fluorescence microscopy and in vivo fluorescence imaging of living subjects. Two multivalent probes were prepared and tested for targeting of bone in mice. A pre-assembled probe with 12 bone-targeting iminodiacetate ligands produced more bone accumulation than an analogous pre-assembled probe with six iminodiacetate ligands. Notably, there was no loss in probe fluorescence at the bone target site after 24 h in the living animal, indicating that the pre-assembled fluorescent probe maintained very high mechanical and chemical stability on the skeletal surface. The study shows how this versatile pre-assembly method can be used in a parallel combinatorial manner to produce libraries of near-infrared fluorescent multivalent molecular probes for different types of imaging and diagnostic applications, with incremental structural changes in the number of targeting groups, linker lengths, linker flexibility, and degree of PEGylation.
Hierarchical resilience with lightweight threads.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wheeler, Kyle Bruce

2011-10-01

This paper proposes methodology for providing robustness and resilience for a highly threaded distributed- and shared-memory environment based on well-defined inputs and outputs to lightweight tasks. These inputs and outputs form a failure 'barrier', allowing tasks to be restarted or duplicated as necessary. These barriers must be expanded based on task behavior, such as communication between tasks, but do not prohibit any given behavior. One of the trends in high-performance computing codes seems to be a trend toward self-contained functions that mimic functional programming. Software designers are trending toward a model of software design where their core functions are specified in side-effect free or low-side-effect ways, wherein the inputs and outputs of the functions are well-defined. This provides the ability to copy the inputs to wherever they need to be - whether that's the other side of the PCI bus or the other side of the network - do work on that input using local memory, and then copy the outputs back (as needed). This design pattern is popular among new distributed threading environment designs. Such designs include the Barcelona STARS system, distributed OpenMP systems, the Habanero-C and Habanero-Java systems from Vivek Sarkar at Rice University, the HPX/ParalleX model from LSU, as well as our own Scalable Parallel Runtime effort (SPR) and the Trilinos stateless kernels. This design pattern is also shared by CUDA and several OpenMP extensions for GPU-type accelerators (e.g. the PGI OpenMP extensions).
Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Secchi, Simone; Tumeo, Antonino; Villa, Oreste

Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy in reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.
On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods

PubMed Central

Lee, Anthony; Yau, Christopher; Giles, Michael B.; Doucet, Arnaud; Holmes, Christopher C.

2011-01-01

We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design. PMID:22003276
An Intrinsic Algorithm for Parallel Poisson Disk Sampling on Arbitrary Surfaces.

PubMed

Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying

2013-03-08

Poisson disk sampling plays an important role in a variety of visual computing applications, due to its useful statistical property in distribution and the absence of aliasing artifacts. While many effective techniques have been proposed to generate Poisson disk distributions in Euclidean space, relatively little work has been reported on the surface counterpart. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. We propose a new technique for parallelizing the dart throwing. Rather than the conventional approaches that explicitly partition the spatial domain to generate the samples in parallel, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. It is worth noting that our algorithm is accurate as the generated Poisson disks are uniformly and randomly distributed without bias. Our method is intrinsic in that all the computations are based on the intrinsic metric and are independent of the embedding space. This intrinsic feature allows us to generate Poisson disk distributions on arbitrary surfaces. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
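The priority-based conflict resolution can be illustrated in the plane. The sketch below makes two simplifications that should be stated plainly: it uses the Euclidean distance between 2D points rather than the intrinsic surface metric of the paper, and it evaluates each round with an ordinary loop instead of real threads. The function name and the fixed random seed are assumptions.

```python
import math
import random


def poisson_disk_by_priority(candidates, radius):
    """Pick a maximal subset of `candidates` with pairwise distance >= radius."""
    order = list(range(len(candidates)))
    random.Random(0).shuffle(order)
    # Unique random priorities: a lower rank wins a conflict.
    priority = {cand: rank for rank, cand in enumerate(order)}

    def conflict(i, j):
        return math.dist(candidates[i], candidates[j]) < radius

    active = set(range(len(candidates)))
    accepted = []
    while active:
        # Each candidate's test only reads shared state, so a round could be
        # evaluated by many threads at once; here it is a plain loop.
        winners = {i for i in active
                   if all(priority[i] < priority[j]
                          for j in active if j != i and conflict(i, j))}
        accepted.extend(winners)
        # Drop the winners and every candidate that now conflicts with one.
        active = {i for i in active
                  if i not in winners
                  and not any(conflict(i, w) for w in winners)}
    return [candidates[i] for i in sorted(accepted)]
```

Two candidates that conflict can never both win a round, because only the one with the better priority passes the test, so no spatial partitioning is needed to keep accepted samples apart.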
A triaxial supramolecular weave

NASA Astrophysics Data System (ADS)

Lewandowska, Urszula; Zajaczkowski, Wojciech; Corra, Stefano; Tanabe, Junki; Borrmann, Ruediger; Benetti, Edmondo M.; Stappert, Sebastian; Watanabe, Kohei; Ochs, Nellie A. K.; Schaeublin, Robin; Li, Chen; Yashima, Eiji; Pisula, Wojciech; Müllen, Klaus; Wennemers, Helma

2017-11-01

Despite recent advances in the synthesis of increasingly complex topologies at the molecular level, nano- and microscopic weaves have remained difficult to achieve. Only a few diaxial molecular weaves exist—these were achieved by templation with metals. Here, we present an extended triaxial supramolecular weave that consists of self-assembled organic threads. Each thread is formed by the self-assembly of a building block comprising a rigid oligoproline segment with two perylene-monoimide chromophores spaced at 18 Å. Upon π stacking of the chromophores, threads form that feature alternating up- and down-facing voids at regular distances. These voids accommodate incoming building blocks and establish crossing points through CH-π interactions on further assembly of the threads into a triaxial woven superstructure. The resulting micrometre-scale supramolecular weave proved to be more robust than non-woven self-assemblies of the same building block. The uniform hexagonal pores of the interwoven network were able to host iridium nanoparticles, which may be of interest for practical applications.
Thread selection according to power characteristics during context switching on compute nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Archer, Charles J.; Blocksome, Michael A.; Randles, Amanda E.

Methods, apparatus, and products are disclosed for thread selection during context switching on a plurality of compute nodes that includes: executing, by a compute node, an application using a plurality of threads of execution, including executing one or more of the threads of execution; selecting, by the compute node from a plurality of available threads of execution for the application, a next thread of execution in dependence upon power characteristics for each of the available threads; determining, by the compute node, whether criteria for a thread context switch are satisfied; and performing, by the compute node, the thread context switch if the criteria for a thread context switch are satisfied, including executing the next thread of execution.
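The select, check, switch sequence can be sketched as follows. The abstract does not say which power characteristics are used or what the switch criteria are, so both policies here are assumptions made for illustration: the next thread is the ready thread with the lowest power draw, and the switch happens only when the current thread exceeds a node power budget. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadInfo:
    name: str
    power_watts: float      # hypothetical per-thread power characteristic
    ready: bool = True


def select_next_thread(available):
    """Choose the ready thread with the lowest power draw (an assumed policy)."""
    ready = [t for t in available if t.ready]
    return min(ready, key=lambda t: t.power_watts) if ready else None


def maybe_context_switch(current, available, node_power_budget):
    """Return the thread to run next; switch only if the assumed criterion holds."""
    candidate = select_next_thread(available)
    criteria_met = current.power_watts > node_power_budget
    if candidate is not None and criteria_met:
        return candidate    # perform the context switch
    return current          # keep running the current thread
```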
Thread selection according to predefined power characteristics during context switching on compute nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

None, None

Methods, apparatus, and products are disclosed for thread selection during context switching on a plurality of compute nodes that includes: executing, by a compute node, an application using a plurality of threads of execution, including executing one or more of the threads of execution; selecting, by the compute node from a plurality of available threads of execution for the application, a next thread of execution in dependence upon power characteristics for each of the available threads; determining, by the compute node, whether criteria for a thread context switch are satisfied; and performing, by the compute node, the thread context switch if the criteria for a thread context switch are satisfied, including executing the next thread of execution.
Modified locking thread form for fastener

NASA Technical Reports Server (NTRS)

Roopnarine, (Inventor); Vranish, John D. (Inventor)

1998-01-01

A threaded fastener has a standard part with a standard thread form characterized by thread walls with a standard included angle, and a modified part complementary to the standard part having a modified thread form characterized by thread walls which are symmetrically inclined with a modified included angle that is different from the standard included angle of the standard part's thread walls, such that the threads of one part make pre-loaded edge contact with the thread walls of the other part. The thread form of the modified part can have an included angle that is greater, less, or compound as compared to the included angle of the standard part. The standard part may be a bolt and the modified part a nut, or vice versa. The modified thread form holds securely even under large vibrational forces, it permits bi-directional use of standard mating threads, is impervious to the build up of tolerances and can be manufactured with a wider range of tolerances without loss of functionality, and distributes loading stresses (per thread) in a manner that decreases the possibility of single thread failure.
Designing Next Generation Massively Multithreaded Architectures for Irregular Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Secchi, Simone; Villa, Oreste

Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory reference aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.
Development and Evaluation of Vectorised and Multi-Core Event Reconstruction Algorithms within the CMS Software Framework

NASA Astrophysics Data System (ADS)

Hauth, T.; Innocente, V.; Piparo, D.

2012-12-01

The processing of data acquired by the CMS detector at LHC is carried out with an object-oriented C++ software framework: CMSSW. With the increasing luminosity delivered by the LHC, the treatment of recorded data requires extraordinarily large computing resources, also in terms of CPU usage. A possible solution to cope with this task is the exploitation of the features offered by the latest microprocessor architectures. Modern CPUs present several vector units, the capacity of which is growing steadily with the introduction of new processor generations. Moreover, an increasing number of cores per die is offered by the main vendors, even on consumer hardware. Most recent C++ compilers provide facilities to take advantage of such innovations, either by explicit statements in the program sources or by automatically adapting the generated machine instructions to the available hardware, without the need to modify the existing code base. Programming techniques to implement reconstruction algorithms and optimised data structures are presented that aim at scalable vectorization and parallelization of the calculations. One of their features is the usage of new language features of the C++11 standard. Portions of the CMSSW framework are illustrated which have been found to be especially profitable for the application of vectorization and multi-threading techniques. Specific utility components have been developed to help vectorization and parallelization. They can easily become part of a larger common library. To conclude, careful measurements are described, which show the execution speedups achieved via vectorised and multi-threaded code in the context of CMSSW.
Cutting thread at flexible endoscopy.

PubMed

Gong, F; Swain, P; Kadirkamanathan, S; Hepworth, C; Laufer, J; Shelton, J; Mills, T

1996-12-01

New thread-cutting techniques were developed for use at flexible endoscopy. A guillotine was designed to follow and cut thread at the endoscope tip. A new method was developed for guiding suture cutters. Efficacy of Nd:YAG laser cutting of threads was studied. Experimental and clinical experience with thread-cutting methods is presented. A 2.4 mm diameter flexible thread-cutting guillotine was constructed featuring two lateral holes with sharp edges through which sutures to be cut are passed. Standard suture cutters were guided by backloading thread through the cutters extracorporeally. A snare cutter was constructed to retrieve objects sewn to tissue. Efficacy and speed of Nd:YAG laser in cutting twelve different threads were studied. The guillotine cut thread faster (p < 0.05) than standard suture cutters. Backloading thread shortened time taken to cut thread (p < 0.001) compared with free-hand cutting. Nd:YAG laser was ineffective in cutting uncolored threads and slower than mechanical cutters. Results of thread cutting in clinical studies using sewing machine (n = 77 cutting episodes in 21 patients), in-vivo experiments (n = 156), and postsurgical cases (n = 15 over 15 years) are presented. New thread-cutting methods are described and their efficacy demonstrated in experimental and clinical studies.
Tool Removes Coil-Spring Thread Inserts

NASA Technical Reports Server (NTRS)

Collins, Gerald J., Jr.; Swenson, Gary J.; Mcclellan, J. Scott

1991-01-01

Tool removes coil-spring thread inserts from threaded holes. Threads into hole, pries insert loose, grips insert, then pulls insert to thread it out of hole. Effects essentially reverse of insertion process to ease removal and avoid further damage to threaded inner surface of hole.
Red blood cell transport mechanisms in polyester thread-based blood typing devices.

PubMed

Nilghaz, Azadeh; Ballerini, David R; Guan, Liyun; Li, Lizi; Shen, Wei

2016-02-01

A recently developed blood typing diagnostic based on a polyester thread substrate has shown great promise for use in medical emergencies and in impoverished regions. The device is easy to use and transport, while also being inexpensive, accurate, and rapid. This study used a fluorescent confocal microscope to delve deeper into how red blood cells were behaving within the polyester thread-based diagnostic at the cellular level, and how plasma separation could be made to visibly occur on the thread, making it possible to identify blood type in a single step. Red blood cells were stained and the plasma phase dyed with fluorescent compounds to enable them to be visualised under the confocal microscope at high magnification. The mechanisms uncovered were in surprising contrast with those found for a similar, paper-based method. Red blood cell aggregates did not flow over each other within the thread substrate as expected, but suffered from a restriction to their flow which resulted in the chromatographic separation of the RBCs from the liquid phase of the blood. It is hoped that these results will lead to the optimisation of the method to enable more accurate and sensitive detection, increasing the range of blood systems that can be detected.
Thread gauge for measuring thread pitch diameters

DOEpatents

Brewster, A.L.

1985-11-19

A thread gauge which attaches to a vernier caliper to measure the thread pitch diameter of both externally threaded and internally threaded parts is disclosed. A pair of anvils are externally threaded with threads having the same pitch as those of the threaded part. Each anvil is mounted on a stem having a ball on which the anvil can rotate to properly mate with the parts to which the anvils are applied. The stems are detachably secured to the caliper blades by attachment collars having keyhole openings for receiving the stems and caliper blades. A set screw is used to secure each collar on its caliper blade. 2 figs.
Thread gauge for measuring thread pitch diameters

DOEpatents

Brewster, Albert L.

1985-01-01

A thread gauge which attaches to a vernier caliper to measure the thread pitch diameter of both externally threaded and internally threaded parts. A pair of anvils are externally threaded with threads having the same pitch as those of the threaded part. Each anvil is mounted on a stem having a ball on which the anvil can rotate to properly mate with the parts to which the anvils are applied. The stems are detachably secured to the caliper blades by attachment collars having keyhole openings for receiving the stems and caliper blades. A set screw is used to secure each collar on its caliper blade.

ng: What next-generation languages can teach us about HENP frameworks in the manycore era

NASA Astrophysics Data System (ADS)

Binet, Sébastien

2011-12-01

Current High Energy and Nuclear Physics (HENP) frameworks were written before multicore systems became widely deployed. A 'single-thread' execution model naturally emerged from that environment, however, this no longer fits into the processing model on the dawn of the manycore era. Although previous work focused on minimizing the changes to be applied to the LHC frameworks (because of the data taking phase) while still trying to reap the benefits of the parallel-enhanced CPU architectures, this paper explores what new languages could bring to the design of the next-generation frameworks. Parallel programming is still in an intensive phase of R&D and no silver bullet exists despite the 30+ years of literature on the subject. Yet, several parallel programming styles have emerged: actors, message passing, communicating sequential processes, task-based programming, data flow programming, ... to name a few. We present the work of the prototyping of a next-generation framework in new and expressive languages (python and Go) to investigate how code clarity and robustness are affected and what are the downsides of using languages younger than FORTRAN/C/C++.
Message Passing on GPUs

NASA Astrophysics Data System (ADS)

Stuart, J. A.

2011-12-01

This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors, and more specifically GPUs. As a case study, we design and implement the "DCGN" API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.
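Sleep-based polling for storing and retrieving messages can be pictured with a small host-side analogue. This is not the DCGN API, whose functions are not given in the abstract; `PolledMailbox`, its methods, and the poll interval are hypothetical, and ordinary Python threads stand in for the thread-groups that the paper maps to ranks.

```python
import threading
import time
from collections import deque


class PolledMailbox:
    """Senders store messages in a shared buffer; the receiver polls and sleeps."""

    def __init__(self, poll_interval=0.001):
        self._messages = deque()
        self._lock = threading.Lock()
        self._poll_interval = poll_interval

    def send(self, src_rank, payload):
        with self._lock:
            self._messages.append((src_rank, payload))   # store the message

    def recv(self, timeout=5.0):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._messages:
                    return self._messages.popleft()      # retrieve the oldest message
            time.sleep(self._poll_interval)              # sleep instead of spinning
        raise TimeoutError("no message arrived before the timeout")
```

The poll interval is the knob behind the overhead trade-off: a shorter sleep lowers message latency but spends more time checking an empty buffer.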
System, methods and apparatus for program optimization for multi-threaded processor architectures

DOEpatents

Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E

2015-01-06

Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
UPC++ Programmer’s Guide (v1.0 2017.9)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bachan, J.; Baden, S.; Bonachea, D.

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
UPC++ Programmer’s Guide, v1.0-2018.3.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bachan, J.; Baden, S.; Bonachea, Dan

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
Intershot Analysis of Flows in DIII-D

NASA Astrophysics Data System (ADS)

Meyer, W. H.; Allen, S. L.; Samuell, C. M.; Howard, J.

2016-10-01

Analysis of the DIII-D flow diagnostic data requires demodulation of interference images, and inversion of the resultant line integrated emissivity and flow (phase) images. Four response matrices are pre-calculated: the emissivity line integral and the line integral of the scalar product of the lines of sight with the orthogonal unit vectors of parallel flow. Equilibrium data determines the relative weight of the component matrices used in the final flow inversion matrix. Serial processing has been used for the lower divertor viewing flow camera 800x600 pixel image. The full cross section viewing camera will require parallel processing of the 2160x2560 pixel image. We will discuss using a Posix thread pool and a Tesla K40c GPU in the processing of this data. Prepared by LLNL under Contract DE-AC52-07NA27344. This material is based upon work supported by the U.S. DOE, Office of Science, Fusion Energy Sciences.
LAMMPS strong scaling performance optimization on Blue Gene/Q

DOE Office of Scientific and Technical Information (OSTI.GOV)

Coffman, Paul; Jiang, Wei; Romero, Nichols A.

2014-11-12

LAMMPS ("Large-scale Atomic/Molecular Massively Parallel Simulator") is an open-source molecular dynamics package from Sandia National Laboratories. Significant performance improvements in strong-scaling and time-to-solution for this application on IBM's Blue Gene/Q have been achieved through computational optimizations of the OpenMP versions of the short-range Lennard-Jones term of the CHARMM force field and the long-range Coulombic interaction implemented with the PPPM (particle-particle-particle mesh) algorithm, enhanced by runtime parameter settings controlling thread utilization. Additionally, MPI communication performance improvements were made to the PPPM calculation by re-engineering the parallel 3D FFT to use MPICH collectives instead of point-to-point. Performance testing was done using an 8.4-million-atom simulation scaling up to 16 racks on the Mira system at Argonne Leadership Computing Facility (ALCF). Speedups resulting from this effort were in some cases over 2x.
Parallel optimization algorithm for drone inspection in the building industry

NASA Astrophysics Data System (ADS)

Walczyński, Maciej; Bożejko, Wojciech; Skorupka, Dariusz

2017-07-01

In this paper we present an approach to the Vehicle Routing Problem with Drones (VRPD) for the case of building inspection from the air. An autonomous inspection process needs to determine the optimal route for the inspection drone. This is especially important because of the very limited flight time of modern multicopters. The method for solving the Traveling Salesman Problem (TSP) described in this paper is based on a Parallel Evolutionary Algorithm (ParEA) with cooperative and independent approaches to communication between threads. This method, first described by Bożejko and Wodecki [1], is based on the observation that if some number of elements occupy the same positions in a number of permutations that are local minima, then those elements will be in the same positions in the optimal solution of the TSP. Numerical experiments were carried out on the BEM computational cluster using the MPI library.
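The observation about shared positions in local minima can be illustrated with a small sketch. It is not the authors' ParEA implementation: the function name `fixed_positions`, the `min_share` threshold, and the sample permutations are all hypothetical, chosen only to show how positions on which the local minima agree might be detected before they are frozen in later search.

```python
from collections import Counter

def fixed_positions(local_minima, min_share=1.0):
    """Return {position: city} for positions where at least `min_share`
    of the given local-minimum permutations hold the same city."""
    if not local_minima:
        return {}
    n = len(local_minima[0])
    fixed = {}
    for pos in range(n):
        # Count which city each local minimum places at this position.
        counts = Counter(perm[pos] for perm in local_minima)
        city, hits = counts.most_common(1)[0]
        if hits / len(local_minima) >= min_share:
            fixed[pos] = city
    return fixed

# Three made-up local minima over cities 0..4 (illustrative data only).
minima = [
    [0, 2, 1, 3, 4],
    [0, 1, 2, 3, 4],
    [0, 2, 1, 4, 3],
]
print(fixed_positions(minima))        # {0: 0}: only position 0 is unanimous
print(fixed_positions(minima, 0.6))   # positions where at least 60% agree
```

Lowering `min_share` freezes more positions and shrinks the remaining search space, at the risk of fixing an element that the optimal tour places elsewhere.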
Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing

NASA Technical Reports Server (NTRS)

Sterling, T. L.; Zima, H. P.

2002-01-01

Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we describe the design of Gilgamesh, a PIM-based massively parallel architecture, and elements of its execution model. Gilgamesh extends existing PIM capabilities by incorporating advanced mechanisms for virtualizing tasks and data and providing adaptive resource management for load balancing and latency tolerance. The Gilgamesh execution model is based on macroservers, a middleware layer which supports object-based runtime management of data and threads allowing explicit and dynamic control of locality and load balancing. The paper concludes with a discussion of related research activities and an outlook to future work.
Generating performance portable geoscientific simulation code with Firedrake (Invited)

NASA Astrophysics Data System (ADS)

Ham, D. A.; Bercea, G.; Cotter, C. J.; Kelly, P. H.; Loriant, N.; Luporini, F.; McRae, A. T.; Mitchell, L.; Rathgeber, F.

2013-12-01

This presentation will demonstrate how a change in simulation programming paradigm can be exploited to deliver sophisticated simulation capability which is far easier to programme than are conventional models, is capable of exploiting different emerging parallel hardware, and is tailored to the specific needs of geoscientific simulation. Geoscientific simulation represents a grand challenge computational task: many of the largest computers in the world are tasked with this field, and the requirements of resolution and complexity of scientists in this field are far from being sated. However, single thread performance has stalled, even sometimes decreased, over the last decade, and has been replaced by ever more parallel systems: both as conventional multicore CPUs and in the emerging world of accelerators. At the same time, the needs of scientists to couple ever-more complex dynamics and parametrisations into their models makes the model development task vastly more complex. The conventional approach of writing code in low level languages such as Fortran or C/C++ and then hand-coding parallelism for different platforms by adding library calls and directives forces the intermingling of the numerical code with its implementation. This results in an almost impossible set of skill requirements for developers, who must simultaneously be domain science experts, numericists, software engineers and parallelisation specialists. Even more critically, it requires code to be essentially rewritten for each emerging hardware platform. Since new platforms are emerging constantly, and since code owners do not usually control the procurement of the supercomputers on which they must run, this represents an unsustainable development load. The Firedrake system, conversely, offers the developer the opportunity to write PDE discretisations in the high-level mathematical language UFL from the FEniCS project (http://fenicsproject.org). 
Non-PDE model components, such as parametrisations, can be written as short C kernels operating locally on the underlying mesh, with no explicit parallelism. The executable code is then generated in C, CUDA or OpenCL and executed in parallel on the target architecture. The system also offers features of special relevance to the geosciences. In particular, the large scale separation between the vertical and horizontal directions in many geoscientific processes can be exploited to offer the flexibility of unstructured meshes in the horizontal direction, without the performance penalty usually associated with those methods.
78 FR 76815 - Steel Threaded Rod From India: Preliminary Affirmative Countervailing Duty Determination and...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-12-19

... DEPARTMENT OF COMMERCE International Trade Administration [C-533-856] Steel Threaded Rod From... exporters of steel threaded rod from India. The period of investigation (``POI'') is January 1, 2012... this investigation is steel threaded rod. Steel threaded rod is certain threaded rod, bar, or studs, of...
Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

2017-07-03

The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.
Morphological relationships in the chromospheric H-alpha fine structure

NASA Technical Reports Server (NTRS)

Foukal, P.

1971-01-01

A continuous relationship is proposed between the basic elements of the dark fine structure of the quiet and active chromosphere. A progression from chromospheric bushes to fibrils, then to chromospheric threads and active region filaments, and finally to diffuse quiescent filaments, is described. It is shown that the horizontal component of the field on opposite sides of an active region quiescent filament can be in the same direction and closely parallel to the filament axis. Consequently, it is unnecessary to postulate twisted or otherwise complex field configurations to reconcile the support mechanism of filaments with the observed motion along their axis.
Accelerate quasi Monte Carlo method for solving systems of linear algebraic equations through shared memory

NASA Astrophysics Data System (ADS)

Lai, Siyan; Xu, Ying; Shao, Bo; Guo, Menghan; Lin, Xiaola

2017-04-01

In this paper we study a Monte Carlo method for solving systems of linear algebraic equations (SLAE) based on shared memory. Former research demonstrated that GPUs can effectively speed up the computations for this problem. Our purpose is to optimize the Monte Carlo simulation specifically for the GPU memory architecture. Random numbers are organized for storage in shared memory, which aims to accelerate the parallel algorithm. Bank conflicts can be avoided by our Collaborative Thread Arrays (CTA) scheme. The experimental results show that the shared-memory-based strategy can speed up the computations by more than 3X in the best case.
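The abstract does not spell out the estimator, so the following is only a sketch under stated assumptions: it uses the classic random-walk (Neumann-Ulam style) scheme for a system written as x = Hx + b, runs on the CPU with pseudo-random rather than quasi-random numbers, and ignores shared memory and the CTA scheme entirely. The function name, the walk count, and the walk length are hypothetical, and the absolute row sums of H are assumed to be below 1 so that the truncated series converges.

```python
import random

def mc_solve_component(H, b, i, walks=5000, steps=20, seed=0):
    """Estimate x[i] for x = H x + b by averaging truncated random walks.

    A walk in state s moves to j with probability |H[s][j]| / row_norm[s];
    its weight is multiplied by H[s][j] / p[s][j] = sign(H[s][j]) * row_norm[s].
    """
    rng = random.Random(seed)
    n = len(b)
    row_norm = [sum(abs(v) for v in row) for row in H]
    states = range(n)
    total = 0.0
    for _ in range(walks):
        state, weight, score = i, 1.0, b[i]
        for _ in range(steps):
            if row_norm[state] == 0.0:
                break  # no outgoing transitions: all later terms vanish
            nxt = rng.choices(states, weights=[abs(v) for v in H[state]])[0]
            sign = 1.0 if H[state][nxt] > 0 else -1.0
            weight *= sign * row_norm[state]
            state = nxt
            score += weight * b[state]  # contribution of the next series term
        total += score
    return total / walks

# Small made-up system; the exact solution is x = (1.4666..., 1.6).
H = [[0.1, 0.2], [0.3, 0.1]]
b = [1.0, 1.0]
print([mc_solve_component(H, b, i) for i in range(2)])
```

Each walk is independent of the others, which is what makes the method a natural fit for one-walk-per-thread execution on a GPU.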
Stability comparison between commercially available mini-implants and a novel design: part 1.

PubMed

Hong, Christine; Lee, Haofu; Webster, Richard; Kwak, Jinny; Wu, Benjamin M; Moon, Won

2011-07-01

To compare mechanical stability among five mini-implant designs--a newly invented design and four commercially available designs that vary by shape and threading; to calculate external surface area of each design using high-resolution micro-computed tomography; and to evaluate the relationship between surface area and stability results. The four commercially available mini-implants--single-threaded and cylindrical (SC), single-threaded and tapered (ST), double-threaded and cylindrical (DC), double-threaded and tapered (DT)--and a new implant that is designed to engage mostly in cortical bone with shorter and wider dimensions (N1) were inserted in simulated bone with cortical and trabecular bone layers. The mechanical study consisted of torque measurements and lateral displacement tests. External surface area was computed using a 25-µm micro-CT. Maximum insertion torque, maximum removal torque, and force levels for displacements were the highest in N1, followed by DT, ST, DC, and SC (α = .05). The surface area was largest in DT, followed by N1, ST, DC, and SC. Surface area engaged in cortical bone, however, was the greatest in N1. The surface area of mini-implants had positive correlation with stability. Among commercial designs, both added tapering and double threading improved stability. N1 was the most stable design within this research design. The new design has the potential to be clinically superior; it has enhanced stability and there is diminished risk of endangering nearby anatomic structures during placement and orthodontic treatment, but the design requires refinements to reduce insertion torque to avoid clinical difficulty and patient discomfort.
Online discussion groups for bulimia nervosa: an inductive approach to Internet-based communication between patients.

PubMed

Wesemann, Dorette; Grunwald, Martin

2008-09-01

Online discussion forums are often used by people with eating disorders. This study analyses 2,072 threads containing a total of 14,903 postings from an unmoderated German "prorecovery" forum for persons suffering from bulimia nervosa (www.ab-server.de) during the period from October 2004 to May 2006. The threads were inductively analyzed for underlying structural types, and the various types found were then analyzed for differences in temporal and quantitative parameters. Communication in the online discussion forum occurred in three types of thread: (1) problem-oriented threads (78.8% of threads), (2) communication-oriented threads (15.3% of threads), and (3) metacommunication threads (2.6% of threads). Metacommunication threads contained significantly more postings than problem-oriented and communication-oriented threads, and they were viewed significantly more often. Moreover, there are temporal differences between the structural types. Topics relating to active management of the disorder receive great attention in prorecovery forums. (c) 2008 by Wiley Periodicals, Inc.
Two years' outcome of thread lifting with absorbable barbed PDO threads: Innovative score for objective and subjective assessment.

PubMed

Ali, Yasser Helmy

2018-02-01

Thread-lifting rejuvenation procedures have evolved again, with the development of absorbable threads. Although they have gained popularity among plastic surgeons and dermatologists, very few articles have been written in the literature about absorbable threads. This study aims to evaluate the two-year outcome of thread lifting using absorbable barbed threads for facial rejuvenation. It is a prospective comparative study with both objective and subjective assessment and follow-up for 24 months. Thread lifting for face rejuvenation has significant long-lasting effects that include skin lifting of 3-10 mm and a high degree of patient satisfaction, with a low incidence of complications, about 4.8%. Augmented results are obtained when thread lifting is combined with other lifting and rejuvenation modalities. Significant facial rejuvenation is achieved by thread lifting, and highly augmented results are observed when it is combined with Botox, fillers, and/or platelet rich plasma (PRP) rejuvenation.
Thread gauge for tapered threads

DOEpatents

Brewster, Albert L.

1994-01-11

The thread gauge permits the user to determine the pitch diameter of tapered threads at the intersection of the pitch cone and the end face of the object being measured. A pair of opposed anvils having lines of threads which match the configuration and taper of the threads on the part being measured are brought into meshing engagement with the threads on opposite sides of the part. The anvils are located linearly into their proper positions by stop fingers on the anvils that are brought into abutting engagement with the end face of the part. This places predetermined reference points of the pitch cone of the thread anvils in registration with corresponding points on the end face of the part being measured, resulting in an accurate determination of the pitch diameter at that location. The thread anvils can be arranged for measuring either internal or external threads.
Thread gauge for tapered threads

DOEpatents

Brewster, A.L.

1994-01-11

The thread gauge permits the user to determine the pitch diameter of tapered threads at the intersection of the pitch cone and the end face of the object being measured. A pair of opposed anvils having lines of threads which match the configuration and taper of the threads on the part being measured are brought into meshing engagement with the threads on opposite sides of the part. The anvils are located linearly into their proper positions by stop fingers on the anvils that are brought into abutting engagement with the end face of the part. This places predetermined reference points of the pitch cone of the thread anvils in registration with corresponding points on the end face of the part being measured, resulting in an accurate determination of the pitch diameter at that location. The thread anvils can be arranged for measuring either internal or external threads. 13 figures.
CNT coated thread micro-electro-mechanical system for finger proprioception sensing

NASA Astrophysics Data System (ADS)

Shafi, A. A.; Wicaksono, D. H. B.

2017-04-01

In this paper, we aim to fabricate a cotton-thread-based sensor for proprioceptive applications. Cotton threads are utilized as the structural component of flexible sensors. The thread is coated with a multi-walled carbon nanotube (MWCNT) dispersion using a facile conventional dipping-drying method. Electrical characterization found that the resistance per meter of the coated thread decreased as the number of dipping cycles increased. The CNT-coated thread sensor works on the piezoresistive principle, in which the resistance of the coated thread changes when force is applied. The thread sensor is sewn onto a glove at the index finger, between the middle and proximal phalanges, and the resistance change is measured during grasping. The thread-based microelectromechanical system (MEMS) enables the flexible sensor to fit easily on the finger joint and gives a reliable response for proprioceptive sensing.

Design of internal screw thread measuring device based on the Three-Line method principle

NASA Astrophysics Data System (ADS)

Hu, Dachao; Chen, Jianguo

2010-08-01

In accordance with the Three-Line principle, this paper analyzes the correlation between the main parameters of an internal screw thread and then designs a device to measure them. Internal thread parameters, such as the pitch diameter, thread angle and screw pitch of common screw threads, terraced screw threads and zigzag screw threads, were obtained through calculation and measurement. Practical applications have shown that the device is convenient to use and that the measurements are highly accurate. Meanwhile, the application for the patent of invention has been accepted by the Patent Office (Filing number: 200710044081.5).
A New Operating System for Security Tagged Architecture Hardware in Support of Multiple Independent Levels of Security (MILS) Compliant System

DTIC Science & Technology

2014-04-01

important data structures of RTEMS are introduced. Section 3.2.2 discusses the problems we found in RTEMS that may cause security vulnerabilities...the important data structures in RTEMS: Object, which is a critical data structure in the SCORE, tasks threads. Approved for Public Release...these important system codes. The example code shows a possibility that a user can delete a system thread. Therefore, in order to protect system
Application configuration selection for energy-efficient execution on multicore systems

DOE PAGES

Wang, Shinan; Luo, Bing; Shi, Weisong; ...

2015-09-21

Balanced performance and energy consumption are incorporated in the design of modern computer systems. Several runtime factors, such as concurrency levels, thread mapping strategies, and dynamic voltage and frequency scaling (DVFS) should be considered in order to achieve optimal energy efficiency for a workload. Selecting appropriate run-time factors, however, is one of the most challenging tasks because the run-time factors are architecture-specific and workload-specific. And while most existing works concentrate on either static analysis of the workload or run-time prediction results, we present a hybrid two-step method that utilizes concurrency levels and DVFS settings to achieve the energy efficiency configuration for a workload. The experimental results based on a Xeon E5620 server with NPB and PARSEC benchmark suites show that the model is able to predict the energy efficient configuration accurately. On average, an additional 10% EDP (Energy Delay Product) saving is obtained by using run-time DVFS for the entire system. An off-line optimal solution is used to compare with the proposed scheme. Finally, the experimental results show that the average extra EDP saved by the optimal solution is within 5% on selective parallel benchmarks.
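The EDP metric used above is the product of the energy consumed and the execution time (delay), so a configuration can be ranked once both have been measured. The sketch below shows only that ranking step as an exhaustive offline comparison; it is not the hybrid two-step predictor of the paper. The `Run` record, the `best_by_edp` helper, and all numbers are hypothetical placeholders, not measurements from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    threads: int       # concurrency level
    freq_ghz: float    # DVFS frequency setting
    seconds: float     # measured execution time (delay)
    joules: float      # measured energy

    @property
    def edp(self) -> float:
        # Energy Delay Product: lower is better.
        return self.joules * self.seconds

def best_by_edp(runs):
    """Pick the configuration with the smallest energy-delay product."""
    return min(runs, key=lambda r: r.edp)

runs = [  # made-up placeholder measurements
    Run(threads=4, freq_ghz=2.4, seconds=10.0, joules=900.0),
    Run(threads=8, freq_ghz=2.4, seconds=6.0, joules=960.0),
    Run(threads=8, freq_ghz=1.6, seconds=8.0, joules=800.0),
]
best = best_by_edp(runs)
print(best.threads, best.freq_ghz, best.edp)
```

In this made-up data the lowest-energy run is not the EDP winner, because EDP also penalizes the longer execution time.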
ACTS: from ATLAS software towards a common track reconstruction software

NASA Astrophysics Data System (ADS)

Gumpert, C.; Salzburger, A.; Kiehn, M.; Hrdinka, J.; Calace, N.; ATLAS Collaboration

2017-10-01

Reconstruction of charged particles’ trajectories is a crucial task for most particle physics experiments. The high instantaneous luminosity achieved at the LHC leads to a high number of proton-proton collisions per bunch crossing, which has put the track reconstruction software of the LHC experiments through a thorough test. Preserving track reconstruction performance under increasingly difficult experimental conditions, while keeping the usage of computational resources at a reasonable level, is an inherent problem for many HEP experiments. Exploiting concurrent algorithms and using multivariate techniques for track identification are the primary strategies to achieve that goal. Starting from current ATLAS software, the ACTS project aims to encapsulate track reconstruction software into a generic, framework- and experiment-independent software package. It provides a set of high-level algorithms and data structures for performing track reconstruction tasks as well as fast track simulation. The software is developed with special emphasis on thread-safety to support parallel execution of the code and data structures are optimised for vectorisation to speed up linear algebra operations. The implementation is agnostic to the details of the detection technologies and magnetic field configuration which makes it applicable to many different experiments.
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images.

PubMed

Du, Xiaogang; Dang, Jianwu; Wang, Yangping; Wang, Song; Lei, Tao

2016-01-01

The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU).
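A lookup table helps here because, on a uniform control-point grid, the cubic B-spline basis values depend only on a voxel's offset within its control-point cell, so they can be computed once and reused. The sketch below shows that idea in one dimension. It is an assumption-laden illustration rather than the paper's GPU code: the function names are hypothetical, the grid spacing is taken to be a whole number of voxels, and the control array is assumed to be padded so that `control[cell]` is the first of the four control points influencing the cell.

```python
def bspline_weights(u):
    """Uniform cubic B-spline basis values B0..B3 at local coordinate u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )

def build_lut(spacing):
    """Precompute the weights for every integer voxel offset inside one cell."""
    return [bspline_weights(k / spacing) for k in range(spacing)]

def displacement_1d(x, control, lut, spacing):
    """Displacement at integer voxel x from 1D control-point displacements."""
    cell, offset = divmod(x, spacing)
    w = lut[offset]  # table lookup replaces four polynomial evaluations
    return sum(w[l] * control[cell + l] for l in range(4))

spacing = 5
lut = build_lut(spacing)
control = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 0.0]  # made-up displacements
print([round(displacement_1d(x, control, lut, spacing), 4) for x in range(10)])
```

In three dimensions the same table is indexed once per axis and the weights are multiplied, which is where most of the saving comes from.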
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K

2010-01-01

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
Parallel protein secondary structure prediction based on neural networks.

PubMed

Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi

2004-01-01

Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers of protein secondary structure prediction are implemented on the Denoeux belief neural network (DBNN) architecture. Hydrophobicity matrix, orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are tested separately as the encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. The new binary classifier for Helix versus not Helix (~H) for DBNN produces a prediction accuracy of 87% when PSSM is used for the input profile. The performance of the DBNN binary classifier is comparable to the other best prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Because training the neural networks is time consuming, Pthread and OpenMP are employed to parallelize DBNN on the hyperthreading-enabled Intel architecture. Speedup for 16 Pthreads is 4.9 and speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. The speedup of both OpenMP and Pthread is superior to that of other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that hyperthreading technology for Intel architecture is efficient for parallel biological algorithms.
An intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces.

PubMed

Ying, Xiang; Xin, Shi-Qing; Sun, Qian; He, Ying

2013-09-01

Poisson disk sampling has excellent spatial and spectral properties, and plays an important role in a variety of visual computing. Although many promising algorithms have been proposed for multidimensional sampling in Euclidean space, very few studies have been reported with regard to the problem of generating Poisson disks on surfaces due to the complicated nature of the surface. This paper presents an intrinsic algorithm for parallel Poisson disk sampling on arbitrary surfaces. In sharp contrast to the conventional parallel approaches, our method neither partitions the given surface into small patches nor uses any spatial data structure to maintain the voids in the sampling domain. Instead, our approach assigns each sample candidate a random and unique priority that is unbiased with regard to the distribution. Hence, multiple threads can process the candidates simultaneously and resolve conflicts by checking the given priority values. Our algorithm guarantees that the generated Poisson disks are uniformly and randomly distributed without bias. It is worth noting that our method is intrinsic and independent of the embedding space. This intrinsic feature allows us to generate Poisson disk patterns on arbitrary surfaces in ℝ^n. To our knowledge, this is the first intrinsic, parallel, and accurate algorithm for surface Poisson disk sampling. Furthermore, by manipulating the spatially varying density function, we can obtain adaptive sampling easily.
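The priority-based conflict resolution described above can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: it works in the flat 2D plane with Euclidean distance instead of geodesic distance on a surface, it uses a brute-force neighbor search, and the function name, round structure, and worker count are assumptions made for the example. In each round every active candidate independently checks whether it holds the best priority among the active candidates it conflicts with, so the checks can be handed to separate threads without locks.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def poisson_disk(candidates, radius, seed=0, workers=4):
    """Select a maximal subset of candidates with pairwise distance >= radius."""
    rng = random.Random(seed)
    # Unique random priority per candidate: a shuffled list of ranks.
    priority = list(range(len(candidates)))
    rng.shuffle(priority)
    r2 = radius * radius

    def conflict(a, b):
        (ax, ay), (bx, by) = candidates[a], candidates[b]
        return (ax - bx) ** 2 + (ay - by) ** 2 < r2

    active = set(range(len(candidates)))
    accepted = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while active:
            snapshot = sorted(active)

            def wins(c):
                # c wins if it outranks every active candidate it conflicts with.
                return all(priority[c] < priority[o]
                           for o in snapshot if o != c and conflict(c, o))

            flags = list(pool.map(wins, snapshot))
            winners = [c for c, ok in zip(snapshot, flags) if ok]
            accepted.extend(winners)
            # Drop the winners and every candidate that falls inside a winner's disk.
            active -= {o for o in snapshot
                       if any(o == w or conflict(o, w) for w in winners)}
    return [candidates[i] for i in accepted]

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]  # made-up candidates
samples = poisson_disk(points, 0.1)
print(len(samples))
```

Two conflicting candidates can never both win a round, because each would need a better priority than the other, and the best-ranked active candidate always wins, so the loop terminates.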
Convergent and parallel evolution in life habit of the scallops (Bivalvia: Pectinidae)

PubMed Central

2011-01-01

Background: We employed a phylogenetic framework to identify patterns of life habit evolution in the marine bivalve family Pectinidae. Specifically, we examined the number of independent origins of each life habit and distinguished between convergent and parallel trajectories of life habit evolution using ancestral state estimation. We also investigated whether ancestral character states influence the frequency or type of evolutionary trajectories. Results: We determined that temporary attachment to substrata by byssal threads is the most likely ancestral condition for the Pectinidae, with subsequent transitions to the five remaining habit types. Nearly all transitions between life habit classes were repeated in our phylogeny and the majority of these transitions were the result of parallel evolution from byssate ancestors. Convergent evolution also occurred within the Pectinidae and produced two additional gliding clades and two recessing lineages. Furthermore, our analysis indicates that byssal attaching gave rise to significantly more of the transitions than any other life habit and that the cementing and nestling classes are only represented as evolutionary outcomes in our phylogeny, never as progenitor states. Conclusions: Collectively, our results illustrate that both convergence and parallelism generated repeated life habit states in the scallops. Bias in the types of habit transitions observed may indicate constraints due to physical or ontogenetic limitations of particular phenotypes. PMID:21672233
3D finite element analysis of changes in stress levels and distributions for an osseointegrated implant after vertical bone loss.

PubMed

Yoon, Kyung-Ho; Kim, Su-Gwan; Lee, Jeong-Hoon; Suh, Seung-Woo

2011-10-01

The effect of stress levels and distributions around the internal nonsubmerged type implants after vertical bone resorption was investigated in this study. An HSII implant was placed in 4 cylindrical alveolar bone models with differing degrees of thread exposures. The load applied to each implant was von Mises stress and principal stress, 250 N in axial direction and 30 degrees lateral pressure. The difference in the load between the bone and the connective portion of the implant was obtained using ANSYS analysis. Bone loss in the cervical area of the implant was more obvious under lateral pressure. When more threads were exposed, bone level decreased and the maximum load applied on the fixture increased. It was concluded that higher bone level has a biomechanical advantage with respect to stress concentration.
Simulation of peri-implant bone healing due to immediate loading in dental implant treatments.

PubMed

Chou, Hsuan-Yu; Müftü, Sinan

2013-03-15

The goal of this work was to investigate the role of immediate loading on the peri-implant bone healing in dental implant treatments. A mechano-regulatory tissue differentiation model that takes into account the stimuli through the solid and the fluid components of the healing tissue, and the diffusion of pluripotent stem cells into the healing callus was used. A two-dimensional axisymmetric model consisting of a dental implant, the healing callus tissue and the host bone tissue was constructed for the finite element analysis. Poroelastic material properties were assigned to the healing callus and the bone tissue. The effects of micro-motion, healing callus size, and implant thread design on the length of the bone-to-implant contact (BIC) and the bone volume (BV) formed in the healing callus were investigated. In general, the analysis predicted formation of a continuous layer of soft tissue along the faces of the implant which are parallel to the loading direction. This was predicted to be correlated with the high levels of distortional strain transferred through the solid component of the stimulus. It was also predicted that the external threads on the implant, redistribute the interfacial load, thus help reduce the high distortional stimulus and also help the cells to differentiate to bone tissue. In addition, the region underneath the implant apex was predicted to experience high fluid stimulus that results in the development of soft tissue. The relationship between the variables considered in this study and the outcome measures, BV and BIC, was found to be highly nonlinear. A three-way analysis of variance (ANOVA) of the results was conducted and it showed that micro-motion presents the largest hindrance to bone formation during healing. Copyright © 2013 Elsevier Ltd. All rights reserved.
Thread angle dependency on flame spread shape over kenaf/polyester combined fabric

NASA Astrophysics Data System (ADS)

Azahari Razali, Mohd; Sapit, Azwan; Nizam Mohammed, Akmal; Nor Anuar Mohamad, Md; Nordin, Normayati; Sadikin, Azmahani; Faisal Hushim, Mohd; Jaat, Norrizam; Khalid, Amir

2017-09-01

Understanding flame spread behavior is crucial to fire safety engineering. Natural fiber exhibits different flame spread behavior from synthetic fiber, and this difference may influence flame spread over combined fabric. Previous research examined flame spread behavior over kenaf/polyester fabric and found that the flame spread shape depends on the thread angle. However, that research did not explain the phenomenon in detail. This study gives a detailed explanation. Results show that the flame spread shape depends on the position of the synthetic thread. For thread angle θ = 0°, the polyester thread breaks as the flame approaches it, and the kenaf thread tends to move in the breaking direction. This behavior produces a ‘V’-shaped flame. However, for thread angle θ = 90°, the polyester thread melts while the kenaf thread decomposes and burns. At this angle, the distance between kenaf threads remains constant as the flame approaches.
The effect of thread pattern upon implant osseointegration.

PubMed

Abuhussein, Heba; Pagni, Giorgio; Rebaudi, Alberto; Wang, Hom-Lay

2010-02-01

Implant design features such as macro- and micro-design may influence overall implant success. Limited information is currently available. Therefore, the purpose of this paper is to examine how factors such as thread pitch, thread geometry, helix angle, thread depth and width, as well as the implant crestal module, may affect implant stability. A literature search was conducted using MEDLINE to identify studies, from simulated laboratory models and animal studies to human studies, related to this topic using the keywords implant thread, implant macrodesign, thread pitch, thread geometry, helix angle, thread depth, thread width and implant crestal module. The results showed how thread geometry affects the distribution of stress forces around the implant. A decreased thread pitch may positively influence implant stability. Excessive helix angles, in spite of a faster insertion, may jeopardize the ability of implants to sustain axial load. Deeper threads seem to have an important effect on the stabilization in poorer bone quality situations. The addition of threads or microthreads up to the crestal module of an implant might provide a potential positive contribution on bone-to-implant contact as well as on the preservation of marginal bone; nonetheless this remains to be determined. Appraising the current literature on this subject and combining existing data to verify the presence of any association between the selected characteristics may be critical in the achievement of overall implant success.
Method for molding threads in graphite panels

DOEpatents

Short, W.W.; Spencer, C.

1994-11-29

A graphite panel with a hole having a damaged thread is repaired by drilling the hole to remove all of the thread and making a new hole of larger diameter. A bolt with a lubricated thread is placed in the new hole and the hole is packed with graphite cement to fill the hole and the thread on the bolt. The graphite cement is cured, and the bolt is unscrewed therefrom to leave a thread in the cement which is at least as strong as that of the original thread. 8 figures.
The measure method of internal screw thread and the measure device design

NASA Astrophysics Data System (ADS)

Hu, Dachao; Chen, Jianguo

2008-12-01

In accordance with the Three-Line principle, this paper analyzes the correlation among the main parameters of internal screw threads and then designs a device to measure them. Based on the measured values and the corresponding formulas, the internal thread parameters can be obtained, such as the pitch diameter, thread angle and screw pitch of common screw threads, terraced screw threads, zigzag screw threads and others. Practical application has shown that the device is convenient to operate and that the measured data are highly accurate. The invention patent application for this device has been accepted by the Patent Office (filing number: 200710044081.5).
Insertion tube methods and apparatus

DOEpatents

Casper, William L.; Clark, Don T.; Grover, Blair K.; Mathewson, Rodney O.; Seymour, Craig A.

2007-02-20

A drill string comprises a first drill string member having a male end; and a second drill string member having a female end configured to be joined to the male end of the first drill string member, the male end having a threaded portion including generally square threads, the male end having a non-threaded extension portion coaxial with the threaded portion, and the male end further having a bearing surface, the female end having a female threaded portion having corresponding female threads, the female end having a non-threaded extension portion coaxial with the female threaded portion, and the female end having a bearing surface. Installation methods, including methods of installing instrumented probes are also provided.
Subsurface drill string

DOEpatents

Casper, William L [Rigby, ID; Clark, Don T [Idaho Falls, ID; Grover, Blair K [Idaho Falls, ID; Mathewson, Rodney O [Idaho Falls, ID; Seymour, Craig A [Idaho Falls, ID

2008-10-07

A drill string comprises a first drill string member having a male end; and a second drill string member having a female end configured to be joined to the male end of the first drill string member, the male end having a threaded portion including generally square threads, the male end having a non-threaded extension portion coaxial with the threaded portion, and the male end further having a bearing surface, the female end having a female threaded portion having corresponding female threads, the female end having a non-threaded extension portion coaxial with the female threaded portion, and the female end having a bearing surface. Installation methods, including methods of installing instrumented probes are also provided.
Revised Extended Grid Library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Martz, Roger L.

The Revised Eolus Grid Library (REGL) is a mesh-tracking library that was developed for use with the MCNP6™ computer code so that (radiation) particles can track on an unstructured mesh. The unstructured mesh is a finite element representation of any geometric solid model created with a state-of-the-art CAE/CAD tool. The mesh-tracking library is written using modern Fortran and programming standards; the library is Fortran 2003 compliant. The library was created with a defined application programmer interface (API) so that it could easily integrate with other particle tracking/transport codes. The library does not handle parallel processing via the message passing interface (mpi), but has been used successfully where the host code handles the mpi calls. The library is thread-safe and supports the OpenMP paradigm. As a library, all features are available through the API and overall a tight coupling between it and the host code is required. Features of the library are summarized with the following list: can accommodate first and second order 4, 5, and 6-sided polyhedra; any combination of element types may appear in a single geometry model; parts may not contain tetrahedra mixed with other element types; pentahedra and hexahedra can be together in the same part; robust handling of overlaps and gaps; tracks element-to-element to produce path length results at the element level; finds element numbers for a given mesh location; finds intersection points on element faces for the particle tracks; produces a data file for post processing results analysis; reads Abaqus .inp input (ASCII) files to obtain information for the global mesh-model; supports parallel input processing via mpi; and supports parallel particle transport by both mpi and OpenMP.
Specialized Computer Systems for Environment Visualization

NASA Astrophysics Data System (ADS)

Al-Oraiqat, Anas M.; Bashkov, Evgeniy A.; Zori, Sergii A.

2018-06-01

The need for real-time image generation of landscapes arises in various fields as part of tasks solved by virtual and augmented reality systems, as well as geographic information systems. Such systems provide opportunities for collecting, storing, analyzing and graphically visualizing geographic data. Algorithmic and hardware-software tools for increasing the realism and efficiency of environment visualization in 3D visualization systems are proposed. This paper discusses a modified path tracing algorithm with a two-level hierarchy of bounding volumes and intersection tests against Axis-Aligned Bounding Boxes. The proposed algorithm eliminates branching and hence is more suitable for implementation on multi-threaded CPUs and GPUs. A modified ROAM algorithm is used for high-quality visualization of reliefs and landscapes. The algorithm is implemented on parallel systems: clusters and Compute Unified Device Architecture (CUDA) networks. Results show that the implementation on MPI clusters is more efficient than on Graphics Processing Unit/Graphics Processing Clusters and allows real-time synthesis. The organization and algorithms of a parallel GPU system for 3D pseudo stereo image/video synthesis are proposed. After analyzing the feasibility of implementing each stage on a parallel GPU architecture, 3D pseudo stereo synthesis is performed. An experimental prototype of a specialized hardware-software system for 3D pseudo stereo imaging and video was developed on the CPU/GPU. The experimental results show that the proposed adaptation of 3D pseudo stereo imaging to the architecture of GPU systems is efficient. It also accelerates the computational procedures of 3D pseudo stereo synthesis for the anaglyph and anamorphic formats of the 3D stereo frame without performing optimization procedures. The acceleration is on average 11 and 54 times for the test GPUs.
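The branch-free bounding-box test mentioned in the abstract above can be illustrated with the standard "slab" method. This is a minimal sketch of that general technique, not the authors' modified algorithm; the function name `ray_hits_aabb` and the convention of passing precomputed reciprocal directions are assumptions made for illustration.

```python
def ray_hits_aabb(origin, inv_dir, box_min, box_max):
    """Slab test: True if the ray origin + t*dir (t >= 0) meets the box.

    inv_dir holds 1/dir per axis. Assumption of this sketch: every
    direction component is nonzero, so the reciprocals are finite.
    """
    t_near, t_far = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t1 = (lo - o) * inv
        t2 = (hi - o) * inv
        # min/max select entry and exit distances without branching on
        # the sign of the direction component.
        t_near = max(t_near, min(t1, t2))
        t_far = min(t_far, max(t1, t2))
    return t_near <= t_far


hit = ray_hits_aabb((0.0, 0.0, 0.0), (1.0, 1.0, 1.0),
                    (1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
miss = ray_hits_aabb((0.0, 0.0, 0.0), (1.0, 1.0, 1.0),
                     (1.0, -2.0, 1.0), (2.0, -1.0, 2.0))
```

Because every ray executes the same min/max sequence, threads that process neighboring rays tend to follow the same instruction path, which is the property that favors GPU execution.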
Parallel Task Management Library for MARTe

NASA Astrophysics Data System (ADS)

Valcarcel, Daniel F.; Alves, Diogo; Neto, Andre; Reux, Cedric; Carvalho, Bernardo B.; Felton, Robert; Lomas, Peter J.; Sousa, Jorge; Zabeo, Luca

2014-06-01

The Multithreaded Application Real-Time executor (MARTe) is a real-time framework with increasing popularity and support in the thermonuclear fusion community. It allows modular code to run in a multi-threaded environment leveraging on the current multi-core processor (CPU) technology. One application that relies on the MARTe framework is the Joint European Torus (JET) tokamak WAll Load Limiter System (WALLS). It calculates and monitors the temperature on metal tiles and plasma facing components (PFCs) that can melt or flake if their temperature gets too high when exposed to power loads. One of the main time consuming tasks in WALLS is the calculation of thermal diffusion models in real-time. These models tend to be described by very large state-space models thus making them perfect candidates for parallelisation. MARTe's traditional approach for task parallelisation is to split the problem into several Real-Time Threads, each responsible for a self-contained sequential execution of an input-to-output chain. This is usually possible, but it might not always be practical for algorithmic or technical reasons. Also, it might not be easily scalable with an increase in the number of available CPU cores. The WorkLibrary introduces a “GPU-like approach” of splitting work among the available cores of modern CPUs that is (i) straightforward to use in an application, (ii) scalable with the availability of cores and all of this (iii) without rewriting or recompiling the source code. The first part of this article explains the motivation behind the library, its architecture and implementation. The second part presents a real application for WALLS, a parallel version of a large state-space model describing the 2D thermal diffusion on a JET tile.
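The idea of splitting one state-space update among cores can be sketched as a row-block decomposition of x_next = A·x + B·u. The code below is an illustration only: it does not use the MARTe or WorkLibrary API, and the names `row_block` and `step_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def row_block(A, x, B, u, rows):
    """Compute the entries of A*x + B*u for the row indices in `rows`."""
    return [
        sum(a * xj for a, xj in zip(A[r], x))
        + sum(b * uj for b, uj in zip(B[r], u))
        for r in rows
    ]


def step_parallel(A, x, B, u, workers=2):
    """One state-space step with the rows split into contiguous blocks."""
    n = len(A)
    bounds = [(w * n // workers, (w + 1) * n // workers)
              for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker owns a disjoint block of output rows, so no two
        # workers write the same entry of the result.
        parts = list(pool.map(
            lambda b: row_block(A, x, B, u, range(*b)), bounds))
    return [value for part in parts for value in part]


x_next = step_parallel([[1, 0], [0, 2]], [1, 1], [[1], [0]], [3])
```

In CPython the global interpreter lock prevents a real speedup for pure-Python arithmetic, so this shows only the decomposition; a compiled implementation would run the blocks concurrently.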

Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

NASA Astrophysics Data System (ADS)

Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

2016-12-01

The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with an Hg transport module to investigate mercury pollution. In this study, we present our work on porting the GNAQPMS model to the Intel Xeon Phi processor, Knights Landing (KNL), to accelerate the model. KNL is the second-generation product adopting the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has more new hardware features and can be used as a standalone processor as well as a coprocessor alongside another CPU. Using the VTune tool, the high-overhead modules in the GNAQPMS model were identified, including the CBMZ gas chemistry, the advection and convection module, and the wet deposition module. These high-overhead modules were accelerated by optimizing the code and using new techniques of KNL. The following optimization measures were applied: 1) changing the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP; 2) vectorizing the code to use the 512-bit wide vector computation unit; 3) reducing unnecessary memory access and calculation; 4) reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ; 5) changing the way of global communication from file writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased on both the CPU and KNL platforms. The single-node test showed that the optimized version has a 2.6x speedup on the two-socket CPU platform and a 3.3x speedup on the one-socket KNL platform compared with the baseline version of the code, which means the KNL has a 1.29x speedup when compared with the two-socket CPU platform.
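The relative KNL-versus-CPU figure can be checked from the two reported speedups. The sketch below assumes both speedups are measured against the same baseline run time; under that assumption the rounded inputs give about 1.27x, so the 1.29x quoted in the abstract presumably comes from unrounded timings. The helper name `relative_speedup` is hypothetical.

```python
def relative_speedup(speedup_a, speedup_b):
    """Speedup of platform A over platform B, given both against one baseline."""
    return speedup_a / speedup_b


# Reported values: 3.3x on one-socket KNL, 2.6x on two-socket CPU.
knl_vs_cpu = relative_speedup(3.3, 2.6)
print(f"KNL vs. CPU: {knl_vs_cpu:.2f}x")
```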
A feasibility study on porting the community land model onto accelerators using OpenACC

DOE PAGES

Wang, Dali; Wu, Wei; Winkler, Frank; ...

2014-01-01

As environmental models (such as the Accelerated Climate Model for Energy (ACME), the Parallel Reactive Flow and Transport Model (PFLOTRAN), the Arctic Terrestrial Simulator (ATS), etc.) become more and more complicated, we face enormous challenges in porting those applications onto hybrid computing architectures. OpenACC appears to be a very promising technology; therefore, we have conducted a feasibility analysis on porting the Community Land Model (CLM), a terrestrial ecosystem model within the Community Earth System Models (CESM). Specifically, we used an automatic function testing platform to extract a small computing kernel out of CLM, then applied this kernel in the actual CLM dataflow procedure, and investigated the strategy of data parallelization and the benefit of data movement provided by the current implementation of OpenACC. Even though it is a non-intensive kernel, on a single 16-core computing node, the performance (based on the actual computation time using one GPU) of the OpenACC implementation is 2.3 times faster than that of the OpenMP implementation using a single OpenMP thread, but it is 2.8 times slower than the performance of the OpenMP implementation using 16 threads. On multiple nodes, the MPI_OpenACC implementation demonstrated very good scalability on up to 128 GPUs on 128 computing nodes. This study also provides useful information for us to look into the potential benefits of the “deep copy” capability and the “routine” feature of the OpenACC standards. In conclusion, we believe that our experience with the environmental model, CLM, can be beneficial to many other scientific research programs that are interested in porting their large-scale scientific code using OpenACC onto high-end computers empowered by hybrid computing architectures.
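The two ratios in the abstract above imply an OpenMP scaling figure that is not stated directly. As a rough sketch, assuming both ratios refer to the same kernel and the same timing method, the implied 16-thread OpenMP speedup over one thread is the product of the two; the variable names are illustrative.

```python
ACC_VS_OMP1 = 2.3    # OpenACC is 2.3x faster than 1-thread OpenMP (reported)
OMP16_VS_ACC = 2.8   # 16-thread OpenMP is 2.8x faster than OpenACC (reported)
THREADS = 16

# T_omp1 / T_omp16 = (T_omp1 / T_acc) * (T_acc / T_omp16)
omp16_speedup = ACC_VS_OMP1 * OMP16_VS_ACC
omp16_efficiency = omp16_speedup / THREADS

print(f"implied OpenMP speedup on 16 threads: {omp16_speedup:.2f}x")
print(f"implied parallel efficiency: {omp16_efficiency:.2f}")
```

Under these assumptions the kernel reaches roughly 6.4x on 16 threads, an efficiency of about 0.40, which is consistent with the abstract's description of it as non-intensive.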
Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology

NASA Astrophysics Data System (ADS)

Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.

2009-04-01

Currently, the basic problem of tsunami modeling is the low speed of calculations, which is unacceptable for operational warning services. Existing algorithms for numerical modeling of the hydrodynamic processes of tsunami waves were developed without taking advantage of modern computer facilities. Considerable acceleration of the calculations is possible by using parallel algorithms. We discuss here a new approach to parallelizing tsunami modeling code using OpenMP technology (for multiprocessing systems with shared memory). Nowadays, multiprocessing systems are easily accessible to everyone. The cost of using such systems is much lower than the cost of clusters. This also allows programmers to apply multithreading algorithms on researchers' desktop computers. Another important advantage of this approach is the shared-memory mechanism: there is no need to send data over slow networks (for example, Ethernet). All memory is common to all computing processes, which results in almost linear scalability of the program and processes. In the new version of NAMI DANCE, OpenMP technology and a multi-threading algorithm provide an 80% gain in speed in comparison with the one-thread version on a dual-processor unit. The speed increased further, and a 320% gain was attained, on a four-core processor unit of PCs. Thus, it was possible to considerably reduce the calculation time on scientific workstations (desktops) without a complete change of the program and user interfaces. Further modernization of the algorithms for preparing initial data and processing results using OpenMP looks reasonable. The final version of NAMI DANCE with the increased computational speed can be used not only for research purposes but also in real-time tsunami warning systems.
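The reported gains can be converted to speedup and parallel efficiency to see how close they are to linear scaling. This is a sketch under an assumed reading of "gain" as the percentage increase in speed (speedup = 1 + gain/100); the abstract does not define the term, and the helper names are hypothetical.

```python
def gain_to_speedup(gain_percent):
    """Assumed reading: an 80% gain means the code runs 1.8x as fast."""
    return 1.0 + gain_percent / 100.0


def efficiency(speedup, cores):
    """Parallel efficiency: 1.0 corresponds to ideal linear scaling."""
    return speedup / cores


dual = efficiency(gain_to_speedup(80), 2)            # two processors
quad = efficiency(gain_to_speedup(320), 4)           # four cores, gain reading
quad_if_speedup_meant = efficiency(3.2, 4)           # if 320% means 3.2x

print(dual, quad, quad_if_speedup_meant)
```

Read as a gain, the four-core figure gives an efficiency slightly above 1.0 (superlinear), while reading 320% as a 3.2x speedup gives 0.8; either reading is compatible with the "almost linear scalability" described above.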
Influence of implantoplasty on stress distribution of exposed implants at different bone insertion levels.

PubMed

Tribst, João Paulo Mendes; Dal Piva, Amanda Maria de Oliveira; Shibli, Jamil Awad; Borges, Alexandre Luiz Souto; Tango, Rubens Nisie

2017-12-07

This study evaluated the effect of implantoplasty on different bone insertion levels of exposed implants. A model of the Bone Level Tapered implant (Straumann Institute, Waldenburg, Switzerland) was created through the Rhinoceros software (version 5.0 SR8, McNeel North America, Seattle, WA, USA). The abutment was fixed to the implant through a retention screw and a monolithic crown was modeled over a cementation line. Six models were created with increasing portions of the implant threads exposed: C1 (1 mm), C2 (2 mm), C3 (3 mm), C4 (4 mm), C5 (5 mm) and C6 (6 mm). The models were made in duplicates and one of each pair was used to simulate implantoplasty, by removing the threads (I1, I2, I3, I4, I5 and I6). The final geometry was exported in STEP format to ANSYS (ANSYS 15.0, ANSYS Inc., Houston, USA) and all materials were considered homogeneous, isotropic and linearly elastic. To assess distribution of stress forces, an axial load (300 N) was applied on the cusp. For the periodontal insert, the strains increased in the peri-implant region according to the size of the exposed portion and independent of the threads' presence. The difference between groups with and without implantoplasty was less than 10%. Critical values were found when the inserted portion was smaller than the exposed portion. In the exposed implants, the stress generated on the implant and retention screw was higher in the models that received implantoplasty. For the bone tissue, exposure of the implant's thread was a damaging factor, independent of implantoplasty. Implantoplasty treatment can be safely used to control peri-implantitis if at least half of the implant is still inserted in bone.
Assessing the Effect of Dental Implants Thread Design on Distribution of Stress in Impact Loadings Using Three Dimensional Finite Element Method

PubMed Central

I, Zarei; S, Khajehpour; A, Sabouri; AZ, Haghnegahdar; K, Jafari

2016-01-01

Statement of Problem: Impacts and accidents are considered as the main factors in losing the teeth, so the analysis and design of the implants that they can be more resistant against impacts is very important. One of the important numerical methods having widespread application in various fields of engineering sciences is the finite element method. Among its wide applications, the study of distribution of power in complex structures can be noted. Objectives: The aim of this research was to assess the geometric effect and the type of implant thread on its performance; we also made an attempt to determine the created stress using finite element method. Materials and Methods: In this study, the three dimensional model of bone by using Cone Beam Computerized Tomography (CBCT) of the patient has been provided. The implants in this study are designed by Solid Works software. Loading is simulated in explicit dynamic, by struck of a rigid body with the speed of 1 mm/s to implant vertically and horizontally; and the maximum level of induced stress for the cortical and trabecular bone in the ANSYS Workbench software was calculated. Results: By considering the results of this study, it was identified that, among the designed samples, the maximum imposed stress in the cortical bone layer occurred in the first group (straight threads) and the maximum stress value in the trabecular bone layer and implant occurred in the second group (tapered threads). Conclusions: Due to the limitations of this study, the implants with more depth thread, because of the increased contact surface of the implant with the bone, caused more stability; also, the implant with smaller thread and shorter pitch length caused more stress to the bone. PMID:28959748
Assessing the Effect of Dental Implants Thread Design on Distribution of Stress in Impact Loadings Using Three Dimensional Finite Element Method.

PubMed

I, Zarei; S, Khajehpour; A, Sabouri; Az, Haghnegahdar; K, Jafari

2016-06-01

Impacts and accidents are considered as the main factors in losing the teeth, so the analysis and design of the implants that they can be more resistant against impacts is very important. One of the important numerical methods having widespread application in various fields of engineering sciences is the finite element method. Among its wide applications, the study of distribution of power in complex structures can be noted. The aim of this research was to assess the geometric effect and the type of implant thread on its performance; we also made an attempt to determine the created stress using finite element method. In this study, the three dimensional model of bone by using Cone Beam Computerized Tomography (CBCT) of the patient has been provided. The implants in this study are designed by Solid Works software. Loading is simulated in explicit dynamic, by struck of a rigid body with the speed of 1 mm/s to implant vertically and horizontally; and the maximum level of induced stress for the cortical and trabecular bone in the ANSYS Workbench software was calculated. By considering the results of this study, it was identified that, among the designed samples, the maximum imposed stress in the cortical bone layer occurred in the first group (straight threads) and the maximum stress value in the trabecular bone layer and implant occurred in the second group (tapered threads). Due to the limitations of this study, the implants with more depth thread, because of the increased contact surface of the implant with the bone, caused more stability; also, the implant with smaller thread and shorter pitch length caused more stress to the bone.
Improved Screw-Thread Lock

NASA Technical Reports Server (NTRS)

Macmartin, Malcolm

1995-01-01

Improved screw-thread lock engaged after screw tightened in nut or other mating threaded part. Device does not release contaminating material during tightening of screw. Includes pellet of soft material encased in screw and retained by pin. Hammer blow on pin extrudes pellet into slot, engaging threads in threaded hole or in nut.
Method for molding threads in graphite panels

DOEpatents

Short, William W.; Spencer, Cecil

1994-01-01

A graphite panel (10) with a hole (11) having a damaged thread (12) is repaired by drilling the hole (11) to remove all of the thread and make a new hole (13) of larger diameter. A bolt (14) with a lubricated thread (17) is placed in the new hole (13) and the hole (13) is packed with graphite cement (16) to fill the hole and the thread on the bolt. The graphite cement (16) is cured, and the bolt is unscrewed therefrom to leave a thread (20) in the cement (16) which is at least as strong as that of the original thread (12).
Self-locking threaded fasteners

DOEpatents

Glovan, Ronald J.; Tierney, John C.; McLean, Leroy L.; Johnson, Lawrence L.

1996-01-01

A threaded fastener with a shape memory alloy (SMA) coating on its threads is disclosed. The fastener has special usefulness in high temperature applications where high reliability is important. The SMA coated fastener is threaded into or onto a mating threaded part at room temperature to produce a fastened object. The SMA coating is distorted during the assembly. At elevated temperatures the coating tries to recover its original shape and thereby exerts locking forces on the threads. When the fastened object is returned to room temperature the locking forces dissipate. Consequently the threaded fasteners can be readily disassembled at room temperature but remain securely fastened at high temperatures. A spray technique is disclosed as a particularly useful method of coating the threads of a fastener with a shape memory alloy.
Efficient molecular dynamics simulations with many-body potentials on graphics processing units

NASA Astrophysics Data System (ADS)

Fan, Zheyong; Chen, Wei; Vierimaa, Ville; Harju, Ari

2017-09-01

Graphics processing units have been extensively used to accelerate classical molecular dynamics simulations. However, there is much less progress on the acceleration of force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within different loops, which could result in write conflict between different threads in a CUDA kernel. In this work, we provide a new force evaluation algorithm, which is based on an explicit pairwise force expression for many-body potentials derived recently (Fan et al., 2015). In our algorithm, the force, virial stress, and heat current for a given atom can be accumulated within a single thread and is free of write conflicts. We discuss the formulations and algorithms and evaluate their performance. A new open-source code, GPUMD, is developed based on the proposed formulations. For the Tersoff many-body potential, the double precision performance of GPUMD using a Tesla K40 card is equivalent to that of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics code running with about 100 CPU cores (Intel Xeon CPU X5670 @ 2.93 GHz).
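The write-conflict point in the abstract above can be shown with a gather-style loop, in which the worker for atom i sums its own pairwise contributions and writes only to slot i. The sketch below uses a toy 1D harmonic pair force purely for illustration; it is not the Tersoff potential or the GPUMD kernel, and `pair_force` and `gather_forces` are hypothetical names.

```python
def pair_force(xi, xj):
    """Toy 1D force on atom i due to atom j (harmonic, illustrative only)."""
    return -(xi - xj)


def gather_forces(positions, neighbors):
    """Per-atom accumulation: iteration i writes only forces[i]."""
    forces = [0.0] * len(positions)
    for i, xi in enumerate(positions):   # on a GPU: one thread per atom i
        total = 0.0
        for j in neighbors[i]:
            total += pair_force(xi, positions[j])
        forces[i] = total                # the single write, to slot i only
    return forces


positions = [0.0, 1.0, 3.0]
neighbors = [[1, 2], [0, 2], [0, 1]]
forces = gather_forces(positions, neighbors)
```

A scatter-style loop would instead add each pair term to both forces[i] and forces[j], so two threads could update the same slot at once; with an explicit pairwise force expression, the gather form avoids that at the cost of evaluating each pair from both sides.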
Interface COMSOL-PHREEQC (iCP), an efficient numerical framework for the solution of coupled multiphysics and geochemistry

NASA Astrophysics Data System (ADS)

Nardi, Albert; Idiart, Andrés; Trinchero, Paolo; de Vries, Luis Manuel; Molinero, Jorge

2014-08-01

This paper presents the development, verification and application of an efficient interface, denoted as iCP, which couples two standalone simulation programs: the general purpose Finite Element framework COMSOL Multiphysics® and the geochemical simulator PHREEQC. The main goal of the interface is to maximize the synergies between the aforementioned codes, providing a numerical platform that can efficiently simulate a wide number of multiphysics problems coupled with geochemistry. iCP is written in Java and uses the IPhreeqc C++ dynamic library and the COMSOL Java-API. Given the large computational requirements of the aforementioned coupled models, special emphasis has been placed on numerical robustness and efficiency. To this end, the geochemical reactions are solved in parallel by balancing the computational load over multiple threads. First, a benchmark exercise is used to test the reliability of iCP regarding flow and reactive transport. Then, a large scale thermo-hydro-chemical (THC) problem is solved to show the code capabilities. The results of the verification exercise are successfully compared with those obtained using PHREEQC and the application case demonstrates the scalability of a large scale model, at least up to 32 threads.
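Distributing independent per-cell geochemical solves over a thread pool can be sketched as follows. This is an illustration of the general pattern only (iCP itself is written in Java and calls IPhreeqc); `solve_cell` is a hypothetical stand-in for a real chemistry solve, and `react_all_cells` is likewise an assumed name.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_cell(concentration):
    """Hypothetical stand-in for one cell's geochemical solve."""
    return concentration * 0.5


def react_all_cells(concentrations, n_threads=4):
    """Run the per-cell solves on a pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Cells are queued as tasks: a thread that finishes a cheap cell
        # takes the next one, which evens out uneven per-cell cost.
        return list(pool.map(solve_cell, concentrations))


updated = react_all_cells([2.0, 4.0, 8.0])
```

Queuing one task per cell, rather than assigning fixed ranges of cells to each thread, is one simple way to balance load when some cells need many more solver iterations than others.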
One Approach to the Synthesis, Design and Manufacture of Hyperboloid Gear Sets With Face Mating Gears. Part 1: Basic Theoretical and Cad Experience

NASA Astrophysics Data System (ADS)

Abadjiev, Valentin; Abadjieva, Emilia

2016-06-01

Hyperboloid gear drives with face mating gears are used to transform rotations between shafts with non-parallel and non-intersecting axes. A special case of these transmissions are Spiroid and Helicon gear drives. The classical gear drives of this type are the Archimedean ones. The objects of this study are hyperboloid gear drives with face meshing in which the pinion possesses threads of conic convolute, Archimedean and involute types, or threads of cylindrical convolute, Archimedean and involute types. For simplicity, all three types of transmissions with face mating gears and a conic pinion are termed Spiroid, and all three types of transmissions with face mating gears and a cylindrical pinion are termed Helicon. Principles of the mathematical modelling of tooth contact synthesis are discussed in this study. The presented research shows that the synthesis is realized by application of two mathematical models: pitch contact point and mesh region models. Two approaches for synthesis of the gear drives in accordance with Olivier's principles are illustrated. The algorithms and computer programs for optimization synthesis and design of the studied hyperboloid gear drives are presented.
Method for Estimating Thread Strength Reduction of Damaged Parent Holes with Inserts

NASA Technical Reports Server (NTRS)

Johnson, David L.; Stratton, Troy C.

2005-01-01

During normal assembly and disassembly of bolted-joint components, thread damage and/or deformation may occur. If threads are overloaded, thread damage/deformation can also be anticipated. Typical inspection techniques (e.g. using GO-NO GO gages) may not provide adequate visibility of the extent of thread damage. More detailed inspection techniques have provided actual pitch-diameter profiles of damaged-hardware holes. A method to predict the reduction in thread shear-out capacity of damaged threaded holes has been developed. This method was based on testing and analytical modeling. Test samples were machined to simulate damaged holes in the hardware of interest. Test samples containing pristine parent-holes were also manufactured from the same bar-stock material to provide baseline results for comparison purposes. After the particular parent-hole thread profile was machined into each sample a helical insert was installed into the threaded hole. These samples were tested in a specially designed fixture to determine the maximum load required to shear out the parent threads. It was determined from the pristine-hole samples that, for the specific material tested, each individual thread could resist an average load of 3980 pounds. The shear-out loads of the holes having modified pitch diameters were compared to the ultimate loads of the specimens with pristine holes. An equivalent number of missing helical coil threads was then determined based on the ratio of shear-out loads for each thread configuration. These data were compared with the results from a finite element model (FEM). The model gave insights into the ability of the thread loads to redistribute for both pristine and simulated damage configurations. In this case, it was determined that the overall potential reduction in thread load-carrying capability in the hardware of interest was equal to having up to three fewer threads in the hole that bolt threads could engage. 
One-half of this potential reduction was due to local pitch-diameter variations and the other half was due to overall pitch-diameter enlargement beyond Class 2 fit. This result was important in that the thread shear capacity for this particular hardware design was the limiting structural capability. The details of the method development, including the supporting testing, data reduction and analytical model results comparison will be discussed hereafter.
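The ratio-based comparison described in this abstract can be illustrated with a small sketch. The formula below is an assumption made for illustration (capacity taken as proportional to the number of engaged threads), not the authors' documented method, and the function name and sample loads are hypothetical.

```python
def equivalent_missing_threads(pristine_load_lb: float,
                               damaged_load_lb: float,
                               engaged_threads: float) -> float:
    """Estimate how many threads a damaged hole has effectively lost.

    Assumption (not from the source): shear-out capacity scales linearly
    with the number of engaged threads, so the fractional loss in load
    maps directly onto a fractional loss in thread count.
    """
    if pristine_load_lb <= 0:
        raise ValueError("pristine load must be positive")
    load_ratio = damaged_load_lb / pristine_load_lb
    return engaged_threads * (1.0 - load_ratio)


# Hypothetical numbers: 10 engaged threads at the reported average of
# 3980 lb per thread give a pristine shear-out load of 39800 lb.
PER_THREAD_LOAD_LB = 3980.0
pristine = 10 * PER_THREAD_LOAD_LB
missing = equivalent_missing_threads(pristine, 27860.0, 10)
```

Under the same linearity assumption, dividing the load deficit by the per-thread load gives the same estimate.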
An MPI + $X$ implementation of contact global search using Kokkos

DOE PAGES

Hansen, Glen A.; Xavier, Patrick G.; Mish, Sam P.; ...

2015-10-05

This paper describes an approach that seeks to parallelize the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find “nearest neighbors,” which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search associated with this operation on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI-X) available on advanced computer architectures (those with GPU coprocessors). Our goal is to demonstrate the efficacy of the MPI-X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain in an attempt to efficiently distribute computational work. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute surface entities (nodes and faces) that are needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees and finding and distributing entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of different computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. 
Results here demonstrate a 47% decrease in the time spent within the global search algorithm, comparing the reference ACME algorithm with the GPU implementation, on an 18M face problem using four MPI ranks. As a result, while further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.
A Parallel Vector Machine for the PM Programming Language

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2016-04-01

PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. 
Inter-node communication is accomplished using standard OpenMP and MPI. Performance analyses of the PM vector machine, demonstrating its scaling properties with respect to domain size and the number of processor nodes will be presented for a range of hardware configurations. The PM software and language definition are being made available under unrestrictive MIT and Creative Commons Attribution licenses respectively: www.pm-lang.org.
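The masking scheme described in this abstract can be sketched in a few lines. This is a minimal illustration, not PM's virtual machine: the function names and the sparsity threshold `SPARSE_FRACTION` are hypothetical, and plain Python lists stand in for hardware vectors.

```python
from typing import Callable, List, Sequence

# Hypothetical threshold: switch to a location list when fewer than
# this fraction of cells are active. Not taken from the PM source.
SPARSE_FRACTION = 0.25

Op = Callable[[float], float]


def apply_bool_mask(values: Sequence[float], mask: Sequence[bool], op: Op) -> List[float]:
    """Visit every cell; apply op only where the Boolean mask is True."""
    return [op(v) if m else v for v, m in zip(values, mask)]


def apply_location_list(values: Sequence[float], active: Sequence[int], op: Op) -> List[float]:
    """Visit only the listed active cell indices."""
    out = list(values)
    for i in active:
        out[i] = op(out[i])
    return out


def masked_apply(values: Sequence[float], mask: Sequence[bool], op: Op) -> List[float]:
    """Pick the mask representation according to how sparse the mask is."""
    active = [i for i, m in enumerate(mask) if m]
    if len(active) < SPARSE_FRACTION * len(values):
        return apply_location_list(values, active, op)
    return apply_bool_mask(values, mask, op)


def vector_if_else(values: Sequence[float], cond: Callable[[float], bool],
                   then_op: Op, else_op: Op) -> List[float]:
    """Vectorised if-then-else: compute the mask once, then run each
    branch as a masked operation over the whole vector."""
    mask = [cond(v) for v in values]
    inverse = [not m for m in mask]
    after_then = masked_apply(values, mask, then_op)
    return masked_apply(after_then, inverse, else_op)
```

Both representations give the same result; the location list only avoids touching inactive cells when few cells remain active.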
The research and development of the non-contact detection of the tubing internal thread with a line structured light

NASA Astrophysics Data System (ADS)

Hu, Yuanyuan; Xu, Yingying; Hao, Qun; Hu, Yao

2013-12-01

The tubing internal thread plays an irreplaceable role in petroleum equipment. Unqualified tubing can directly lead to leakage and slippage and bring huge losses for the oil industry. To improve the efficiency and precision of tubing internal thread detection, we develop a new non-contact tubing internal thread measurement system based on the laser triangulation principle. Firstly, considering that the tubing thread has a small diameter and a relatively smooth surface, we built an optical system that uses line structured light to irradiate the internal thread surface and obtain, through a photoelectric sensor, an image containing the internal thread profile information. Secondly, image processing techniques were used to detect the edges of the internal thread in the obtained image. One key method was the sub-pixel technique, which greatly improved the detection accuracy under the same hardware conditions. Finally, we restored the real internal thread contour information on the basis of the laser triangulation method and calculated tubing thread parameters such as the pitch, taper and tooth type angle. In this system, the profile of several thread teeth can be obtained at the same time. Compared with existing scanning methods that use point light and a stepper motor, this system greatly improves the detection efficiency. Experimental results indicate that this system can achieve high-precision, non-contact measurement of the tubing internal thread.
Measurement of Sound Speed in Thread

NASA Astrophysics Data System (ADS)

Saito, Shigemi; Shibata, Yasuhiro; Ichiki, Akira; Miyazaki, Akiho

2006-05-01

By employing thin wires, human hairs and threads, the measurement of sound speed in a thread whose diameter is smaller than 0.2 mm has been attempted. Preparing two cylindrical ceramic transducers with a 300 kHz resonance frequency, a perforated glass bead to be knotted by a sample thread is bonded to the center of the end surface of each transducer. After connecting these transducers with a sample thread, a receiving transducer is attached at a ceiling so as to hang another transmitting transducer with the thread. A glass bead is bonded to another end surface of the transmitting transducer so that tension, varied with a hanged plumb, can be applied to the sample thread. The time delay of the received signal relative to the transmitting pulse is measured while gradually shortening the thread. Sound speed is determined by the proportionality of time delay with thread length. Although the measured values for metallic wires are somewhat different from the values derived from the density and Young’s modulus cited in references, they are reproducible. The sound speed for human hairs of over twenty samples, which varies between 2000 and 2500 m/s, seems to depend on hair quality. Sound speed in a cotton thread is found to approach a constant value under large tension. An advanced measurement system available for uncut threads is also presented, where semi cylindrical transducers pinch the thread.
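The statement that sound speed follows from the proportionality of time delay to thread length can be made concrete with a least-squares line fit. The sketch below assumes a model of the form delay = length / speed + fixed offset; the offset term and the function name are illustrative assumptions, not details given in the abstract.

```python
from typing import Sequence, Tuple


def fit_sound_speed(lengths_m: Sequence[float],
                    delays_s: Sequence[float]) -> Tuple[float, float]:
    """Fit delay = length / speed + offset by ordinary least squares.

    Returns (speed in m/s, offset in s). The offset absorbs any fixed
    delay in the transducers and beads (an assumption of this sketch).
    """
    n = len(lengths_m)
    if n < 2 or n != len(delays_s):
        raise ValueError("need at least two matching (length, delay) pairs")
    mean_l = sum(lengths_m) / n
    mean_t = sum(delays_s) / n
    sxx = sum((l - mean_l) ** 2 for l in lengths_m)
    if sxx == 0:
        raise ValueError("thread lengths must differ")
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(lengths_m, delays_s))
    slope = sxy / sxx            # seconds per metre
    offset = mean_t - slope * mean_l
    return 1.0 / slope, offset
```

Fitting the slope over several lengths, rather than dividing a single length by a single delay, keeps any constant delay in the apparatus out of the speed estimate.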
78 FR 79670 - Steel Threaded Rod From Thailand: Preliminary Determination of Sales at Less Than Fair Value and...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-12-31

... DEPARTMENT OF COMMERCE International Trade Administration [A-549-831] Steel Threaded Rod From... ``Department'') preliminarily determines that steel threaded rod from Thailand is being, or is likely to be... Investigation The merchandise covered by this investigation is steel threaded rod. Steel threaded rod is certain...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.

Code of Federal Regulations, 2012 CFR

2012-10-01

... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.

Code of Federal Regulations, 2014 CFR

2014-10-01

... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...

49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.

Code of Federal Regulations, 2013 CFR

2013-10-01

... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
49 CFR 178.46 - Specification 3AL seamless aluminum cylinders.

Code of Federal Regulations, 2011 CFR

2011-10-01

... circular. (5) All openings must be threaded. Threads must comply with the following: (i) Each thread must be clean cut, even, without checks, and to gauge. (ii) Taper threads, when used, must conform to one of the following: (A) American Standard Pipe Thread (NPT) type, conforming to the requirements of NBS...
AN MHD AVALANCHE IN A MULTI-THREADED CORONAL LOOP

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hood, A. W.; Cargill, P. J.; Tam, K. V.

For the first time, we demonstrate how an MHD avalanche might occur in a multithreaded coronal loop. Considering 23 non-potential magnetic threads within a loop, we use 3D MHD simulations to show that only one thread needs to be unstable in order to start an avalanche even when the others are below marginal stability. This has significant implications for coronal heating in that it provides for energy dissipation with a trigger mechanism. The instability of the unstable thread follows the evolution determined in many earlier investigations. However, once one stable thread is disrupted, it coalesces with a neighboring thread and this process disrupts other nearby threads. Coalescence with these disrupted threads then occurs leading to the disruption of yet more threads as the avalanche develops. Magnetic energy is released in discrete bursts as the surrounding stable threads are disrupted. The volume integrated heating, as a function of time, shows short spikes suggesting that the temporal form of the heating is more like that of nanoflares than of constant heating.
Self-locking threaded fasteners

DOEpatents

Glovan, R.J.; Tierney, J.C.; McLean, L.L.; Johnson, L.L.

1996-01-16

A threaded fastener with a shape memory alloy (SMA) coating on its threads is disclosed. The fastener has special usefulness in high temperature applications where high reliability is important. The SMA coated fastener is threaded into or onto a mating threaded part at room temperature to produce a fastened object. The SMA coating is distorted during the assembly. At elevated temperatures the coating tries to recover its original shape and thereby exerts locking forces on the threads. When the fastened object is returned to room temperature the locking forces dissipate. Consequently the threaded fasteners can be readily disassembled at room temperature but remain securely fastened at high temperatures. A spray technique is disclosed as a particularly useful method of coating the threads of a fastener with a shape memory alloy. 13 figs.
Parallel workflow manager for non-parallel bioinformatic applications to solve large-scale biological problems on a supercomputer.

PubMed

Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas

2016-04-01

Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes due to systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station, however, due to multiple invocations for a large number of subtasks the full task requires a significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods as well as a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software mpiWrapper has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .
PICH and BLM limit histone association with anaphase centromeric DNA threads and promote their resolution

PubMed Central

Ke, Yuwen; Huh, Jae-Wan; Warrington, Ross; Li, Bing; Wu, Nan; Leng, Mei; Zhang, Junmei; Ball, Haydn L; Li, Bing; Yu, Hongtao

2011-01-01

Centromeres nucleate the formation of kinetochores and are vital for chromosome segregation during mitosis. The SNF2 family helicase PICH (Plk1-interacting checkpoint helicase) and the BLM (the Bloom's syndrome protein) helicase decorate ultrafine histone-negative DNA threads that link the segregating sister centromeres during anaphase. The functions of PICH and BLM at these threads are not understood, however. Here, we show that PICH binds to BLM and enables BLM localization to anaphase centromeric threads. PICH- or BLM-RNAi cells fail to resolve these threads in anaphase. The fragmented threads form centromeric-chromatin-containing micronuclei in daughter cells. Anaphase threads in PICH- and BLM-RNAi cells contain histones and centromere markers. Recombinant purified PICH has nucleosome remodelling activities in vitro. We propose that PICH and BLM unravel centromeric chromatin and keep anaphase DNA threads mostly free of nucleosomes, thus allowing these threads to span long distances between rapidly segregating centromeres without breakage and providing a spatiotemporal window for their resolution. PMID:21743438
Understanding thread properties for red blood cell antigen assays: weak ABO blood typing.

PubMed

Nilghaz, Azadeh; Zhang, Liyuan; Li, Miaosi; Ballerini, David R; Shen, Wei

2014-12-24

"Thread-based microfluidics" research has so far focused on utilizing and manipulating the wicking properties of threads to form controllable microfluidic channels. In this study we aim to understand the separation properties of threads, which are important to their microfluidic detection applications for blood analysis. Confocal microscopy was utilized to investigate the effect of the microscale surface morphologies of fibers on the thread's separation efficiency of red blood cells. We demonstrated the remarkably different separation properties of threads made using silk and cotton fibers. Thread separation properties dominate the clarity of blood typing assays of the ABO groups and some of their weak subgroups (Ax and A3). The microfluidic thread-based analytical devices (μTADs) designed in this work were used to accurately type different blood samples, including 89 normal ABO and 6 weak A subgroups. By selecting thread with the right surface morphology, we were able to build μTADs capable of providing rapid and accurate typing of the weak blood groups with high clarity.
Effect of Thread and Rotating Speed on Material Flow Behavior and Mechanical Properties of Friction Stir Lap Welding Joints

NASA Astrophysics Data System (ADS)

Ji, Shude; Li, Zhengwei; Zhou, Zhenlu; Wu, Baosheng

2017-10-01

This study focused on the effects of thread on hook and cold lap formation, lap shear property and impact toughness of alclad 2024-T4 friction stir lap welding (FSLW) joints. Besides the traditional threaded pin tool (TR-tool), three new tools with different thread locations and orientations were designed. Results showed that thread significantly affected hook, cold lap morphologies and lap shear properties. The tool with a tip-threaded pin (T-tool) fabricated a joint with a flat hook and cold lap, which resulted in shear fracture mode. The tool with a bottom-threaded pin (B-tool) eliminated the hook. The tool with a reverse-threaded pin (R-tool) widened the stir zone. When using configuration A, the joints fabricated by the three new tools showed higher failure loads than the joint fabricated by the TR-tool. The joint made with the T-tool had the best impact toughness. This study demonstrated the significance of thread during FSLW and provided a reference to optimize tool geometry.
Parallel algorithm of real-time infrared image restoration based on total variation theory

NASA Astrophysics Data System (ADS)

Zhu, Ran; Li, Miao; Long, Yunli; Zeng, Yaoyuan; An, Wei

2015-10-01

Image restoration is a necessary preprocessing step for infrared remote sensing applications. Traditional methods allow us to remove the noise but penalize too much the gradients corresponding to edges. Image restoration techniques based on variational approaches can solve this over-smoothing problem for the merits of their well-defined mathematical modeling of the restore procedure. The total variation (TV) of infrared image is introduced as a L1 regularization term added to the objective energy functional. It converts the restoration process to an optimization problem of functional involving a fidelity term to the image data plus a regularization term. Infrared image restoration technology with TV-L1 model exploits the remote sensing data obtained sufficiently and preserves information at edges caused by clouds. Numerical implementation algorithm is presented in detail. Analysis indicates that the structure of this algorithm can be easily implemented in parallelization. Therefore a parallel implementation of the TV-L1 filter based on multicore architecture with shared memory is proposed for infrared real-time remote sensing systems. Massive computation of image data is performed in parallel by cooperating threads running simultaneously on multiple cores. Several groups of synthetic infrared image data are used to validate the feasibility and effectiveness of the proposed parallel algorithm. Quantitative analysis of measuring the restored image quality compared to input image is presented. Experiment results show that the TV-L1 filter can restore the varying background image reasonably, and that its performance can achieve the requirement of real-time image processing.
A Verification System for Distributed Objects with Asynchronous Method Calls

NASA Astrophysics Data System (ADS)

Ahrendt, Wolfgang; Dylla, Maximilian

We present a verification system for Creol, an object-oriented modeling language for concurrent distributed applications. The system is an instance of KeY, a framework for object-oriented software verification, which has so far been applied foremost to sequential Java. Building on KeY characteristic concepts, like dynamic logic, sequent calculus, explicit substitutions, and the taclet rule language, the system presented in this paper addresses functional correctness of Creol models featuring local cooperative thread parallelism and global communication via asynchronous method calls. The calculus heavily operates on communication histories which describe the interfaces of Creol units. Two example scenarios demonstrate the usage of the system.
Kalman filter tracking on parallel architectures

NASA Astrophysics Data System (ADS)

Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

2017-10-01

We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
A wavelet approach to binary blackholes with asynchronous multitasking

NASA Astrophysics Data System (ADS)

Lim, Hyun; Hirschmann, Eric; Neilsen, David; Anderson, Matthew; Debuhr, Jackson; Zhang, Bo

2016-03-01

Highly accurate simulations of binary black holes and neutron stars are needed to address a variety of interesting problems in relativistic astrophysics. We present a new method for solving the Einstein equations (BSSN formulation) using iterated interpolating wavelets. Wavelet coefficients provide a direct measure of the local approximation error for the solution and place collocation points that naturally adapt to features of the solution. Further, they exhibit exponential convergence on unevenly spaced collection points. The parallel implementation of the wavelet simulation framework presented here deviates from conventional practice in combining multi-threading with a form of message-driven computation sometimes referred to as asynchronous multitasking.
Testing New Programming Paradigms with NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

2000-01-01

Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new effort has been made in defining new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy; however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. 
Optimization of memory and cache usage was applied to several benchmarks, noticeably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as lack of supporting the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure in next page. These results were obtained on an SGI Origin2000 (195MHz) with MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.
Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2

PubMed Central

Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.

2014-01-01

Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64 bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000² pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
Data Acquisition with GPUs: The DAQ for the Muon $g$-$2$ Experiment at Fermilab

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gohn, W.

Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and for a relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be used to effectively write code to parallelize such tasks on Nvidia GPUs, providing a significant upgrade in performance over CPU based acquisition systems. The muon $g$-$2$ experiment at Fermilab is heavily relying on GPUs to process its data. The data acquisition system for this experiment must have the ability to create deadtime-free records from 700 $\mu$s muon spills at a raw data rate of 18 GB per second. Data will be collected using 1296 channels of $\mu$TCA-based 800 MSPS, 12 bit waveform digitizers and processed in a layered array of networked commodity processors with 24 GPUs working in parallel to perform a fast recording of the muon decays during the spill. The described data acquisition system is currently being constructed, and will be fully operational before the start of the experiment in 2017.
Proceedings: Sisal `93

DOE Office of Scientific and Technical Information (OSTI.GOV)

Feo, J.T.

1993-10-01

This report contains papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernel of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architectures; Compiling technique based on dataflow analysis for functional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor; A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architecture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.
Metal and transuranic records in mussel shells, byssal threads and tissues

NASA Astrophysics Data System (ADS)

Koide, Minoru; Lee, Dong Soo; Goldberg, Edward D.

1982-12-01

Bivalve shells offer several advantages over tissues for the monitoring of heavy metal pollutants in the marine environment. They are easier to handle and to store. The problem of whether to depurate the animals before analyses is avoided. The shells appear to be more sensitive to environmental heavy metals levels over the long term than do the soft parts. Of the substances examined (Cd, Cu, Zn, Pb, Ag, Ni, 238Pu and 239 + 240Pu) only Pb and Pu displayed a strong covariance between soft tissue and shell concentrations. There were strong correlations between metals in the shell but not in the soft tissues in general. The byssal threads, because of their enrichment of transuranic elements and of their ease in handling, may be useful in monitoring these metals. A very weak discharge of 238Pu to marine waters adjacent to a nuclear reactor was detected in the byssal threads of mussels.
DistributedFBA.jl: High-level, high-performance flux balance analysis in Julia

DOE Office of Scientific and Technical Information (OSTI.GOV)

Heirendt, Laurent; Thiele, Ines; Fleming, Ronan M. T.

Flux balance analysis and its variants are widely used methods for predicting steady-state reaction rates in biochemical reaction networks. The exploration of high dimensional networks with such methods is currently hampered by software performance limitations. DistributedFBA.jl is a high-level, high-performance, open-source implementation of flux balance analysis in Julia. It is tailored to solve multiple flux balance analyses on a subset or all the reactions of large and huge-scale networks, on any number of threads or nodes.
DistributedFBA.jl: High-level, high-performance flux balance analysis in Julia

DOE PAGES

Heirendt, Laurent; Thiele, Ines; Fleming, Ronan M. T.

2017-01-16

Flux balance analysis and its variants are widely used methods for predicting steady-state reaction rates in biochemical reaction networks. The exploration of high dimensional networks with such methods is currently hampered by software performance limitations. DistributedFBA.jl is a high-level, high-performance, open-source implementation of flux balance analysis in Julia. It is tailored to solve multiple flux balance analyses on a subset or all the reactions of large and huge-scale networks, on any number of threads or nodes.
Study of a Fine Grained Threaded Framework Design

NASA Astrophysics Data System (ADS)

Jones, C. D.

2012-12-01

Traditionally, HEP experiments exploit the multiple cores in a CPU by having each core process one event. However, future PC designs are expected to use CPUs which double the number of processing cores at the same rate as the cost of memory falls by a factor of two. This effectively means the amount of memory per processing core will remain constant. This is a major challenge for LHC processing frameworks since the LHC is expected to deliver more complex events (e.g. greater pileup events) in the coming years while the LHC experiment's frameworks are already memory constrained. Therefore in the not so distant future we may need to be able to efficiently use multiple cores to process one event. In this presentation we will discuss a design for an HEP processing framework which can allow very fine grained parallelization within one event as well as supporting processing multiple events simultaneously while minimizing the memory footprint of the job. The design is built around the libdispatch framework created by Apple Inc. (a port for Linux is available) whose central concept is the use of task queues. This design also accommodates the reality that not all code will be thread safe and therefore allows one to easily mark modules or sub parts of modules as being thread unsafe. In addition, the design efficiently handles the requirement that events in one run must all be processed before starting to process events from a different run. After explaining the design we will provide measurements from simulating different processing scenarios where the processing times used for the simulation are drawn from processing times measured from actual CMS event processing.
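The task-queue idea in the abstract above can be illustrated with a minimal Python sketch. This is not the libdispatch-based design itself: the names `Module` and `run_events` are hypothetical, a thread pool stands in for a concurrent queue, and a per-module lock stands in for the serial queue that would guard a module marked as thread unsafe.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Module:
    """Hypothetical processing module (name and API are illustrative only)."""

    def __init__(self, name, thread_safe=True):
        self.name = name
        self.thread_safe = thread_safe
        self._lock = threading.Lock()  # stands in for a serial task queue
        self.seen = 0                  # shared state, touched only under the lock

    def process(self, event):
        if self.thread_safe:
            # Safe modules keep no shared state and may run concurrently.
            return (self.name, event)
        with self._lock:
            # Unsafe modules are serialized: one task at a time.
            self.seen += 1
            return (self.name, event)


def run_events(events, modules, workers=4):
    """Submit every (event, module) pair as its own task to a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(m.process, e) for e in events for m in modules]
        return [f.result() for f in futures]
```

Under these assumptions, several events and several modules within one event are in flight at once, while calls into a module created with `thread_safe=False` never overlap.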

Structural Turnbuckle Bears Compressive or Tensile Loads

NASA Technical Reports Server (NTRS)

Bateman, W. A.; Lang, C. H.

1985-01-01

Column length adjuster based on turnbuckle principle. Device consists of internally and externally threaded bushing, threaded housing and threaded rod. Housing attached to one part and threaded rod attached to other part of structure. Turning double threaded bushing contracts or extends rod in relation to housing. Once adjusted, bushing secured with jamnuts. Device used for axially loaded members requiring length adjustment during installation.
Do dual-thread orthodontic mini-implants improve bone/tissue mechanical retention?

PubMed

Lin, Yang-Sung; Chang, Yau-Zen; Yu, Jian-Hong; Lin, Chun-Li

2014-12-01

The aim of this study was to understand whether the pitch relationship between micro and macro thread designs with a parametrical relationship in a dual-thread mini-implant can improve primary stability. Three types of mini-implants were prepared for insertion and pull-out testing: single-thread (ST) (0.75 mm pitch in whole length), dual-thread A (DTA) with double-start 0.375 mm pitch, and dual-thread B (DTB) with single-start 0.2 mm pitch in the upper 2-mm micro thread region. Histomorphometric analysis was performed on these specimens to evaluate peri-implant bone defects using a non-contact vision measuring system. The maximum insertion torque (Tmax) of type DTA was significantly the smallest, but the corresponding values showed no significant difference between ST and DTB. The largest pull-out strength (Fmax) of the DTA mini-implant was significantly greater than that of the ST mini-implant regardless of implant insertion orientation. The mini-implant engaged the cortical bone well in the ST and DTA types. A dual-thread mini-implant with the correct micro thread pitch (parametrical relationship with macro thread pitch) in the cortical bone region can improve primary stability and enhance mechanical retention.
Three-dimensional optimization and sensitivity analysis of dental implant thread parameters using finite element analysis.

PubMed

Geramizadeh, Maryam; Katoozian, Hamidreza; Amid, Reza; Kadkhodazadeh, Mahdi

2018-04-01

This study aimed to optimize the thread depth and pitch of a recently designed dental implant to provide uniform stress distribution by means of a response surface optimization method available in finite element (FE) software. The sensitivity of simulation to different mechanical parameters was also evaluated. A three-dimensional model of a tapered dental implant with micro-threads in the upper area and V-shaped threads in the rest of the body was modeled and analyzed using finite element analysis (FEA). An axial load of 100 N was applied to the top of the implants. The model was optimized for thread depth and pitch to determine the optimal stress distribution. In this analysis, micro-threads had 0.25 to 0.3 mm depth and 0.27 to 0.33 mm pitch, and V-shaped threads had 0.405 to 0.495 mm depth and 0.66 to 0.8 mm pitch. The optimized depth and pitch were 0.307 and 0.286 mm for micro-threads and 0.405 and 0.808 mm for V-shaped threads, respectively. In this design, the most effective parameters on stress distribution were the depth and pitch of the micro-threads based on sensitivity analysis results. Based on the results of this study, the optimal implant design has micro-threads with 0.307 and 0.286 mm depth and pitch, respectively, in the upper area and V-shaped threads with 0.405 and 0.808 mm depth and pitch in the rest of the body. These results indicate that micro-thread parameters have a greater effect on stress and strain values.
The Study of Importance of the Balance Space Food -Storage Method -

NASA Astrophysics Data System (ADS)

Katayama, Naomi; Yamashita, Masamichi; Hashimoto, Hirofumi; Space Agriculture Task Force, J.

Providing food to space crews is an important requirement for supporting long-term manned space exploration. Food fills not only physiological requirements to sustain life, but also psychological needs for refreshment and joy during long and hard missions to extraterrestrial planets. We designed joyful and healthy recipes with materials that can be produced by a bio-regenerative agricultural system operated with the limited resources available in a Mars base, Moon base or spaceship. We also need to think about how to use stored food in times of emergency. The pupa of the silkworm is an important source of nourishment as protein and lipid. Silk thread is used for clothing, cosmetics and medical supplies. However, silk thread can also be used as a protein food. Silk thread is made of sericin and fibroin. Sericin is used mainly for cosmetics, but it can be made into sheet food by mixing it with rice flour. Japanese rolled sushi can be made with this product, as can spring rolls, gyoza and shao-mai. Fibroin, the main component of silk thread, can be extracted with high-pressure heat; the protein can then be powdered and used as food. In this way, even silk thread that has already been made into clothes can be turned into food again. Cotton thread can likewise be reused as carbohydrate, and so can wood. Based upon the foregoing, we use the pupa of the silkworm as protein and lipid, the silk thread as protein, and the cotton thread and wood as carbohydrates. The recommended healthy meal balance is a Protein: Lipid: Carbohydrate ratio equal to 15-20. We succeeded in developing a joyful and nutritious space recipe in the end. Since energy consumption for physical exercise under micro- or sub-gravity is less than in the terrestrial case, the choice of our space foods is essential to suppress blood sugar levels and prevent metabolic syndrome. Because fewer agricultural resources are needed when ecological members are chosen from the lower rungs of the food chain, our space recipe could be a proposal for solving the food problem on Earth.
Deploy Nalu/Kokkos algorithmic infrastructure with performance benchmarking.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Domino, Stefan P.; Ananthan, Shreyas; Knaus, Robert C.

The former Nalu interior heterogeneous algorithm design, which was originally designed to manage matrix assembly operations over all elemental topology types, has been modified to operate over homogeneous collections of mesh entities. This newly templated kernel design allows for removal of workset variable resize operations that were formerly required at each loop over a Sierra ToolKit (STK) bucket (nominally, 512 entities in size). Extensive usage of the Standard Template Library (STL) std::vector has been removed in favor of intrinsic Kokkos memory views. In this milestone effort, the transition to Kokkos as the underlying infrastructure to support performance and portability on many-core architectures has been deployed for key matrix algorithmic kernels. A unit-test driven design effort has developed a homogeneous entity algorithm that employs a team-based thread parallelism construct. The STK Single Instruction Multiple Data (SIMD) infrastructure is used to interleave data for improved vectorization. The collective algorithm design, which allows for concurrent threading and SIMD management, has been deployed for the core low-Mach element-based algorithm. Several tests to ascertain SIMD performance on Intel KNL and Haswell architectures have been carried out. The performance test matrix includes evaluation of both low- and higher-order methods. The higher-order low-Mach methodology builds on polynomial promotion of the core low-order control volume finite element method (CVFEM). Performance testing of the Kokkos-view/SIMD design indicates low-order matrix assembly kernel speed-up ranging between two and four times depending on mesh loading and node count. Better speedups are observed for higher-order meshes (currently only P=2 has been tested) especially on KNL. The increased workload per element on higher-order meshes benefits from the wide SIMD width on KNL machines. Combining multiple threads with SIMD on KNL achieves a 4.6x speedup over the baseline, with assembly timings faster than that observed on Haswell architecture. The computational workload of higher-order meshes, therefore, seems ideally suited for the many-core architecture and justifies further exploration of higher-order on NGP platforms. A Trilinos/Tpetra-based multi-threaded GMRES preconditioned by symmetric Gauss Seidel (SGS) represents the core solver infrastructure for the low-Mach advection/diffusion implicit solves. The threaded solver stack has been tested on small problems on NREL's Peregrine system using the newly developed and deployed Kokkos-view/SIMD kernels. Efforts are underway to deploy the Tpetra-based solver stack on NERSC Cori system to benchmark its performance at scale on KNL machines.
SEM and fractography analysis of screw thread loosening in dental implants.

PubMed

Scarano, A; Quaranta, M; Traini, T; Piattelli, M; Piattelli, A

2007-01-01

Biological and technical failures of implants have already been reported. Mechanical factors are certainly of importance in implant failures, even if their exact nature has not yet been established. The abutment screw fracture or loosening represents a rare, but quite unpleasant failure. The aim of the present research is an analysis and structural examination of screw thread or abutment loosening compared with screw threads or abutment without loosening. The loosening of screw threads was compared to screw threads without loosening for three different implant systems: Branemark (Nobel Biocare, Gothenburg, Sweden), T.B.R. implant systems (Benax, Ancona, Italy) and Restore (Lifecore Biomedical, Chaska, Minnesota, USA). In this study broken screws were excluded. A total of 16 screw thread loosenings were observed (Group I) (4 Branemark, 4 T.B.R and 5 Restore), 10 screw threads without loosening were removed (Group II), and 6 screw threads as received by the manufacturer (unused) (Group III) were used as control (2 Branemark, 2 T.B.R and 2 Restore). The loosened abutment screws were retrieved and analyzed under SEM. Many alterations and deformations were present in concavities and convexities of screw threads in group I. No macroscopic alterations or deformations were observed in groups II and III. A statistical difference in the presence of microcracks was observed between screw threads with an abutment loosening and screw threads without an abutment loosening.
A Moiré Pattern-Based Thread Counter

ERIC Educational Resources Information Center

Reich, Gary

2017-01-01

Thread count is a term used in the textile industry as a measure of how closely woven a fabric is. It is usually defined as the sum of the number of warp threads per inch (or cm) and the number of weft threads per inch. (It is sometimes confusingly described as the number of threads per square inch.) In recent years it has also become a subject of…
Characterisation of defects in p-GaN by admittance spectroscopy

NASA Astrophysics Data System (ADS)

Elsherif, O. S.; Vernon-Parry, K. D.; Evans-Freeman, J. H.; Airey, R. J.; Kappers, M.; Humphreys, C. J.

2012-08-01

Mg-doped GaN films have been grown on (0 0 0 1) sapphire using metal organic vapour phase epitaxy. Use of different buffer layer strategies caused the threading dislocation density (TDD) in the GaN to be either approximately 2×10^9 cm^-2 or 1×10^10 cm^-2. Frequency-dependent capacitance and conductance measurements at temperatures up to 450 K have been used to study the electronic states associated with the Mg doping, and to determine how these are affected by the TDD. Admittance spectroscopy of the films finds a single impurity-related acceptor level with an activation energy of 160±10 meV for [Mg] of about 1×10^19 cm^-3, and 120±10 meV as the Mg precursor flux decreased. This level is thought to be associated with the Mg acceptor state. The TDD has no discernible effect on the trap detected by admittance spectroscopy. We compare these results with cathodoluminescence measurements reported in the literature, which reveal that most threading dislocations are non-radiative recombination centres, and discuss possible reasons why our admittance spectroscopy measurements have not detected electrically active defects associated with threading dislocations.
Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.

PubMed

Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley

2011-05-01

Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
Lack of ubiquitin immunoreactivities at both ends of neuropil threads. Possible bidirectional growth of neuropil threads.

PubMed

Iwatsubo, T; Hasegawa, M; Esaki, Y; Ihara, Y

1992-02-01

Immunocytochemically, neuropil threads (curly fibers) were investigated in the Alzheimer's disease brain using a confocal laser scanning fluorescence microscope by double labeling with tau/ubiquitin antibodies. Ubiquitin immunoreactivities were found to be lacking at one or both ends in more than 40% of tau-positive threads. Immunoelectron microscopy showed that bundles of paired helical filaments, which constitute neuropil threads, were positive for ubiquitin around their midportions, but often negative at their ends. Since it is reasonable to postulate that tau deposition as paired helical filaments precedes ubiquitination, the aforementioned observation suggests that the ends of the threads are newly formed portions, and thus the threads are often growing bidirectionally in small neuronal processes.
Parallel Online Temporal Difference Learning for Motor Control.

PubMed

Caarls, Wouter; Schuitema, Erik

2016-07-01

Temporal difference (TD) learning, a key concept in reinforcement learning, is a popular method for solving simulated control problems. However, in real systems, this method is often avoided in favor of policy search methods because of its long learning time. But policy search suffers from its own drawbacks, such as the necessity of informed policy parameterization and initialization. In this paper, we show that TD learning can work effectively in real robotic systems as well, using parallel model learning and planning. Using locally weighted linear regression and trajectory sampled planning with 14 concurrent threads, we can achieve a speedup of almost two orders of magnitude over regular TD control on simulated control benchmarks. For a real-world pendulum swing-up task and a two-link manipulator movement task, we report a speedup of 20× to 60×, with a real-time learning speed of less than half a minute. The results are competitive with state-of-the-art policy search.
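For readers unfamiliar with the baseline that the abstract above accelerates, the following is a minimal sketch of the basic tabular TD(0) value update. It does not reproduce the authors' parallel model learning or planning; the toy chain environment and the function names `td0_update` and `learn_chain` are assumptions made purely for illustration.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) step: V(s) += alpha * (r + gamma * V(s') - V(s))."""
    target = reward if terminal else reward + gamma * values[next_state]
    td_error = target - values[state]
    values[state] += alpha * td_error
    return td_error


def learn_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9):
    """Toy deterministic chain (illustrative): always step right from state 0;
    the only reward (1.0) is received on leaving the last state."""
    values = [0.0] * n_states
    for _ in range(episodes):
        for s in range(n_states):
            last = s == n_states - 1
            td0_update(values, s, 1.0 if last else 0.0,
                       min(s + 1, n_states - 1), alpha, gamma, terminal=last)
    return values
```

In this toy setting the learned values approach gamma raised to the number of steps remaining before the reward, which is the slow, sample-by-sample propagation that model-based planning is meant to shorten.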
Thermal stability analysis of the fine structure of solar prominences

NASA Technical Reports Server (NTRS)

Demoulin, Pascal; Malherbe, Jean-Marie; Schmieder, Brigitte; Raadu, Mickael A.

1986-01-01

The linear thermal stability of a 2D periodic structure (alternately hot and cold) in a uniform magnetic field is analyzed. The energy equation includes wave heating (assumed proportional to density), radiative cooling and conduction both parallel and orthogonal to magnetic lines. The equilibrium is perturbed at constant gas pressure. With parallel conduction only, it is found to be unstable when the parallel length scale l∥ is greater than 45 Mm. In that case, orthogonal conduction becomes important and stabilizes the structure when the length scale is smaller than 5 km. On the other hand, when the length scale is greater than 5 km, the thermal equilibrium is unstable, and the corresponding time scale is about 10,000 s: this result may be compared to observations showing that the lifetime of the fine structure of solar prominences is about one hour; consequently, our computations suggest that the size of the unresolved threads could be of the order of 10 km only.
Gold thread implantation promotes hair growth in human and mice

PubMed Central

Kim, Jong-Hwan; Cho, Eun-Young; Kwon, Euna; Kim, Woo-Ho; Park, Jin-Sung; Lee, Yong-Soon

2017-01-01

Thread-embedding therapy has been widely applied for cosmetic purposes such as wrinkle reduction and skin tightening. In particular, gold thread was reported to support connective tissue regeneration, but its role in hair biology remains largely unknown due to lack of investigation. When we implanted gold thread and Happy Lift™ in a human patient for facial lifting, we unexpectedly found an increase in hair regrowth despite no use of hair growth medications. When embedded into the depilated dorsal skin of mice, gold thread or polyglycolic acid (PGA) thread, similarly to 5% minoxidil, significantly increased the number of hair follicles on day 14 after implantation. Moreover, hair re-growth promotion in the gold thread-implanted mice was significantly higher than that in the PGA thread group on day 11 after depilation. In particular, the skin tissue of gold thread-implanted mice showed stronger PCNA staining and higher collagen density compared with control mice. These results indicate that gold thread implantation can be an effective way to promote hair re-growth, although further confirmatory study is needed for more information on therapeutic mechanisms and long-term safety. PMID:29399026
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoemmen, Mark

2010-11-01

Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
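The structure that lets TSQR reduce communication can be shown in a few lines. The sketch below is a simplified, single-level illustration in pure Python and is not the Trilinos implementation: it computes only the R factor, uses modified Gram-Schmidt for the local factorizations where a production code would use Householder reflections, and assumes every row block has at least as many rows as columns and full column rank. The function names are hypothetical.

```python
def gram(a):
    """Return A^T A for a matrix stored as a list of rows (used for checking)."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


def qr_r(a):
    """Upper-triangular R of a thin QR factorization (modified Gram-Schmidt)."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # column-wise copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = sum(x * x for x in cols[j]) ** 0.5
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qi * x for qi, x in zip(q, cols[k]))
            cols[k] = [x - r[j][k] * qi for x, qi in zip(cols[k], q)]
    return r


def tsqr_r(a, block_rows):
    """R factor of a tall-skinny matrix: factor each row block on its own,
    stack the small R factors, then factor the stack once more."""
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    # Each local factorization is independent and could run on a separate
    # thread or process; only the small n-by-n R factors need to be exchanged.
    stacked = [row for blk in blocks for row in qr_r(blk)]
    return qr_r(stacked)
```

Because the two R factors share a positive diagonal, `tsqr_r(a, block_rows)` should agree with `qr_r(a)` up to rounding for a full-rank input, while only the n-by-n triangles ever leave a block.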
A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images

PubMed Central

Wang, Yangping; Wang, Song

2016-01-01

The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to its good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results, especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps, namely B-spline interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, since the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on a large number of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of the Graphics Processing Unit (GPU). PMID:28053653
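The lookup-table idea mentioned in the abstract above can be sketched as follows. This is a one-dimensional illustration under stated assumptions, not the authors' implementation: it uses the standard uniform cubic B-spline basis, assumes an integer control-point spacing in voxels, and assumes the control array is padded so that `control[i]` to `control[i + 3]` are the four points influencing cell `i`. All function names are hypothetical.

```python
def cubic_bspline_weights(u):
    """The four uniform cubic B-spline basis values for an offset u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    )


def build_lut(spacing):
    """Precompute the weights once for every voxel offset inside a cell."""
    return [cubic_bspline_weights(k / spacing) for k in range(spacing)]


def ffd_displacement_1d(x, control, spacing, lut):
    """Displacement at integer voxel x from the table instead of re-evaluating
    the cubic polynomials for every voxel."""
    i, k = divmod(x, spacing)
    return sum(w * control[i + l] for l, w in enumerate(lut[k]))
```

Since voxels with the same offset inside a cell share the same weights, the table has only `spacing` entries per axis, and each voxel's displacement can be computed independently, which is what makes the step amenable to one-thread-per-voxel execution.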
Orthorectification by Using Gpgpu Method

NASA Astrophysics Data System (ADS)

Sahin, H.; Kulur, S.

2012-07-01

Thanks to the nature of graphics processing, newly released products offer highly parallel processing units with high memory bandwidth and computational power of more than a teraflop. Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors with very fast computing capabilities and high memory bandwidth compared to central processing units (CPUs). Data-parallel computation can be briefly described as mapping data elements to parallel processing threads. The rapid development of GPU programmability and capabilities has attracted the attention of researchers dealing with complex problems that need intensive calculations. This interest has given rise to the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". Graphics processors are powerful hardware that is cheap and affordable, so they have become an alternative to computer processors. Graphics chips, which were once standard application hardware, have been transformed into modern, powerful and programmable processors to meet general needs. Especially in recent years, the use of graphics processing units in general purpose computation has led researchers and developers to this point. The biggest problem is that graphics processing units use programming models different from current programming methods. Therefore, efficient GPU programming requires re-coding the current program algorithm while considering the limitations and the structure of the graphics hardware. Currently, multi-core processors cannot be programmed using traditional programming methods, and the event procedure programming method cannot be used for programming them. GPUs are especially effective at repeating the same computing steps over many data elements when high accuracy is needed, so they carry out the computation more quickly and accurately. Compared to GPUs, CPUs, which perform just one computation at a time according to the flow control, are slower in performance. This structure can be evaluated for various applications of computer technology. This study covers how general purpose parallel programming and the computational power of GPUs can be used in photogrammetric applications, especially direct georeferencing. The direct georeferencing algorithm is coded using the GPGPU method and the CUDA (Compute Unified Device Architecture) programming language. Results provided by this method were compared with traditional CPU programming. In the other application, projective rectification is coded using the GPGPU method and the CUDA programming language. Sample images of various sizes were evaluated and the results of the programs compared. The GPGPU method can be used especially for repeating the same computations on highly dense data, thus finding the solution quickly.
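The phrase "mapping data elements to parallel processing threads" can be made concrete with a small sketch of projective rectification. This is an illustration in Python rather than the CUDA code the abstract refers to: the function names are hypothetical, nearest-neighbour resampling is assumed, and a thread pool over output rows only mimics the mapping (on a GPU each output pixel would typically get its own thread, and Python threads give no real speedup here).

```python
from concurrent.futures import ThreadPoolExecutor


def project(h, x, y):
    """Apply a 3x3 projective transform h to the point (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def rectify_row(args):
    """Fill one output row by nearest-neighbour lookup in the source image."""
    h_inv, src, y, width = args
    rows, cols = len(src), len(src[0])
    out = []
    for x in range(width):
        sx, sy = project(h_inv, x, y)
        ix, iy = int(round(sx)), int(round(sy))
        inside = 0 <= iy < rows and 0 <= ix < cols
        out.append(src[iy][ix] if inside else 0)  # 0 marks "no data"
    return out


def rectify(src, h_inv, width, height, workers=4):
    """Map each output row to a worker; rows do not depend on one another."""
    tasks = [(h_inv, src, y, width) for y in range(height)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(rectify_row, tasks))
```

Here `h_inv` maps output pixel coordinates back to source coordinates, so every output element is computed from read-only inputs, which is the property that makes the problem data-parallel.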
Scheduler for multiprocessor system switch with selective pairing

DOEpatents

Gara, Alan; Gschwind, Michael Karl; Salapura, Valentina

2015-01-06

System, method and computer program product for scheduling threads in a multiprocessing system with selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). The method configures the selective pairing facility to use checking to provide one highly reliable thread for high reliability and allocates threads to corresponding processor cores indicating need for hardware checking. The method configures the selective pairing facility to provide multiple independent cores and allocates threads to corresponding processor cores indicating inherent resilience.
Threaded biliary inside stents are a safe and effective therapeutic option in cases of malignant hilar obstruction.

PubMed

Inatomi, Osamu; Bamba, Shigeki; Shioya, Makoto; Mochizuki, Yosuke; Ban, Hiromitsu; Tsujikawa, Tomoyuki; Saito, Yasuharu; Andoh, Akira; Fujiyama, Yoshihide

2013-02-14

Although endoscopic biliary stents have been accepted as part of palliative therapy for cases of malignant hilar obstruction, the optimal endoscopic management regime remains controversial. In this study, we evaluated the safety and efficacy of placing a threaded stent above the sphincter of Oddi (threaded inside plastic stents, threaded PS) and compared the results with those of other stent types. Patients with malignant hilar obstruction, including those requiring biliary drainage for stent occlusion, were selected. Patients received one of the following endoscopic indwelling stents: threaded PS, conventional plastic stents (conventional PS), or metallic stents (MS). Duration of stent patency and the incidence of complications were compared in these patients. Forty-two patients underwent placement of endoscopic indwelling stents (threaded PS = 12, conventional PS = 17, MS = 13). The median duration of threaded PS patency was significantly longer than that of conventional PS patency (142 vs. 32 days; P = 0.04, log-rank test). The median duration of threaded PS and MS patency was not significantly different (142 vs. 150 days, P = 0.83). Stent migration did not occur in any group. Among patients who underwent threaded PS placement as a salvage therapy after MS obstruction due to tumor ingrowth, the median duration of MS patency was significantly shorter than that of threaded PS patency (123 vs. 240 days). Threaded PS are safe and effective in cases of malignant hilar obstruction; moreover, they are a suitable therapeutic option not only for initial drainage but also for salvage therapy.
Exploration of microfluidic devices based on multi-filament threads and textiles: A review

PubMed Central

Nilghaz, A.; Ballerini, D. R.; Shen, W.

2013-01-01

In this paper, we review the recent progress in the development of low-cost microfluidic devices based on multifilament threads and textiles for semi-quantitative diagnostic and environmental assays. Hydrophilic multifilament threads are capable of transporting aqueous and non-aqueous fluids via capillary action and possess desirable properties for building fluid transport pathways in microfluidic devices. Thread can be sewn onto various support materials to form fluid transport channels without the need for the patterned hydrophobic barriers essential for paper-based microfluidic devices. Thread can also be used to manufacture fabrics which can be patterned to achieve suitable hydrophilic-hydrophobic contrast, creating hydrophilic channels which allow the control of fluid flow. Furthermore, well established textile patterning methods and combinations of hydrophilic and hydrophobic threads can be applied to fabricate low-cost microfluidic devices that meet the low-cost and low-volume requirements. In this paper, we review the current limitations and shortcomings of multifilament thread and textile-based microfluidics, and the research efforts to date on the development of fluid flow control concepts and fabrication methods. We also present a summary of different methods for modelling the fluid capillary flow in microfluidic thread and textile-based systems. Finally, we summarize the published works on thread surface treatment methods and the potential of combining multifilament thread with other materials to construct devices with greater functionality. We believe these will be important research focuses of thread- and textile-based microfluidics in the future. PMID:24086179
Lack of ubiquitin immunoreactivities at both ends of neuropil threads. Possible bidirectional growth of neuropil threads.

PubMed Central

Iwatsubo, T.; Hasegawa, M.; Esaki, Y.; Ihara, Y.

1992-01-01

Immunocytochemically, neuropil threads (curly fibers) were investigated in the Alzheimer's disease brain using a confocal laser scanning fluorescence microscope by double labeling with tau/ubiquitin antibodies. Ubiquitin immunoreactivities were found to be lacking at one or both ends in more than 40% of tau-positive threads. Immunoelectron microscopy showed that bundles of paired helical filaments, which constitute neuropil threads, were positive for ubiquitin around their midportions, but often negative at their ends. Since it is reasonable to postulate that tau deposition as paired helical filaments precedes ubiquitination, the aforementioned observation suggests that the ends of the threads are newly formed portions, and thus the threads are often growing bidirectionally in small neuronal processes. PMID:1310831

Fatigue acceptance test limit criterion for larger diameter rolled thread fasteners

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kephart, A.R.

1997-05-01

This document describes a fatigue lifetime acceptance test criterion by which studs having rolled threads, larger than 1.0 inches in diameter, can be assured to meet minimum quality attributes associated with a controlled rolling process. This criterion is derived from a stress dependent, room temperature air fatigue database for test studs having 0.625 inch diameter threads of Alloys X-750 HTH and direct aged 625. Anticipated fatigue lives of larger threads are based on thread root elastic stress concentration factors which increase with increasing thread diameters. Over the thread size range of interest, a 30% increase in notch stress is equivalent to a factor of five (5X) reduction in fatigue life. The resulting diameter dependent fatigue acceptance criterion is normalized to the aerospace rolled thread acceptance standards for a 1.0 inch diameter, 0.125 inch pitch, Unified National thread with a controlled Root radius (UNR). Testing was conducted at a stress of 50% of the minimum specified material ultimate strength, 80 Ksi, and at a stress ratio (R) of 0.10. Limited test data for fastener diameters of 1.00 to 2.25 inches are compared to the acceptance criterion. Sensitivity of fatigue life of threads to test nut geometry variables was also shown to be dependent on notch stress conditions. Bearing surface concavity of the compression nuts and thread flank contact mismatch conditions can significantly affect the fastener fatigue life. Without improved controls these conditions could potentially provide misleading acceptance data. Alternate test nut geometry features are described and implemented in the rolled thread stud specification, MIL-DTL-24789(SH), to mitigate the potential effects on fatigue acceptance data.
Wedges for ultrasonic inspection

DOEpatents

Gavin, Donald A.

1982-01-01

An ultrasonic transducer device is provided which is used in ultrasonic inspection of the material surrounding a threaded hole and which comprises a wedge of plastic or the like including a curved threaded surface adapted to be screwed into the threaded hole and a generally planar surface on which a conventional ultrasonic transducer is mounted. The plastic wedge can be rotated within the threaded hole to inspect for flaws in the material surrounding the threaded hole.
Apparatus for accurately preloading auger attachment means for frangible protective material

NASA Technical Reports Server (NTRS)

Wood, K. E.

1983-01-01

Apparatus for preloading a spring loaded threaded member is described. The apparatus is formed of three telescoping tubes. The innermost tube has means to prevent rotation of the threaded member. The middle tube is threadedly engaged with the threaded member and by axial movement applies a preload thereto. The outer tube engages a nut which may be rotated to retain the threaded member in axial position to maintain the preload.
Effect of thread shape on screw stress concentration by photoelastic measurements

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dragoni, E.

1994-11-01

The screw stress concentration for six nut-bolt connections embodying three different thread profiles and two nut shapes is measured photoelastically. Buttress (nearly zero flank angle), trapezoidal (15-deg flank angle), and triangular (30-deg flank angle) thread forms are examined in combination with standard and lip-type nuts. The effect of the thread profile on the screw stress concentration appears to be dependent upon the kind of nut considered. If the fastening incorporates a standard nut, the buttress thread is stronger than the triangular one, which, in turn, behaves better than the trapezoidal contour. The improvement is roughly a 20% reduction in the stress concentration factor from the trapezoidal to the buttress thread. In the case of the lip nut, conversely, this tendency is somewhat reversed, with the trapezoidal thread performing slightly (but not decidedly) better than the other two shapes. Finally, averaged over all three thread forms, the lip nut exhibits a stress concentration factor which is about 50% lower than that of the standard nut.
Quick connect fastener

NASA Technical Reports Server (NTRS)

Weddendorf, Bruce (Inventor)

1994-01-01

A quick connect fastener and method of use is presented wherein the quick connect fastener is suitable for replacing available bolts and screws, the quick connect fastener being capable of installation by simply pushing a threaded portion of the connector into a member receptacle hole, the inventive apparatus being comprised of an externally threaded fastener having a threaded portion slidably mounted upon a stud or bolt shaft, wherein the externally threaded fastener portion is expandable by a preloaded spring member. The fastener, upon contact with the member receptacle hole, has the capacity of presenting cylindrical threads of a reduced diameter for insertion purposes and once inserted into the receiving threads of the receptacle member hole, are expandable for engagement of the receptacle hole threads forming a quick connect of the fastener and the member to be fastened, the quick connect fastener can be further secured by rotation after insertion, even to the point of locking engagement, the quick connect fastener being disengagable only by reverse rotation of the mated thread engagement.
Form and function of cnidarian spirocysts. III. Ultrastructure of the thread and the function of spirocysts

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mariscal, R.N.; McLean, R.B.; Hand, C.

1977-01-01

Unlike most nematocysts, undischarged spirocyst threads bear hollow tubules rather than spines. The undischarged tubules are interconnected in hexagonal arrays and appear to be arranged in bundles along the length of the thread. Although the wall of the thread is folded in length and width, the tubules are not. Upon discharge and contact with sea water, the tubules solubilize and adhere to various substrates and prey. Traction between such objects and the everting thread causes the tubules to spin out into a web or meshwork of fine microfibrillae. Lack of contact of the everting thread with objects results in the tubules forming small droplets of partially solubilized material, some of which appear to be arranged in a helical pattern around the thread. The web or meshwork formed by the solubilized tubules in contact with various substrates probably serves to increase significantly the surface area and adhesive properties of the everted spirocyst thread.
Thread bonds in molecules

NASA Astrophysics Data System (ADS)

Ivlev, B.

2017-07-01

Unusual chemical bonds are proposed. Each bond is characterized by a thread of small radius, 10^-11 cm, extended between two nuclei in a molecule. An analogue of a potential well, with a depth on the MeV scale, is formed within the thread. This occurs due to the local reduction of zero point electromagnetic energy. This is similar to the formation of the Casimir well. The electron-photon interaction alone is not sufficient for formation of the thread state. The mechanism of electron mass generation is involved in the close vicinity, 10^-16 cm, of the thread. Thread bonds are stable and cannot be created or destroyed in chemical or optical processes.
Flare particle acceleration in the interaction of twisted coronal flux ropes

NASA Astrophysics Data System (ADS)

Threlfall, J.; Hood, A. W.; Browning, P. K.

2018-03-01

Aim. The aim of this work is to investigate and characterise non-thermal particle behaviour in a three-dimensional (3D) magnetohydrodynamical (MHD) model of unstable multi-threaded flaring coronal loops. Methods: We have used a numerical scheme which solves the relativistic guiding centre approximation to study the motion of electrons and protons. The scheme uses snapshots from high resolution numerical MHD simulations of coronal loops containing two threads, where a single thread becomes unstable and (in one case) destabilises and merges with an additional thread. Results: The particle responses to the reconnection and fragmentation in MHD simulations of two loop threads are examined in detail. We illustrate the role played by uniform background resistivity and distinguish this from the role of anomalous resistivity using orbits in an MHD simulation where only one thread becomes unstable without destabilising further loop threads. We examine the (scalable) orbit energy gains and final positions recovered at different stages of a second MHD simulation wherein a secondary loop thread is destabilised by (and merges with) the first thread. We compare these results with other theoretical particle acceleration models in the context of observed energetic particle populations during solar flares.
Long-term effect of the insoluble thread-lifting technique.

PubMed

Fukaya, Mototsugu

2017-01-01

Although the thread-lifting technique for sagging faces has become more common and popular, medical literature evaluating its effects is scarce. Studies on its long-term prognosis are particularly uncommon. One hundred individuals who had previously undergone insoluble thread-lifting were retrospectively investigated. Photos in frontal and oblique views from the first and last visits were evaluated by six female individuals by guessing the patients' ages. The mean guessed age was defined as the apparent age, and the difference between the real and apparent ages was defined as the youth value. The difference between the youth values before and after the thread-lift was defined as the rejuvenation effect and analyzed in relation to the time since the operation, the number of threads used and the number of thread-lift operations performed. The rejuvenation effect decreased over the first year after the operation, but showed an increasing trend thereafter. The rejuvenation effect increased with the number of threads used and the number of thread-lift operations performed. The insoluble thread-lifting technique appears to be associated with both early and late effects. The rejuvenation effect appeared to decrease during the first year, but increased thereafter. A multicenter trial is necessary to confirm these findings.
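The definitions in this abstract reduce to simple arithmetic. The following is a minimal sketch, assuming the youth value is the real age minus the apparent age (the abstract does not state the sign) and using invented guesses for one hypothetical patient; the function names are illustrative and not taken from the study.

```python
from statistics import mean

def youth_value(real_age, guessed_ages):
    """Real age minus apparent age, where apparent age is the mean guess."""
    apparent_age = mean(guessed_ages)
    return real_age - apparent_age

def rejuvenation_effect(before, after):
    """Youth value at the last visit minus youth value at the first visit."""
    return youth_value(*after) - youth_value(*before)

# Invented data: (real age, guesses from six evaluators) at each visit.
first_visit = (52, [50, 52, 51, 49, 50, 48])
last_visit = (54, [49, 50, 48, 49, 50, 48])
effect = rejuvenation_effect(first_visit, last_visit)
```

Under this sign convention a positive effect means the patient looks younger relative to their real age at the last visit than at the first.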
The Ethylene Responsive Factor Required for Nodulation 1 (ERN1) Transcription Factor Is Required for Infection-Thread Formation in Lotus japonicus.

PubMed

Kawaharada, Yasuyuki; James, Euan K; Kelly, Simon; Sandal, Niels; Stougaard, Jens

2017-03-01

Several hundred genes are transcriptionally regulated during infection-thread formation and development of nitrogen-fixing root nodules. We have characterized a set of Lotus japonicus mutants impaired in root-nodule formation and found that the causative gene, Ern1, encodes a protein with a characteristic APETALA2/Ethylene Responsive Factor (AP2/ERF) transcription-factor domain. Phenotypic characterization of four ern1 alleles shows that infection pockets are formed but root-hair infection threads are absent. Formation of root-nodule primordia is delayed and no normal transcellular infection threads are found in the infected nodules. Corroborating the role of ERN1 (ERF Required for Nodulation1) in nodule organogenesis, spontaneous nodulation induced by an autoactive CCaMK and cytokinin-induced nodule primordia were not observed in ern1 mutants. Expression of Ern1 is induced in the susceptible zone by Nod factor treatment or rhizobial inoculation. At the cellular level, the pErn1:GUS reporter is highly expressed in root epidermal cells of the susceptible zone and in the cortical cells that form nodule primordia. The genetic regulation of this cellular expression pattern was further investigated in symbiotic mutants. Nod factor induction of Ern1 in epidermal cells was found to depend on Nfr1, Cyclops, and Nsp2 but was independent of Nin and Nf-ya1. These results suggest that ERN1 functions as a transcriptional regulator involved in the formation of infection threads and development of nodule primordia and may coordinate these two processes.
Thread Migration in the Presence of Pointers

NASA Technical Reports Server (NTRS)

Cronk, David; Haines, Matthew; Mehrotra, Piyush

1996-01-01

Dynamic migration of lightweight threads supports both data locality and load balancing. However, migrating threads that contain pointers referencing data in both the stack and heap remains an open problem. In this paper we describe a technique by which threads with pointers referencing both stack and non-shared heap data can be migrated such that the pointers remain valid after migration. As a result, threads containing pointers can now be migrated between processors in a homogeneous distributed memory environment.
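The abstract does not describe the mechanism used to keep pointers valid. As an illustration only, the sketch below shows one generic approach, rebasing pointers that target the migrated stack by the difference between the old and new base addresses. The stack is modeled as a list of words, and `pointer_slots` (the indices known to hold pointers) is a hypothetical input, not something the paper specifies.

```python
def relocate(words, pointer_slots, old_base, new_base):
    """Return a copy of a stack image with in-stack pointers rebased.

    words         -- stack contents, one entry per word
    pointer_slots -- indices of words that hold pointers (assumed known)
    old_base      -- address of words[0] on the source processor
    new_base      -- address of words[0] on the destination processor
    """
    size = len(words)
    delta = new_base - old_base
    moved = list(words)
    for slot in pointer_slots:
        target = words[slot]
        # Only pointers into the migrated stack need adjusting; pointers
        # to data outside it are left untouched in this sketch.
        if old_base <= target < old_base + size:
            moved[slot] = target + delta
    return moved

stack = [7, 0x1000, 42, 0x9000]   # slot 1 points at the stack's own base
migrated = relocate(stack, [1, 3], old_base=0x1000, new_base=0x5000)
```

Heap data would need its own treatment (for example, migrating the referenced blocks together with the thread), which this sketch does not attempt.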
Real-time inextensible surgical thread simulation.

PubMed

Xu, Lang; Liu, Qian

2018-03-27

This paper discusses a real-time simulation method of inextensible surgical thread based on the Cosserat rod theory using position-based dynamics (PBD). The method realizes stable twining and knotting of surgical thread while including inextensibility, bending, twisting and coupling effects. The Cosserat rod theory is used to model the nonlinear elastic behavior of surgical thread. The surgical thread model is solved with PBD to achieve a real-time, extremely stable simulation. Due to the one-dimensional linear structure of surgical thread, the direct solution of the distance constraint based on tridiagonal matrix algorithm is used to enhance stretching resistance in every constraint projection iteration. In addition, continuous collision detection and collision response guarantee a large time step and high performance. Furthermore, friction is integrated into the constraint projection process to stabilize the twining of multiple threads and complex contact situations. Through comparisons with existing methods, the surgical thread maintains constant length under large deformation after applying the direct distance constraint in our method. The twining and knotting of multiple threads correspond to stable solutions to contact and friction forces. A surgical suture scene is also modeled to demonstrate the practicality and simplicity of our method. Our method achieves stable and fast simulation of inextensible surgical thread. Benefiting from the unified particle framework, the rigid body, elastic rod, and soft body can be simultaneously simulated. The method is appropriate for applications in virtual surgery that require multiple dynamic bodies.
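Because each particle of a thread is coupled only to its two neighbours, the distance constraints lead to a tridiagonal linear system. The sketch below is a generic tridiagonal matrix (Thomas) algorithm, not the authors' code; the argument layout (`lower[0]` and `upper[-1]` unused) is a convention chosen here, and the system is assumed to be well conditioned enough that no pivoting is needed.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) with the Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n   # modified upper-diagonal coefficients
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the lower diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Small example: a 3x3 system whose exact solution is x = [1, 1, 1].
solution = solve_tridiagonal([0.0, -1.0, -1.0],
                             [2.0, 2.0, 2.0],
                             [-1.0, -1.0, 0.0],
                             [1.0, 0.0, 1.0])
```

The linear cost per solve is what makes a direct solution affordable inside every constraint projection iteration.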
High precision optomechanical assembly using threads as mechanical reference

NASA Astrophysics Data System (ADS)

Lamontagne, Frédéric; Desnoyers, Nichola; Bergeron, Guy; Cantin, Mario

2016-09-01

A convenient method to assemble optomechanical components is to use a threaded interface. For example, lenses are often secured inside barrels using threaded rings. In other cases, multiple optical sub-assemblies such as lens barrels can be threaded to each other. Threads have the advantage of providing a simple assembly method, being easy to manufacture, and offering a compact mechanical design. On the other hand, threads are not considered to provide accurate centering between parts because of the assembly clearance between the inner and outer threads. For that reason, threads are often used in conjunction with precision cylindrical surfaces to limit the radial clearance between the parts to be centered. Therefore, tight manufacturing tolerances are needed on these pilot diameters, which affect the cost of the optical assembly. This paper presents a new optomechanical approach that uses threads as mechanical reference. This innovative method relies on geometric principles to auto-center parts to each other with a very low centering error that is usually less than 5 μm. The method makes it possible to auto-center an optical group in a main barrel, to perform an axial adjustment of an optical group inside a main barrel, and to stack multiple barrels. In conjunction with the lens auto-centering method that also uses threads as a mechanical reference, this novel solution opens new possibilities to realize a variety of different high precision optomechanical assemblies at lower cost.
[Time and use of discussion forums in type 1 diabetes: contribution to patient education].

PubMed

Harry, Isabelle; Gagnayre, Rémi

2013-01-01

The purpose of this study was to elucidate the concept of temporality in discussions on forums used by individuals concerned by type 1 diabetes: adults and parents of children. The contents of messages were first converted into skills, and their temporality was then analysed, particularly in terms of the duration of active threads. Two types of temporality are involved in the use of forums: prescribed time governed by the therapeutic requirements related to a chronic disease and the decisions to be taken, and open-ended social time available on the Internet and the resulting reflexive processes. Our results show that topics relating to self-care and adaptation skills are often discussed and new threads on the topic are frequently introduced. Considerable diversity in the activity level associated with the various threads was observed, as most threads were only active for short periods. Following this study, our research perspectives concern: (i) the ways in which patients and their families reconcile the temporality dictated by a chronic disease (prescribed time) with the open-ended social time available on the Internet; and (ii) the ways in which this temporality is characteristic of patient learning processes via discussion forums. Future research will focus on the concept of rythmo-apprenance (rhythmic learning) in therapeutic patient education.
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Threaded biliary inside stents are a safe and effective therapeutic option in cases of malignant hilar obstruction

PubMed Central

2013-01-01

Background: Although endoscopic biliary stents have been accepted as part of palliative therapy for cases of malignant hilar obstruction, the optimal endoscopic management regime remains controversial. In this study, we evaluated the safety and efficacy of placing a threaded stent above the sphincter of Oddi (threaded inside plastic stents, threaded PS) and compared the results with those of other stent types. Methods: Patients with malignant hilar obstruction, including those requiring biliary drainage for stent occlusion, were selected. Patients received one of the following endoscopic indwelling stents: threaded PS, conventional plastic stents (conventional PS), or metallic stents (MS). Duration of stent patency and the incidence of complications were compared in these patients. Results: Forty-two patients underwent placement of endoscopic indwelling stents (threaded PS = 12, conventional PS = 17, MS = 13). The median duration of threaded PS patency was significantly longer than that of conventional PS patency (142 vs. 32 days; P = 0.04, logrank test). The median duration of threaded PS and MS patency was not significantly different (142 vs. 150 days, P = 0.83). Stent migration did not occur in any group. Among patients who underwent threaded PS placement as a salvage therapy after MS obstruction due to tumor ingrowth, the median duration of MS patency was significantly shorter than that of threaded PS patency (123 vs. 240 days). Conclusions: Threaded PS are safe and effective in cases of malignant hilar obstruction; moreover, they are a suitable therapeutic option not only for initial drainage but also for salvage therapy. PMID:23410217
78 FR 12718 - Certain Steel Threaded Rod From the People's Republic of China: Affirmative Final Determination...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-02-25

... DEPARTMENT OF COMMERCE International Trade Administration [A-570-932] Certain Steel Threaded Rod... Preliminary Determination of the circumvention inquiry concerning the antidumping duty order on certain steel threaded rod (``steel threaded rod'') from the People's Republic of China (``PRC'').\\1\\ The period of...
A hierarchical wavefront reconstruction algorithm for gradient sensors

NASA Astrophysics Data System (ADS)

Bharmal, Nazim; Bitenc, Urban; Basden, Alastair; Myers, Richard

2013-12-01

ELT-scale extreme adaptive optics systems will require new approaches to compute the wavefront suitably quickly, when the computational burden of applying a MVM is no longer practical. An approach is demonstrated here which is hierarchical in transforming wavefront slopes from a WFS into a wavefront, and then to actuator values. First, simple integration in 1D is used to create 1D-wavefront estimates with unknown starting points at the edges of independent spatial domains. Second, these starting points are estimated globally. Because these starting points are a sub-set of the overall grid where wavefront values are to be estimated, sparse representations are produced and numerical complexity can be chosen by the spacing of the starting point grid relative to the overall grid. Using a combination of algebraic expressions, sparse representation, and a conjugate gradient solver, the number of non-parallelized operations for reconstruction on a 100x100 sub-aperture sized problem is ~600,000 or O(N^(3/2)), which is approximately the same as for each thread of a MVM solution parallelized over 100 threads. To reduce the effects of noise propagation within each domain, a noise reduction algorithm can be applied which ensures the continuity of the wavefront. To apply this additional step has a cost of ~1,200,000 operations. We conclude by briefly discussing how the final step of converting from wavefront to actuator values can be achieved.
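The first stage described above, 1D integration of slopes with an unknown starting value, can be sketched as a running sum. This is a minimal illustration under the assumption of uniformly spaced slope samples; the function name and the `start` parameter are chosen here and do not come from the paper.

```python
def integrate_slopes(slopes, spacing=1.0, start=0.0):
    """Integrate 1D wavefront slopes into wavefront values.

    The result is only defined up to the starting value `start`, which in
    the hierarchical scheme would be estimated in a later, global step.
    """
    wavefront = [start]
    for slope in slopes:
        wavefront.append(wavefront[-1] + slope * spacing)
    return wavefront

row = integrate_slopes([1.0, 2.0, -1.0])              # start fixed at 0
shifted = integrate_slopes([1.0, 2.0, -1.0], start=0.5)

# Order-of-magnitude check: with N = 100 * 100 sub-apertures,
# N**1.5 is 1e6, the same order as the ~600,000 operations quoted.
n_subapertures = 100 * 100
n_to_three_halves = n_subapertures ** 1.5
```

Changing `start` shifts every value in the row by the same constant, which is why only the starting points need to be solved for globally.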
Valve actuator for internal combustion engine

DOE Office of Scientific and Technical Information (OSTI.GOV)

Uchida, T.

1987-06-16

A valve actuating mechanism is described for an overhead valve and overhead cam type internal combustion engine in which the camshaft is positioned above and between the valve and a cam follower seat member in a cylinder head of the engine. The cam follower seat member is threadedly mounted in the cylinder head and has a semi-spherical recess facing upwardly. A cam follower has an adjustable bolt threadedly received in one end of the cam follower. The adjustable bolt has a spherical fulcrum engaging the semispherical recess of the seat member. The cam follower also has a downwardly facing means on the other end for engaging the valve and an upwardly facing slipper face for sliding engagement with a cam on the camshaft. The cam is adapted to rotate across the slipper face in the direction of the valve. The slipper face has a surface shape for engaging the cam at the start of valve-lifting movement of the cam follower at a point through which a line tangent to the slipper face is substantially parallel to a line through contact points between the cam follower. The seat member and valve for minimizing the lateral forces are imposed on the cam follower by the cam at the start of the valve-lifting movement.
H-BLAST: a fast protein sequence alignment toolkit on heterogeneous computers with GPUs.

PubMed

Ye, Weicai; Chen, Ying; Zhang, Yongdong; Xu, Yuesheng

2017-04-15

Sequence alignment is a fundamental problem in bioinformatics. BLAST is a routinely used tool for this purpose with over 118 000 citations in the past two decades. As the size of bio-sequence databases grows exponentially, the computational speed of alignment software must be improved. We develop the heterogeneous BLAST (H-BLAST), a fast parallel search tool for a heterogeneous computer that couples CPUs and GPUs, to accelerate BLASTX and BLASTP, basic tools of NCBI-BLAST. H-BLAST employs a locally decoupled seed-extension algorithm for better performance on GPUs, and offers a performance tuning mechanism for better efficiency among various combinations of CPUs and GPUs. H-BLAST produces alignment results identical to those of NCBI-BLAST and its computational speed is much faster than that of NCBI-BLAST. Speedups achieved by H-BLAST over sequential NCBI-BLASTP (resp. NCBI-BLASTX) range mostly from 4 to 10 (resp. 5 to 7.2). With 2 CPU threads and 2 GPUs, H-BLAST can be faster than 16-threaded NCBI-BLASTX. Furthermore, H-BLAST is 1.5-4 times faster than GPU-BLAST. Availability: https://github.com/Yeyke/H-BLAST.git. Contact: yux06@syr.edu. Supplementary data are available at Bioinformatics online.

An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing

NASA Astrophysics Data System (ADS)

Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng

2018-02-01

De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we propose a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we design an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopt an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) are adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improves the efficiency. A numerical example employing real large-scale, 3D seismic data shows that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significantly reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
Three-dimensional imaging of threading dislocations in GaN crystals using two-photon excitation photoluminescence

NASA Astrophysics Data System (ADS)

Tanikawa, Tomoyuki; Ohnishi, Kazuki; Kanoh, Masaya; Mukai, Takashi; Matsuoka, Takashi

2018-03-01

The three-dimensional imaging of threading dislocations in GaN films was demonstrated using two-photon excitation photoluminescence. The threading dislocations were shown as dark lines. The spatial resolutions near the surface were about 0.32 and 3.2 µm for the in-plane and depth directions, respectively. The threading dislocations with a density less than 10^8 cm^-2 were resolved, although the aberration induced by the refractive index mismatch was observed. The decrease in threading dislocation density was clearly observed by increasing the GaN film thickness. This can be considered a novel method for characterizing threading dislocations in GaN films without any destructive preparations.
Entanglement-Gradient Routing for Quantum Networks.

PubMed

Gyongyosi, Laszlo; Imre, Sandor

2017-10-27

We define the entanglement-gradient routing scheme for quantum repeater networks. The routing framework fuses the fundamentals of swarm intelligence and quantum Shannon theory. Swarm intelligence provides nature-inspired solutions for problem solving. Motivated by models of social insect behavior, the routing is performed using parallel threads to determine the shortest path via the entanglement gradient coefficient, which describes the feasibility of the entangled links and paths of the network. The routing metrics are derived from the characteristics of entanglement transmission and relevant measures of entanglement distribution in quantum networks. The method allows a moderate complexity decentralized routing in quantum repeater networks. The results can be applied in experimental quantum networking, future quantum Internet, and long-distance quantum communications.
Design and implementation of online automatic judging system

NASA Astrophysics Data System (ADS)

Liang, Haohui; Chen, Chaojie; Zhong, Xiuyu; Chen, Yuefeng

2017-06-01

To address the low efficiency and poor reliability of manual judging in programming training and competitions, we design an Online Automatic Judging (OAJ) system. The OAJ system, which comprises a sandbox judging side and a Web side, automatically compiles and runs the submitted code and generates evaluation scores and corresponding reports. To prevent malicious code from damaging the system, the OAJ system runs submissions in a sandbox. The OAJ system uses thread pools to run tests in parallel and adopts database optimization mechanisms, such as horizontal table splitting, to improve system performance and resource utilization. The test results show that the system has high performance, high reliability, high stability and excellent extensibility.
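The use of a thread pool for parallel judging can be sketched as follows. This is a minimal illustration, not the OAJ implementation: `judge` is a hypothetical stand-in that only compares already-captured program output with the expected output, whereas the system described above compiles and runs each submission inside a sandbox.

```python
from concurrent.futures import ThreadPoolExecutor

def judge(submission):
    """Hypothetical judging step: compare produced output with expected."""
    name, produced, expected = submission
    verdict = "accepted" if produced.strip() == expected.strip() else "wrong answer"
    return name, verdict

def judge_all(submissions, workers=4):
    """Judge submissions concurrently on a fixed-size pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with submissions.
        return dict(pool.map(judge, submissions))

verdicts = judge_all([
    ("alice", "42\n", "42"),
    ("bob", "41", "42"),
])
```

A fixed pool size caps the number of submissions judged at once, which keeps resource use bounded when many submissions arrive together.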
Message passing with queues and channels

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dozsa, Gabor J; Heidelberger, Philip; Kumar, Sameer

In an embodiment, a reception thread receives a source node identifier, a type, and a data pointer from an application and, in response, creates a receive request. If the source node identifier specifies a source node, the reception thread adds the receive request to a fast-post queue. If a message received from a network does not match a receive request on a posted queue, a polling thread adds a receive request that represents the message to an unexpected queue. If the fast-post queue contains the receive request, the polling thread removes the receive request from the fast-post queue. If the receive request that was removed from the fast-post queue does not match the receive request on the unexpected queue, the polling thread adds the receive request that was removed from the fast-post queue to the posted queue. The reception thread and the polling thread execute asynchronously from each other.
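One possible reading of the three-queue flow described above is sketched below in Python as a single-threaded simulation. Everything beyond the queue names is an assumption made for illustration: the `Request` fields, the rule that two requests match when source and type agree, the `completed` list that records matched pairs, and the method names. The case where the source identifier does not name a specific node is not modeled, and real reception and polling threads would need thread-safe queues.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    source: int          # source node identifier
    type: str            # message type
    data: object = None  # stands in for the data pointer

def matches(a, b):
    # Assumption: requests match when source and type agree.
    return a.source == b.source and a.type == b.type

def take_match(queue, target):
    """Remove and return the first entry matching target, else None."""
    for item in queue:
        if matches(item, target):
            queue.remove(item)
            return item
    return None

class Endpoint:
    def __init__(self):
        self.fast_post = deque()   # written by the reception side
        self.posted = deque()      # owned by the polling side
        self.unexpected = deque()  # owned by the polling side
        self.completed = []        # hypothetical record of matched pairs

    def post_receive(self, source, type_, data=None):
        """Reception side: create a receive request and fast-post it."""
        req = Request(source, type_, data)
        self.fast_post.append(req)
        return req

    def on_message(self, msg):
        """Polling side: match a network message against posted receives."""
        req = take_match(self.posted, msg)
        if req is None:
            self.unexpected.append(msg)
        else:
            self.completed.append((req, msg))

    def drain_fast_post(self):
        """Polling side: move fast-posted receives to posted unless matched."""
        while self.fast_post:
            req = self.fast_post.popleft()
            msg = take_match(self.unexpected, req)
            if msg is None:
                self.posted.append(req)
            else:
                self.completed.append((req, msg))

ep = Endpoint()
ep.on_message(Request(3, "halo", b"x"))  # arrives before any receive is posted
ep.post_receive(3, "halo")
ep.post_receive(7, "halo")
ep.drain_fast_post()
```

After these calls the early message from node 3 has been paired with its receive, while the receive for node 7 waits on the posted queue until a matching message arrives.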
49 CFR 178.42 - Specification 3E seamless steel cylinders.

Code of Federal Regulations, 2012 CFR

2012-10-01

... (valves, fuse plugs, etc.) for those openings. Threads conforming to the following are required on openings. (1) Threads must be clean cut, even, without checks, and to gauge. (2) Taper threads, when used, must be of length not less than as specified for American Standard taper pipe threads. (3) Straight...
49 CFR 178.42 - Specification 3E seamless steel cylinders.

Code of Federal Regulations, 2013 CFR

2013-10-01

... (valves, fuse plugs, etc.) for those openings. Threads conforming to the following are required on openings. (1) Threads must be clean cut, even, without checks, and to gauge. (2) Taper threads, when used, must be of length not less than as specified for American Standard taper pipe threads. (3) Straight...
Neuropil threads occur in dendrites of tangle-bearing nerve cells.

PubMed

Braak, H; Braak, E

1988-01-01

Transparent Golgi preparations counterstained for Alzheimer's neurofibrillary changes rendered possible the demonstration of neuropil threads in defined cellular processes. Only dendrites of tangle-bearing cortical nerve cells were found to contain neuropil threads. Processes of glial cells as well as axons present in the material were devoid of neuropil threads.
49 CFR 178.42 - Specification 3E seamless steel cylinders.

Code of Federal Regulations, 2010 CFR

2010-10-01

... (valves, fuse plugs, etc.) for those openings. Threads conforming to the following are required on openings. (1) Threads must be clean cut, even, without checks, and to gauge. (2) Taper threads, when used, must be of length not less than as specified for American Standard taper pipe threads. (3) Straight...
Threaded Cognition: An Integrated Theory of Concurrent Multitasking

ERIC Educational Resources Information Center

Salvucci, Dario D.; Taatgen, Niels A.

2008-01-01

The authors propose the idea of threaded cognition, an integrated theory of concurrent multitasking--that is, performing 2 or more tasks at once. Threaded cognition posits that streams of thought can be represented as threads of processing coordinated by a serial procedural resource and executed across other available resources (e.g., perceptual…
49 CFR 178.42 - Specification 3E seamless steel cylinders.

Code of Federal Regulations, 2011 CFR

2011-10-01

... (valves, fuse plugs, etc.) for those openings. Threads conforming to the following are required on openings. (1) Threads must be clean cut, even, without checks, and to gauge. (2) Taper threads, when used, must be of length not less than as specified for American Standard taper pipe threads. (3) Straight...
49 CFR 178.42 - Specification 3E seamless steel cylinders.

Code of Federal Regulations, 2014 CFR

2014-10-01

... (valves, fuse plugs, etc.) for those openings. Threads conforming to the following are required on openings. (1) Threads must be clean cut, even, without checks, and to gauge. (2) Taper threads, when used, must be of length not less than as specified for American Standard taper pipe threads. (3) Straight...
A Primer on the Effective Use of Threaded Discussion Forums.

ERIC Educational Resources Information Center

Kirk, James J.; Orr, Robert L.

Threaded discussion forums are asynchronous, World Wide Web-based discussions occurring under a number of different topics called threads. By allowing students to post, read, and respond to messages independently of time or place, threaded discussion forums give students an opportunity for deeper reflection and more thoughtful replies than chat…
46 CFR 164.023-7 - Performance; non-standard thread.

Code of Federal Regulations, 2010 CFR

2010-10-01

... 46 Shipping 6 2010-10-01 2010-10-01 false Performance; non-standard thread. 164.023-7 Section 164... Performance; non-standard thread. (a) Use Codes 1, 2, 3, 4BC, 4RB, 5 (any). Each non-standard thread which...) testing machine. (2) Single strand breaking strength (after weathering). After exposure in a sunshine...
46 CFR 164.023-7 - Performance; non-standard thread.

Code of Federal Regulations, 2011 CFR

2011-10-01

... 46 Shipping 6 2011-10-01 2011-10-01 false Performance; non-standard thread. 164.023-7 Section 164... Performance; non-standard thread. (a) Use Codes 1, 2, 3, 4BC, 4RB, 5 (any). Each non-standard thread which...) testing machine. (2) Single strand breaking strength (after weathering). After exposure in a sunshine...
Hyperunstable matrix proteins in the byssus of Mytilus galloprovincialis.

PubMed

Sagert, Jason; Waite, J Herbert

2009-07-01

The marine mussel Mytilus galloprovincialis is tethered to rocks in the intertidal zone by a holdfast known as the byssus. Functioning as a shock absorber, the byssus is composed of threads, the primary molecular components of which are collagen-containing proteins (preCOLs) that largely dictate the higher order self-assembly and mechanical properties of byssal threads. The threads contain additional matrix components that separate and perhaps lubricate the collagenous microfibrils during deformation in tension. In this study, the thread matrix proteins (TMPs), a glycine-, tyrosine- and asparagine-rich protein family, were shown to possess unique repeated sequence motifs, significant transcriptional heterogeneity and were distributed throughout the byssal thread. Deamidation was shown to occur at a significant rate in a recombinant TMP and in the byssal thread as a function of time. Furthermore, charge heterogeneity presumably due to deamidation was observed in TMPs extracted from threads. The TMPs were localized to the preCOL-containing secretory granules in the collagen gland of the foot and are assumed to provide a viscoelastic matrix around the collagenous fibers in byssal threads.
GaAsP/InGaP HBTs grown epitaxially on Si substrates: Effect of dislocation density on DC current gain

NASA Astrophysics Data System (ADS)

Heidelberger, Christopher; Fitzgerald, Eugene A.

2018-04-01

Heterojunction bipolar transistors (HBTs) with GaAs0.825P0.175 bases and collectors and In0.40Ga0.60P emitters were integrated monolithically onto Si substrates. The HBT structures were grown epitaxially on Si via metalorganic chemical vapor deposition, using SiGe compositionally graded buffers to accommodate the lattice mismatch while maintaining threading dislocation density at an acceptable level (~3 × 10^6 cm^-2). GaAs0.825P0.175 is used as an active material instead of GaAs because of its higher bandgap (increased breakdown voltage) and closer lattice constant to Si. Misfit dislocation density in the active device layers, measured by electron-beam-induced current, was reduced by making iterative changes to the epitaxial structure. This optimized process culminated in a GaAs0.825P0.175/In0.40Ga0.60P HBT grown on Si with a DC current gain of 156. By considering the various GaAsP/InGaP HBTs grown on Si substrates alongside several control devices grown on GaAs substrates, a wide range of threading dislocation densities and misfit dislocation densities in the active layers could be correlated with HBT current gain. The effect of threading dislocations on current gain was moderated by the reduction in minority carrier lifetime in the base region, in agreement with existing models for GaAs light-emitting diodes and photovoltaic cells. Current gain was shown to be extremely sensitive to misfit dislocations in the active layers of the HBT, much more sensitive than to threading dislocations. We develop a model for this relationship where increased base current is mediated by Fermi level pinning near misfit dislocations.
Differential regulation of the Epr3 receptor coordinates membrane-restricted rhizobial colonization of root nodule primordia

PubMed Central

Kawaharada, Yasuyuki; Nielsen, Mette W.; Kelly, Simon; James, Euan K.; Andersen, Kasper R.; Rasmussen, Sheena R.; Füchtbauer, Winnie; Madsen, Lene H.; Heckmann, Anne B.; Radutoiu, Simona; Stougaard, Jens

2017-01-01

In Lotus japonicus, a LysM receptor kinase, EPR3, distinguishes compatible and incompatible rhizobial exopolysaccharides at the epidermis. However, the role of this recognition system in bacterial colonization of the root interior is unknown. Here we show that EPR3 advances the intracellular infection mechanism that mediates infection thread invasion of the root cortex and nodule primordia. At the cellular level, Epr3 expression delineates progression of infection threads into nodule primordia and cortical infection thread formation is impaired in epr3 mutants. Genetic dissection of this developmental coordination showed that Epr3 is integrated into the symbiosis signal transduction pathways. Further analysis showed differential expression of Epr3 in the epidermis and cortical primordia and identified key transcription factors controlling this tissue specificity. These results suggest that exopolysaccharide recognition is reiterated during the progressing infection and that EPR3 perception of compatible exopolysaccharide promotes an intracellular cortical infection mechanism maintaining bacteria enclosed in plant membranes. PMID:28230048
Self-cleaning threaded rod spinneret for high-efficiency needleless electrospinning

NASA Astrophysics Data System (ADS)

Zheng, Gaofeng; Jiang, Jiaxin; Wang, Xiang; Li, Wenwang; Zhong, Weizheng; Guo, Shumin

2018-07-01

High-efficiency production of nanofibers is the key to the application of electrospinning technology. This work focuses on multi-jet electrospinning, in which a threaded rod electrode is utilized as the needleless spinneret to achieve high-efficiency production of nanofibers. A slipper block, which fits into and moves through the threaded rod, is designed to transfer polymer solution evenly to the surface of the rod spinneret. The relative motion between the slipper block and the threaded rod electrode promotes the unstable fluctuation of the solution surface, thus the rotation of threaded rod electrode decreases the critical voltage for the initial multi-jet ejection and the diameter of nanofibers. The residual solution on the surface of threaded rod is cleaned up by the moving slipper block, showing a great self-cleaning ability, which ensures the stable multi-jet ejection and increases the productivity of nanofibers. Each thread of the threaded rod electrode serves as an independent spinneret, which enhances the electric field strength and constrains the position of the Taylor cone, resulting in high productivity of uniform nanofibers. The diameter of nanofibers decreases with the increase of threaded rod rotation speed, and the productivity increases with the solution flow rate. The rotation of electrode provides an excess force for the ejection of charged jets, which also contributes to the high-efficiency production of nanofibers. The maximum productivity of nanofibers from the threaded rod spinneret is 5-6 g/h, about 250-300 times as high as that from the single-needle spinneret. The self-cleaning threaded rod spinneret is an effective way to realize continuous multi-jet electrospinning, which promotes industrial applications of uniform nanofibrous membrane.
Multi-threading performance of Geant4, MCNP6, and PHITS Monte Carlo codes for tetrahedral-mesh geometry.

PubMed

Han, Min Cheol; Yeom, Yeon Soo; Lee, Hyun Su; Shin, Bangho; Kim, Chan Hyeong; Furuta, Takuya

2018-05-04

In this study, the multi-threading performance of the Geant4, MCNP6, and PHITS codes was evaluated as a function of the number of threads (N) and the complexity of the tetrahedral-mesh phantom. For this, three tetrahedral-mesh phantoms of varying complexity (simple, moderately complex, and highly complex) were prepared and implemented in the three different Monte Carlo codes, in photon and neutron transport simulations. Subsequently, for each case, the initialization time, calculation time, and memory usage were measured as a function of the number of threads used in the simulation. It was found that for all codes, the initialization time significantly increased with the complexity of the phantom, but not with the number of threads. Geant4 exhibited much longer initialization time than the other codes, especially for the complex phantom (MRCP). The improvement of computation speed due to the use of a multi-threaded code was calculated as the speed-up factor, the ratio of the computation speed on a multi-threaded code to the computation speed on a single-threaded code. Geant4 showed the best multi-threading performance among the codes considered in this study, with the speed-up factor almost linearly increasing with the number of threads, reaching ~30 when N = 40. PHITS and MCNP6 showed a much smaller increase of the speed-up factor with the number of threads. For PHITS, the speed-up factors were low when N = 40. For MCNP6, the increase of the speed-up factors was better, but they were still less than ~10 when N = 40. As for memory usage, Geant4 was found to use more memory than the other codes. In addition, compared to that of the other codes, the memory usage of Geant4 more rapidly increased with the number of threads, reaching as high as ~74 GB when N = 40 for the complex phantom (MRCP). It is notable that compared to that of the other codes, the memory usage of PHITS was much lower, regardless of both the complexity of the phantom and the number of threads, hardly increasing with the number of threads for the MRCP.
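The speed-up factor defined in this abstract can be worked through with a short Python sketch. For a fixed workload, computation speed is inversely proportional to run time, so the ratio of speeds equals the single-threaded time divided by the multi-threaded time. The function names and the timings below are hypothetical, chosen only so that the result reproduces the reported speed-up of about 30 at N = 40; parallel efficiency (speed-up divided by N) is a standard derived measure that the abstract itself does not report.

```python
def speed_up(single_thread_time, multi_thread_time):
    """Speed-up factor for the same workload: t(1 thread) / t(N threads)."""
    return single_thread_time / multi_thread_time

def efficiency(speed_up_factor, n_threads):
    """Parallel efficiency: the fraction of ideal linear scaling achieved."""
    return speed_up_factor / n_threads

# Hypothetical timings in seconds, not values from the study.
s = speed_up(1200.0, 40.0)
e = efficiency(s, 40)
print(s, e)
```

Under these assumed timings, a speed-up of 30 on 40 threads corresponds to an efficiency of 0.75, whereas a speed-up below 10 at N = 40 would correspond to an efficiency below 0.25.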
