multicore high performance: Topics by Science.gov

Sample records for multicore high performance

Multi-Core Processor Memory Contention Benchmark Analysis Case Study

NASA Technical Reports Server (NTRS)

Simon, Tyler; McGalliard, James

2009-01-01

Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
Case for a field-programmable gate array multicore hybrid machine for an image-processing application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

2011-01-01

General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
Enabling Future Robotic Missions with Multicore Processors

NASA Technical Reports Server (NTRS)

Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

2011-01-01

Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key artchitecural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.
Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

PubMed Central

Kim, Deokho; Park, Karam; Ro, Won W.

2011-01-01

While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores would be used as processing nodes of the wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for the heterogenous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors referred to as the Cell Broadband Engine. PMID:22164053
Multi-core processing and scheduling performance in CMS

NASA Astrophysics Data System (ADS)

Hernández, J. M.; Evans, D.; Foulkes, S.

2012-12-01

Commodity hardware is going many-core. We might soon not be able to satisfy the job memory needs per core in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware not sharing resources might significantly affect processing performance. It will be essential to effectively utilize the multi-core architecture. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such us the code libraries, detector geometry and conditions data, resulting in a much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model in computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-core aware jobs require the scheduling of multiples cores simultaneously. CMS is exploring the approach of using whole nodes as unit in the workload management system where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of the data/workflow management (e.g. I/O caching, local merging) but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been setup at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present the evaluation of the performance scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.
Neural simulations on multi-core architectures.

PubMed

Eichner, Hubert; Klug, Tobias; Borst, Alexander

2009-01-01

Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing.
Neural Simulations on Multi-Core Architectures

PubMed Central

Eichner, Hubert; Klug, Tobias; Borst, Alexander

2009-01-01

Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing. PMID:19636393
Efficient provisioning for multi-core applications with LSF

NASA Astrophysics Data System (ADS)

Dal Pra, Stefano

2015-12-01

Tier-1 sites providing computing power for HEP experiments are usually tightly designed for high throughput performances. This is pursued by reducing the variety of supported use cases and tuning for performances those ones, the most important of which have been that of singlecore jobs. Moreover, the usual workload is saturation: each available core in the farm is in use and there are queued jobs waiting for their turn to run. Enabling multi-core jobs thus requires dedicating a number of hosts where to run, and waiting for them to free the needed number of cores. This drain-time introduces a loss of computing power driven by the number of unusable empty cores. As an increasing demand for multi-core capable resources have emerged, a Task Force have been constituted in WLCG, with the goal to define a simple and efficient multi-core resource provisioning model. This paper details the work done at the INFN Tier-1 to enable multi-core support for the LSF batch system, with the intent of reducing to the minimum the average number of unused cores. The adopted strategy has been that of dedicating to multi-core a dynamic set of nodes, whose dimension is mainly driven by the number of pending multi-core requests and fair-share priority of the submitting user. The node status transition, from single to multi core et vice versa, is driven by a finite state machine which is implemented in a custom multi-core director script, running in the cluster. After describing and motivating both the implementation and the details specific to the LSF batch system, results about performance are reported. Factors having positive and negative impact on the overall efficiency are discussed and solutions to reduce at most the negative ones are proposed.
Multi-core processing and scheduling performance in CMS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hernandez, J. M.; Evans, D.; Foulkes, S.

2012-01-01

Commodity hardware is going many-core. We might soon not be able to satisfy the job memory needs per core in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware not sharing resources might significantly affect processing performance. It will be essential to effectively utilize the multi-core architecture. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such us the code libraries, detector geometry and conditions data, resultingmore » in a much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model in computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-core aware jobs require the scheduling of multiples cores simultaneously. CMS is exploring the approach of using whole nodes as unit in the workload management system where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of the data/workflow management (e.g. I/O caching, local merging) but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been setup at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present the evaluation of the performance scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.« less
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Oliker, Leonid; Vuduc, Richard

2007-01-01

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientificmore » study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.« less
Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

NASA Astrophysics Data System (ADS)

Hayashi, Akihiro; Wada, Yasutaka; Watanabe, Takeshi; Sekiguchi, Takeshi; Mase, Masayoshi; Shirako, Jun; Kimura, Keiji; Kasahara, Hironori

Heterogeneous multicores have been attracting much attention to attain high performance keeping power consumption low in wide spread of areas. However, heterogeneous multicores force programmers very difficult programming. The long application program development period lowers product competitiveness. In order to overcome such a situation, this paper proposes a compilation framework which bridges a gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on OSCAR compiler. It realizes coarse grain task parallel processing, data transfer using a DMA controller, power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates processing performance and the power reduction by the proposed framework on a newly developed 15 core heterogeneous multicore chip named RP-X integrating 8 general purpose processor cores and 3 types of accelerator cores which was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups up to 32x for an optical flow program with eight general purpose processor cores and four DRP(Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core and 80% of power reduction for the real-time AAC encoding.
Using Multi-Core Systems for Rover Autonomy

NASA Technical Reports Server (NTRS)

Clement, Brad; Estlin, Tara; Bornstein, Benjamin; Springer, Paul; Anderson, Robert C.

2010-01-01

Task Objectives are: (1) Develop and demonstrate key capabilities for rover long-range science operations using multi-core computing, (a) Adapt three rover technologies to execute on SOA multi-core processor (b) Illustrate performance improvements achieved (c) Demonstrate adapted capabilities with rover hardware, (2) Targeting three high-level autonomy technologies (a) Two for onboard data analysis (b) One for onboard command sequencing/planning, (3) Technologies identified as enabling for future missions, (4)Benefits will be measured along several metrics: (a) Execution time / Power requirements (b) Number of data products processed per unit time (c) Solution quality
Evolution of CMS workload management towards multicore job support

NASA Astrophysics Data System (ADS)

Pérez-Calero Yzquierdo, A.; Hernández, J. M.; Khan, F. A.; Letts, J.; Majewski, K.; Rodrigues, A. M.; McCrea, A.; Vaandering, E.

2015-12-01

The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting singlecore processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 of the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.
Evolution of CMS Workload Management Towards Multicore Job Support

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perez-Calero Yzquierdo, A.; Hernández, J. M.; Khan, F. A.

The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting singlecore processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single andmore » multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 of the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.« less
Evaluation of SuperLU on multicore architectures

NASA Astrophysics Data System (ADS)

Li, X. S.

2008-07-01

The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate performance of different high-performance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We included both Pthreads and MPI implementations in this study and found that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior.
Multicore Challenges and Benefits for High Performance Scientific Computing

DOE PAGES

Nielsen, Ida M. B.; Janssen, Curtis L.

2008-01-01

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexitymore » of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.« less
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Oliker, Leonid; Vuduc, Richard

2008-10-16

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific-optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one ofmore » the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.« less
PERI - Auto-tuning Memory Intensive Kernels for Multicore

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bailey, David H; Williams, Samuel; Datta, Kaushik

2008-06-24

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we developmore » a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4X improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.« less
Interactive high-resolution isosurface ray casting on multicore processors.

PubMed

Wang, Qin; JaJa, Joseph

2008-01-01

We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024(2) screen for all the datasets tested up to the maximum size of the main memory of our platform.
Aho-Corasick String Matching on Shared and Distributed Memory Parallel Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications string matching requires a combination of (sometimes all) the following characteristics: high and/or predictable performance, support for large data sets and flexibility of integration and customization. Many software based implementations targeting conventional cache-based microprocessors fail to achieve high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize.more » The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs and Cray XMT), distributed memory with homogeneous processing elements (InfiniBand cluster of x86 multicores) and heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C10 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.« less

A high performance load balance strategy for real-time multicore systems.

PubMed

Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

2014-01-01

Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can greatly reduce energy consumption by up to 54.2% and the deadline times missed, as compared to the other scheduling algorithms outlined in this paper.
A High Performance Load Balance Strategy for Real-Time Multicore Systems

PubMed Central

Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

2014-01-01

Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can greatly reduce energy consumption by up to 54.2% and the deadline times missed, as compared to the other scheduling algorithms outlined in this paper. PMID:24955382
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

PubMed

Furuta, Takuya; Sato, Tatsuhiko

2015-01-01

Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Carter, Jonathan; Oliker, Leonid

2008-02-01

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHDmore » for each system, we develop a code generator that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.« less
Lattice Boltzmann simulation optimization on leading multicore platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, S.; Carter, J.; Oliker, L.

2008-01-01

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of searchbased performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHDmore » for each system, we develop a code generator that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our autotuned LBMHD application achieves up to a 14 improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.« less
Toward performance portability of the Albany finite element analysis code using the Kokkos library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We presentmore » performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA General Processing Units (GPU’s), Intel Xeon Phis, and multicore CPUs.« less
Toward performance portability of the Albany finite element analysis code using the Kokkos library

DOE PAGES

Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.; ...

2018-02-05

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We presentmore » performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA General Processing Units (GPU’s), Intel Xeon Phis, and multicore CPUs.« less
Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Su, Chun-Yi

By 2004, microprocessor design focused on multicore scaling—increasing the number of cores per die in each generation—as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitivemore » or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrent throttling and a novel thread mapping algorithm to manage thread resources and improve energy efficient execution in multi-core, NUMA systems.« less
On the Performance of an Algebraic MultigridSolver on Multicore Clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Baker, A H; Schulz, M; Yang, U M

2010-04-29

Algebraic multigrid (AMG) solvers have proven to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore cluster architectures, we face new challenges that can significantly harm AMG's performance. We discuss our experiences on such an architecture and present a set of techniques that help users to overcome the associated problems, including thread and process pinning and correct memory associations. We have implemented most of the techniques in a MultiCore SUPport library (MCSup), which helps to map OpenMP applications to multicore machines. We present results using both an MPI-only and a hybrid MPI/OpenMP model.
Optimization of a Lattice Boltzmann Computation on State-of-the-Art Multicore Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Carter, Jonathan; Oliker, Leonid

2009-04-10

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well asmore » a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15x improvement compared with the original code at a given concurrency. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.« less
CQPSO scheduling algorithm for heterogeneous multi-core DAG task model

NASA Astrophysics Data System (ADS)

Zhai, Wenzheng; Hu, Yue-Li; Ran, Feng

2017-07-01

Efficient task scheduling is critical to achieve high performance in a heterogeneous multi-core computing environment. The paper focuses on the heterogeneous multi-core directed acyclic graph (DAG) task model and proposes a novel task scheduling method based on an improved chaotic quantum-behaved particle swarm optimization (CQPSO) algorithm. A task priority scheduling list was built. A processor with minimum cumulative earliest finish time (EFT) was acted as the object of the first task assignment. The task precedence relationships were satisfied and the total execution time of all tasks was minimized. The experimental results show that the proposed algorithm has the advantage of optimization abilities, simple and feasible, fast convergence, and can be applied to the task scheduling optimization for other heterogeneous and distributed environment.
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Parameters that affect parallel processing for computational electromagnetic simulation codes on high performance computing clusters

NASA Astrophysics Data System (ADS)

Moon, Hongsik

What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited by the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared with the performance using benchmark software and the metric was FLoting-point Operations Per Seconds (FLOPS) which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore system? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPs and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the changing computer hardware platforms in order to provide fast, accurate and efficient solutions to large, complex electromagnetic problems. The research in this dissertation proves that the performance of parallel code is intimately related to the configuration of the computer hardware and can be maximized for different hardware platforms. To benchmark and optimize the performance of parallel CEM software, a variety of large, complex projects are created and executed on a variety of computer platforms. The computer platforms used in this research are detailed in this dissertation. The projects run as benchmarks are also described in detail and results are presented. The parameters that affect parallel CEM software on High Performance Computing Clusters (HPCC) are investigated. This research demonstrates methods to maximize the performance of parallel CEM software code.
MIC-SVM: Designing A Highly Efficient Support Vector Machine For Advanced Modern Multi-Core and Many-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Song, Shuaiwen; Fu, Haohuan

2014-08-16

Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach an increasing importance to the analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. To address the challenges above, we designed and implemented MICSVM, a highly efficient parallel SVM for x86 based multi-core and many core architectures,more » such as the Intel Ivy Bridge CPUs and Intel Xeon Phi coprocessor (MIC).« less
Using a Multicore Processor for Rover Autonomous Science

NASA Technical Reports Server (NTRS)

Bornstein, Benjamin; Estlin, Tara; Clement, Bradley; Springer, Paul

2011-01-01

Multicore processing promises to be a critical component of future spacecraft. It provides immense increases in onboard processing power and provides an environment for directly supporting fault-tolerant computing. This paper discusses using a state-of-the-art multicore processor to efficiently perform image analysis onboard a Mars rover in support of autonomous science activities.
Progress Towards a Rad-Hydro Code for Modern Computing Architectures LA-UR-10-02825

NASA Astrophysics Data System (ADS)

Wohlbier, J. G.; Lowrie, R. B.; Bergen, B.; Calef, M.

2010-11-01

We are entering an era of high performance computing where data movement is the overwhelming bottleneck to scalable performance, as opposed to the speed of floating-point operations per processor. All multi-core hardware paradigms, whether heterogeneous or homogeneous, be it the Cell processor, GPGPU, or multi-core x86, share this common trait. In multi-physics applications such as inertial confinement fusion or astrophysics, one may be solving multi-material hydrodynamics with tabular equation of state data lookups, radiation transport, nuclear reactions, and charged particle transport in a single time cycle. The algorithms are intensely data dependent, e.g., EOS, opacity, nuclear data, and multi-core hardware memory restrictions are forcing code developers to rethink code and algorithm design. For the past two years LANL has been funding a small effort referred to as Multi-Physics on Multi-Core to explore ideas for code design as pertaining to inertial confinement fusion and astrophysics applications. The near term goals of this project are to have a multi-material radiation hydrodynamics capability, with tabular equation of state lookups, on cartesian and curvilinear block structured meshes. In the longer term we plan to add fully implicit multi-group radiation diffusion and material heat conduction, and block structured AMR. We will report on our progress to date.
Implementing an Affordable High-Performance Computing for Teaching-Oriented Computer Science Curriculum

ERIC Educational Resources Information Center

Abuzaghleh, Omar; Goldschmidt, Kathleen; Elleithy, Yasser; Lee, Jeongkyu

2013-01-01

With the advances in computing power, high-performance computing (HPC) platforms have had an impact on not only scientific research in advanced organizations but also computer science curriculum in the educational community. For example, multicore programming and parallel systems are highly desired courses in the computer science major. However,…
OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

NASA Astrophysics Data System (ADS)

Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori

OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in an METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with OSCAR API can be compiled by the ordinary OpenMP compilers since the OSCAR API is designed on a subset of the OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of scalability evaluation, it is found that on an average, the OSCAR compiler with the OSCAR API can exploit 5.8 times speedup over the sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores, respectively. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.
Scheduling multicore workload on shared multipurpose clusters

NASA Astrophysics Data System (ADS)

Templon, J. A.; Acosta-Silva, C.; Flix Molina, J.; Forti, A. C.; Pérez-Calero Yzquierdo, A.; Starink, R.

2015-12-01

With the advent of workloads containing explicit requests for multiple cores in a single grid job, grid sites faced a new set of challenges in workload scheduling. The most common batch schedulers deployed at HEP computing sites do a poor job at multicore scheduling when using only the native capabilities of those schedulers. This paper describes how efficient multicore scheduling was achieved at the sites the authors represent, by implementing dynamically-sized multicore partitions via a minimalistic addition to the Torque/Maui batch system already in use at those sites. The paper further includes example results from use of the system in production, as well as measurements on the dependence of performance (especially the ramp-up in throughput for multicore jobs) on node size and job size.
Multicore-based 3D-DWT video encoder

NASA Astrophysics Data System (ADS)

Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Migallón, Hector

2013-12-01

Three-dimensional wavelet transform (3D-DWT) encoders are good candidates for applications like professional video editing, video surveillance, multi-spectral satellite imaging, etc. where a frame must be reconstructed as quickly as possible. In this paper, we present a new 3D-DWT video encoder based on a fast run-length coding engine. Furthermore, we present several multicore optimizations to speed-up the 3D-DWT computation. An exhaustive evaluation of the proposed encoder (3D-GOP-RL) has been performed, and we have compared the evaluation results with other video encoders in terms of rate/distortion (R/D), coding/decoding delay, and memory consumption. Results show that the proposed encoder obtains good R/D results for high-resolution video sequences with nearly in-place computation using only the memory needed to store a group of pictures. After applying the multicore optimization strategies over the 3D DWT, the proposed encoder is able to compress a full high-definition video sequence in real-time.

Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors

PubMed Central

Lee, Sungju; Kim, Heegon; Chung, Yongwha; Park, Daihee

2012-01-01

In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining the image/video quality. We improve the compression efficiency at the algorithmic level or derive the optimal parameters for the combination of a machine and compression based on the tradeoff between the energy consumption and the image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2∼5 without compromising image/video quality. PMID:23202181
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on state of the art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand of compute time, memory, storage and I/O along with the need of their effective management. The most resource intensive modules of the algorithm are traveltime calculations and migration summation which exhibit an inherent trade off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel system had been developed. Recently, we have worked on improving parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable to efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
A Bandwidth-Optimized Multi-Core Architecture for Irregular Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Secchi, Simone; Tumeo, Antonino; Villa, Oreste

This paper presents an architecture template for next-generation high performance computing systems specifically targeted to irregular applications. We start our work by considering that future generation interconnection and memory bandwidth full-system numbers are expected to grow by a factor of 10. In order to keep up with such a communication capacity, while still resorting to fine-grained multithreading as the main way to tolerate unpredictable memory access latencies of irregular applications, we show how overall performance scaling can benefit from the multi-core paradigm. At the same time, we also show how such an architecture template must be coupled with specific techniquesmore » in order to optimize bandwidth utilization and achieve the maximum scalability. We propose a technique based on memory references aggregation, together with the related hardware implementation, as one of such optimization techniques. We explore the proposed architecture template by focusing on the Cray XMT architecture and, using a dedicated simulation infrastructure, validate the performance of our template with two typical irregular applications. Our experimental results prove the benefits provided by both the multi-core approach and the bandwidth optimization reference aggregation technique.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.

The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broadmore » range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.« less
A highly efficient multi-core algorithm for clustering extremely large datasets

PubMed Central

2010-01-01

Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
Compiler-Driven Performance Optimization and Tuning for Multicore Architectures

DTIC Science & Technology

2015-04-10

develop a powerful system for auto-tuning of library routines and compute-intensive kernels, driven by the Pluto system for multicores that we are...kernels, driven by the Pluto system for multicores that we are developing. The work here is motivated by recent advances in two major areas of...automatic C-to-CUDA code generator using a polyhedral compiler transformation framework. We have used and adapted PLUTO (our state-of-the-art tool
Second generation OH suppression filters using multicore fibers

NASA Astrophysics Data System (ADS)

Haynes, R.; Birks, T. A.; Bland-Hawthorn, J.; Cruz, J. L.; Diez, A.; Ellis, S. C.; Haynes, D.; Krämer, R. G.; Mangan, B. J.; Min, S.; Murphy, D. F.; Nolte, S.; Olaya, J. C.; Thomas, J. U.; Trinh, C. Q.; Tünnermann, A.; Voigtländer, Christian

2012-09-01

Ground based near-infrared observations have long been plagued by poor sensitivity when compared to visible observations as a result of the bright narrow line emission from atmospheric OH molecules. The GNOSIS instrument recently commissioned at the Australian Astronomical Observatory uses Photonic Lanterns in combination with individually printed single mode fibre Bragg gratings to filter out the brightest OH-emission lines between 1.47 and 1.70μm. GNOSIS, reported in a separate paper in this conference, demonstrates excellent OH-suppression, providing very “clean” filtering of the lines. It represents a major step forward in the goal to improve the sensitivity of ground based near-infrared observation to that possible at visible wavelengths, however, the filter units are relatively bulky and costly to produce. The 2nd generation fibre OH-Suppression filters based on multicore fibres are currently under development. The development aims to produce high quality, cost effective, compact and robust OH-Suppression units in a single optical fibre with numerous isolated single mode cores that replicate the function and performance of the current generation of “conventional” photonic lantern based devices. In this paper we present the early results from the multicore fibre development and multicore fibre Bragg grating imprinting process.
All-fiber intensity bend sensor based on photonic crystal fiber with asymmetric air-hole structure

NASA Astrophysics Data System (ADS)

Budnicki, Dawid; Szostkiewicz, Lukasz; Szymanski, Michal O.; Ostrowski, Lukasz; Holdynski, Zbigniew; Lipinski, Stanislaw; Murawski, Michal; Wojcik, Grzegorz; Makara, Mariusz; Poturaj, Krzysztof; Mergo, Pawel; Napierala, Marek; Nasilowski, Tomasz

2017-10-01

Monitoring the geometry of an moving element is a crucial task for example in robotics. The robots equipped with fiber bend sensor integrated in their arms can be a promising solution for medicine, physiotherapy and also for application in computer games. We report an all-fiber intensity bend sensor, which is based on microstructured multicore optical fiber. It allows to perform a measurement of the bending radius as well as the bending orientation. The reported solution has a special airhole structure which makes the sensor only bend-sensitive. Our solution is an intensity based sensor, which measures power transmitted along the fiber, influenced by bend. The sensor is based on a multicore fiber with the special air-hole structure that allows detection of bending orientation in range of 360°. Each core in the multicore fiber is sensitive to bend in specified direction. The principle behind sensor operation is to differentiate the confinement loss of fundamental mode propagating in each core. Thanks to received power differences one can distinguish not only bend direction but also its amplitude. Multicore fiber is designed to utilize most common light sources that operate at 1.55 μm thus ensuring high stability of operation. The sensitivity of the proposed solution is equal 29,4 dB/cm and the accuracy of bend direction for the fiber end point is up to 5 degrees for 15 cm fiber length. Such sensitivity allows to perform end point detection with millimeter precision.
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

The real-time processing is getting more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool. The watershed transform is a very data intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on the survey of the approaches for parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processor with shared memory. To achieve an efficient parallel implementation, it's necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelization of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare the OpenMP (an application programming interface for multi-Processing) with Ptheads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Results of SEI Independent Research and Development Projects

DTIC Science & Technology

2009-12-01

Achieving Predictable Performance in Multicore Embedded Real - Time Systems Dionisio de Niz, Jeffrey Hansen, Gabriel Moreno, Daniel Plakosh, Jorgen Hanson...Description Languages.‖ Fourth Congress on Embedded Real - Time Systems (ERTS), January 2008. [Hansson 2008b] J. Hansson, P. H. Feiler, & J. Morley...Predictable Performance in Multicore Embedded Real - Time Systems Dionisio de Niz, Jeffrey Hansen, Gabriel Moreno, Daniel Plakosh, Jorgen Hanson, Mark
Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

DOE PAGES

Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.; ...

2011-03-02

The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broadmore » range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.« less
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

DTIC Science & Technology

2009-09-01

TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes... 4 3. INFORMATION MANAGEMENT FOR PARALLELIZATION AND...STREAMING............................................................. 7 4 . RESULTS
Efficiently Scheduling Multi-core Guest Virtual Machines on Multi-core Hosts in Network Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoginath, Srikanth B; Perumalla, Kalyan S

2011-01-01

Virtual machine (VM)-based simulation is a method used by network simulators to incorporate realistic application behaviors by executing actual VMs as high-fidelity surrogates for simulated end-hosts. A critical requirement in such a method is the simulation time-ordered scheduling and execution of the VMs. Prior approaches such as time dilation are less efficient due to the high degree of multiplexing possible when multiple multi-core VMs are simulated on multi-core host systems. We present a new simulation time-ordered scheduler to efficiently schedule multi-core VMs on multi-core real hosts, with a virtual clock realized on each virtual core. The distinguishing features of ourmore » approach are: (1) customizable granularity of the VM scheduling time unit on the simulation time axis, (2) ability to take arbitrary leaps in virtual time by VMs to maximize the utilization of host (real) cores when guest virtual cores idle, and (3) empirically determinable optimality in the tradeoff between total execution (real) time and time-ordering accuracy levels. Experiments show that it is possible to get nearly perfect time-ordered execution, with a slight cost in total run time, relative to optimized non-simulation VM schedulers. Interestingly, with our time-ordered scheduler, it is also possible to reduce the time-ordering error from over 50% of non-simulation scheduler to less than 1% realized by our scheduler, with almost the same run time efficiency as that of the highly efficient non-simulation VM schedulers.« less
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Printed freeform lens arrays on multi-core fibers for highly efficient coupling in astrophotonic systems.

PubMed

Dietrich, Philipp-Immanuel; Harris, Robert J; Blaicher, Matthias; Corrigan, Mark K; Morris, Tim M; Freude, Wolfgang; Quirrenbach, Andreas; Koos, Christian

2017-07-24

Coupling of light into multi-core fibers (MCF) for spatially resolved spectroscopy is of great importance to astronomical instrumentation. To achieve high coupling efficiencies along with fill-fractions close to unity, micro-optical elements are required to concentrate the incoming light to the individual cores of the MCF. In this paper we demonstrate facet-attached lens arrays (LA) fabricated by two-photon polymerization. The LA provide close to 100% fill-fraction along with efficiencies of up to 73% (down to 1.4 dB loss) for coupling of light from free space into an MCF core. We show the viability of the concept for astrophotonic applications by integrating an MCF-LA assembly in an adaptive-optics test bed and by assessing its performance as a tip/tilt sensor.
Optimizing the Performance of Reactive Molecular Dynamics Simulations for Multi-core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aktulga, Hasan Metin; Coffman, Paul; Shan, Tzu-Ray

2015-12-01

Hybrid parallelism allows high performance computing applications to better leverage the increasing on-node parallelism of modern supercomputers. In this paper, we present a hybrid parallel implementation of the widely used LAMMPS/ReaxC package, where the construction of bonded and nonbonded lists and evaluation of complex ReaxFF interactions are implemented efficiently using OpenMP parallelism. Additionally, the performance of the QEq charge equilibration scheme is examined and a dual-solver is implemented. We present the performance of the resulting ReaxC-OMP package on a state-of-the-art multi-core architecture Mira, an IBM BlueGene/Q supercomputer. For system sizes ranging from 32 thousand to 16.6 million particles, speedups inmore » the range of 1.5-4.5x are observed using the new ReaxC-OMP software. Sustained performance improvements have been observed for up to 262,144 cores (1,048,576 processes) of Mira with a weak scaling efficiency of 91.5% in larger simulations containing 16.6 million particles.« less
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
Electrosprayed Multi-Core Alginate Microcapsules as Novel Self-Healing Containers

NASA Astrophysics Data System (ADS)

Hia, Iee Lee; Pasbakhsh, Pooria; Chan, Eng-Seng; Chai, Siang-Piao

2016-10-01

Alginate microcapsules containing epoxy resin were developed through electrospraying method and embedded into epoxy matrix to produce a capsule-based self-healing composite system. These formaldehyde free alginate/epoxy microcapsules were characterized via light microscope, field emission scanning electron microscope, fourier transform infrared spectroscopy and thermogravimetric analysis. Results showed that epoxy resin was successfully encapsulated within alginate matrix to form porous (multi-core) microcapsules with pore size ranged from 5-100 μm. The microcapsules had an average size of 320 ± 20 μm with decomposition temperature at 220 °C. The loading capacity of these capsules was estimated to be 79%. Under in situ healing test, impact specimens showed healing efficiency as high as 86% and the ability to heal up to 3 times due to the multi-core capsule structure and the high impact energy test that triggered the released of epoxy especially in the second and third healings. TDCB specimens showed one-time healing only with the highest healing efficiency of 76%. The single healing event was attributed by the constant crack propagation rate of TDCB fracture test. For the first time, a cost effective, environmentally benign and sustainable capsule-based self-healing system with multiple healing capabilities and high healing performance was developed.
Electrosprayed Multi-Core Alginate Microcapsules as Novel Self-Healing Containers.

PubMed

Hia, Iee Lee; Pasbakhsh, Pooria; Chan, Eng-Seng; Chai, Siang-Piao

2016-10-03

Alginate microcapsules containing epoxy resin were developed through electrospraying method and embedded into epoxy matrix to produce a capsule-based self-healing composite system. These formaldehyde free alginate/epoxy microcapsules were characterized via light microscope, field emission scanning electron microscope, fourier transform infrared spectroscopy and thermogravimetric analysis. Results showed that epoxy resin was successfully encapsulated within alginate matrix to form porous (multi-core) microcapsules with pore size ranged from 5-100 μm. The microcapsules had an average size of 320 ± 20 μm with decomposition temperature at 220 °C. The loading capacity of these capsules was estimated to be 79%. Under in situ healing test, impact specimens showed healing efficiency as high as 86% and the ability to heal up to 3 times due to the multi-core capsule structure and the high impact energy test that triggered the released of epoxy especially in the second and third healings. TDCB specimens showed one-time healing only with the highest healing efficiency of 76%. The single healing event was attributed by the constant crack propagation rate of TDCB fracture test. For the first time, a cost effective, environmentally benign and sustainable capsule-based self-healing system with multiple healing capabilities and high healing performance was developed.
Electrosprayed Multi-Core Alginate Microcapsules as Novel Self-Healing Containers

PubMed Central

Hia, Iee Lee; Pasbakhsh, Pooria; Chan, Eng-Seng; Chai, Siang-Piao

2016-01-01

Alginate microcapsules containing epoxy resin were developed through electrospraying method and embedded into epoxy matrix to produce a capsule-based self-healing composite system. These formaldehyde free alginate/epoxy microcapsules were characterized via light microscope, field emission scanning electron microscope, fourier transform infrared spectroscopy and thermogravimetric analysis. Results showed that epoxy resin was successfully encapsulated within alginate matrix to form porous (multi-core) microcapsules with pore size ranged from 5–100 μm. The microcapsules had an average size of 320 ± 20 μm with decomposition temperature at 220 °C. The loading capacity of these capsules was estimated to be 79%. Under in situ healing test, impact specimens showed healing efficiency as high as 86% and the ability to heal up to 3 times due to the multi-core capsule structure and the high impact energy test that triggered the released of epoxy especially in the second and third healings. TDCB specimens showed one-time healing only with the highest healing efficiency of 76%. The single healing event was attributed by the constant crack propagation rate of TDCB fracture test. For the first time, a cost effective, environmentally benign and sustainable capsule-based self-healing system with multiple healing capabilities and high healing performance was developed. PMID:27694922

Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2014-07-18

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time time-consuming, which greatly limits application’s performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels.more » For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.« less
Analysis of scalability of high-performance 3D image processing platform for virtual colonoscopy

NASA Astrophysics Data System (ADS)

Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli

2014-03-01

One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. For this purpose, we previously developed a software platform for high-performance 3D medical image processing, called HPC 3D-MIP platform, which employs increasingly available and affordable commodity computing systems such as the multicore, cluster, and cloud computing systems. To achieve scalable high-performance computing, the platform employed size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D-MIP algorithms, supported task scheduling for efficient load distribution and balancing, and consisted of a layered parallel software libraries that allow image processing applications to share the common functionalities. We evaluated the performance of the HPC 3D-MIP platform by applying it to computationally intensive processes in virtual colonoscopy. Experimental results showed a 12-fold performance improvement on a workstation with 12-core CPUs over the original sequential implementation of the processes, indicating the efficiency of the platform. Analysis of performance scalability based on the Amdahl's law for symmetric multicore chips showed the potential of a high performance scalability of the HPC 3DMIP platform when a larger number of cores is available.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets.

PubMed

Scharfe, Michael; Pielot, Rainer; Schreiber, Falk

2010-01-11

Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.
VLBI-resolution radio-map algorithms: Performance analysis of different levels of data-sharing on multi-socket, multi-core architectures

NASA Astrophysics Data System (ADS)

Tabik, S.; Romero, L. F.; Mimica, P.; Plata, O.; Zapata, E. L.

2012-09-01

A broad area in astronomy focuses on simulating extragalactic objects based on Very Long Baseline Interferometry (VLBI) radio-maps. Several algorithms in this scope simulate what would be the observed radio-maps if emitted from a predefined extragalactic object. This work analyzes the performance and scaling of this kind of algorithms on multi-socket, multi-core architectures. In particular, we evaluate a sharing approach, a privatizing approach and a hybrid approach on systems with complex memory hierarchy that includes shared Last Level Cache (LLC). In addition, we investigate which manual processes can be systematized and then automated in future works. The experiments show that the data-privatizing model scales efficiently on medium scale multi-socket, multi-core systems (up to 48 cores) while regardless of algorithmic and scheduling optimizations, the sharing approach is unable to reach acceptable scalability on more than one socket. However, the hybrid model with a specific level of data-sharing provides the best scalability over all used multi-socket, multi-core systems.
Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

NASA Astrophysics Data System (ADS)

Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

2017-10-01

In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.
Hardware design and implementation of fast DOA estimation method based on multicore DSP

NASA Astrophysics Data System (ADS)

Guo, Rui; Zhao, Yingxiao; Zhang, Yue; Lin, Qianqiang; Chen, Zengping

2016-10-01

In this paper, we present a high-speed real-time signal processing hardware platform based on multicore digital signal processor (DSP). The real-time signal processing platform shows several excellent characteristics including high performance computing, low power consumption, large-capacity data storage and high speed data transmission, which make it able to meet the constraint of real-time direction of arrival (DOA) estimation. To reduce the high computational complexity of DOA estimation algorithm, a novel real-valued MUSIC estimator is used. The algorithm is decomposed into several independent steps and the time consumption of each step is counted. Based on the statistics of the time consumption, we present a new parallel processing strategy to distribute the task of DOA estimation to different cores of the real-time signal processing hardware platform. Experimental results demonstrate that the high processing capability of the signal processing platform meets the constraint of real-time direction of arrival (DOA) estimation.
Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems

DTIC Science & Technology

2013-12-01

Loss Spectra” Proceedings of SPIE 8255, (2012) and in a journal publication: M. T. Crowley, D. Murrell, N. Patel, M. Breivik , C.-Y. Lin, Y. Li, B.-O...Crowley, D. Murrell, N. Patel, M. Breivik , C.-Y. Lin, Y. Li, B.-O. Fimland and L. F. Lester, "Analytical Modeling of the Temperature Performance of
HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

DOE PAGES

Dongarra, Jack; Gates, Mark; Haidar, Azzam; ...

2015-01-01

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library, that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA.more » High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.« less
Efficiency of static core turn-off in a system-on-a-chip with variation

DOEpatents

Cher, Chen-Yong; Coteus, Paul W; Gara, Alan; Kursun, Eren; Paulsen, David P; Schuelke, Brian A; Sheets, II, John E; Tian, Shurong

2013-10-29

A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting via a simulation a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's design stage includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's testing stage includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine if the first output is referring to the same core to turn off as the second output; outputting a third output corresponding to the first multi-core processor core if the first output and the second output are both referring to the same core to turn off.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

PubMed Central

2010-01-01

Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics. PMID:20064262
Micromagnetics on high-performance workstation and mobile computational platforms

NASA Astrophysics Data System (ADS)

Fu, S.; Chang, R.; Couture, S.; Menarini, M.; Escobar, M. A.; Kuteifan, M.; Lubarda, M.; Gabay, D.; Lomakin, V.

2015-05-01

The feasibility of using high-performance desktop and embedded mobile computational platforms is presented, including multi-core Intel central processing unit, Nvidia desktop graphics processing units, and Nvidia Jetson TK1 Platform. FastMag finite element method-based micromagnetic simulator is used as a testbed, showing high efficiency on all the platforms. Optimization aspects of improving the performance of the mobile systems are discussed. The high performance, low cost, low power consumption, and rapid performance increase of the embedded mobile systems make them a promising candidate for micromagnetic simulations. Such architectures can be used as standalone systems or can be built as low-power computing clusters.
Performance implications from sizing a VM on multi-core systems: A Data analytic application s view

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lim, Seung-Hwan; Horey, James L; Begoli, Edmon

In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications, such as Cassandra and Hadoop, are becoming increasingly popular on cloud computing platforms. This convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very rst step in hosting applications in virtualized environments, requires the user to con gure the number of virtual processors and the size of memory. To understand performance implications of this step, we benchmarked three Yahoo Cloudmore » Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra for YCSB workloads does not heavily depend on the processing capacity of a system, while the size of the data set is critical to performance relative to allocated memory. We also identi ed a strong relationship between the running time of workloads and various hardware events (last level cache loads, misses, and CPU migrations). From this analysis, we provide several suggestions to improve the performance of data analytics applications running on cloud computing environments.« less
CMS Readiness for Multi-Core Workload Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.

In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides amore » solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.« less
CMS readiness for multi-core workload scheduling

NASA Astrophysics Data System (ADS)

Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.; Aftab Khan, F.; Letts, J.; Mason, D.; Verguilov, V.

2017-10-01

In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.
Interfacial redox reaction-directed synthesis of silver@cerium oxide core-shell nanocomposites as catalysts for rechargeable lithium-air batteries

NASA Astrophysics Data System (ADS)

Liu, Ying; Wang, Man; Cao, Lu-Jie; Yang, Ming-Yang; Ho-Sum Cheng, Samson; Cao, Chen-Wei; Leung, Kwan-Lan; Chung, Chi-Yuen; Lu, Zhou-Guang

2015-07-01

A facile oxidation-reduction reaction method has been implemented to prepare pomegranate-like Ag@CeO2 multicore-shell structured nanocomposites. Under Ar atmosphere, redox reaction automatically occurs between AgNO3 and Ce(NO3)3 in an alkaline solution, where Ag+ is reduced to Ag nanopartilces and Ce3+ is simultaneously oxidized to form CeO2, followed by the self-assembly to form the pomegranate-like multicore-shell structured Ag@CeO2 nanocomposites driven by thermodynamic equilibrium. No other organic amines or surfactants are utilized in the whole reaction system and only NaOH instead of organic reducing agent is used to prevent the introduction of a secondary reducing byproduct. The as-obtained pomegranate-like Ag@CeO2 multicore-shell structured nanocomposites have been characterized as electro-catalysts for the air cathode of lithium-air batteries operated in a simulated air environment. Superior electrochemical performance with high discharge capacity of 3415 mAh g-1 at 100 mA g-1, stable cycling and small charge/discharge polarization voltage is achieved, which is much better than that of the CeO2 or simple mixture of CeO2 and Ag. The enhanced properties can be primarily attributed to the synergy effect between the Ag core and the CeO2 shell resulting from the unique pomegranate-like multicore-shell nanostructures possessing plenty of active sites to promote the facile formation and decomposition of Li2O2.
Electronic Structure Calculations and Adaptation Scheme in Multi-core Computing Environments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Seshagiri, Lakshminarasimhan; Sosonkina, Masha; Zhang, Zhao

2009-05-20

Multi-core processing environments have become the norm in the generic computing environment and are being considered for adding an extra dimension to the execution of any application. The T2 Niagara processor is a very unique environment where it consists of eight cores having a capability of running eight threads simultaneously in each of the cores. Applications like General Atomic and Molecular Electronic Structure (GAMESS), used for ab-initio molecular quantum chemistry calculations, can be good indicators of the performance of such machines and would be a guideline for both hardware designers and application programmers. In this paper we try to benchmarkmore » the GAMESS performance on a T2 Niagara processor for a couple of molecules. We also show the suitability of using a middleware based adaptation algorithm on GAMESS on such a multi-core environment.« less
Exploiting multicore compute resources in the CMS experiment

NASA Astrophysics Data System (ADS)

Ramírez, J. E.; Pérez-Calero Yzquierdo, A.; Hernández, J. M.; CMS Collaboration

2016-10-01

CMS has developed a strategy to efficiently exploit the multicore architecture of the compute resources accessible to the experiment. A coherent use of the multiple cores available in a compute node yields substantial gains in terms of resource utilization. The implemented approach makes use of the multithreading support of the event processing framework and the multicore scheduling capabilities of the resource provisioning system. Multicore slots are acquired and provisioned by means of multicore pilot agents which internally schedule and execute single and multicore payloads. Multicore scheduling and multithreaded processing are currently used in production for online event selection and prompt data reconstruction. More workflows are being adapted to run in multicore mode. This paper presents a review of the experience gained in the deployment and operation of the multicore scheduling and processing system, the current status and future plans.
Coupled-mode propagation in multicore fibers characterized by optical low-coherence reflectometry.

PubMed

Salathé, R P; Gilgen, H; Bodmer, G

1996-07-01

A fiber-optical low-coherence ref lectometer has been used to probe a multicore fiber locally at a wavelength of 1.3 microm. This technique allows one to determine the group index of refraction of the modes in the multicore fiber with high accuracy. Light propagation that is due to noncoherent coupling of energy from one fiber core to adjacent cores through cladding modes can be distinguished quantitatively from light propagating in coherently coupled modes. Intercore coupling constants in the range of 0.6-2 mm(-1) have been evaluated for the coupled modes.
Novel Designs and Coupling Schemes for Affordable High Energy Laser Modules

DTIC Science & Technology

2007-09-28

possibility of single polarization operation of phase- locked multicore fiber lasers and amplifiers. 5.5. UV...transverse direction (propagation and polarization vectors shown as solid arrows and dashed lines, respectively) having a dipole-like wave front from an...31 5.4. Phase Locking in Monolithic Multicore Fiber Laser..................................................... 38 5.5. UV
pyPaSWAS: Python-based multi-core CPU and GPU sequence alignment.

PubMed

Warris, Sven; Timal, N Roshan N; Kempenaar, Marcel; Poortinga, Arne M; van de Geest, Henri; Varbanescu, Ana L; Nap, Jan-Peter

2018-01-01

Our previously published CUDA-only application PaSWAS for Smith-Waterman (SW) sequence alignment of any type of sequence on NVIDIA-based GPUs is platform-specific and therefore adopted less than could be. The OpenCL language is supported more widely and allows use on a variety of hardware platforms. Moreover, there is a need to promote the adoption of parallel computing in bioinformatics by making its use and extension more simple through more and better application of high-level languages commonly used in bioinformatics, such as Python. The novel application pyPaSWAS presents the parallel SW sequence alignment code fully packed in Python. It is a generic SW implementation running on several hardware platforms with multi-core systems and/or GPUs that provides accurate sequence alignments that also can be inspected for alignment details. Additionally, pyPaSWAS support the affine gap penalty. Python libraries are used for automated system configuration, I/O and logging. This way, the Python environment will stimulate further extension and use of pyPaSWAS. pyPaSWAS presents an easy Python-based environment for accurate and retrievable parallel SW sequence alignments on GPUs and multi-core systems. The strategy of integrating Python with high-performance parallel compute languages to create a developer- and user-friendly environment should be considered for other computationally intensive bioinformatics algorithms.

Reconfigurable SDM Switching Using Novel Silicon Photonic Integrated Circuit.

PubMed

Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld; Ye, Feihong; Asif, Rameez; Gross, Simon; Withford, Michael J; Galili, Michael; Morioka, Toshio; Oxenløwe, Leif Katsuo

2016-12-21

Space division multiplexing using multicore fibers is becoming a more and more promising technology. In space-division multiplexing fiber network, the reconfigurable switch is one of the most critical components in network nodes. In this paper we for the first time demonstrate reconfigurable space-division multiplexing switching using silicon photonic integrated circuit, which is fabricated on a novel silicon-on-insulator platform with buried Al mirror. The silicon photonic integrated circuit is composed of a 7 × 7 switch and low loss grating coupler array based multicore fiber couplers. Thanks to the Al mirror, grating couplers with ultra-low coupling loss with optical multicore fibers is achieved. The lowest total insertion loss of the silicon integrated circuit is as low as 4.5 dB, with low crosstalk lower than -30 dB. Excellent performances in terms of low insertion loss and low crosstalk are obtained for the whole C-band. 1 Tb/s/core transmission over a 2-km 7-core fiber and space-division multiplexing switching is demonstrated successfully. Bit error rate performance below 10 -9 is obtained for all spatial channels with low power penalty. The proposed design can be easily upgraded to reconfigurable optical add/drop multiplexer capable of switching several multicore fibers.
Reconfigurable SDM Switching Using Novel Silicon Photonic Integrated Circuit

NASA Astrophysics Data System (ADS)

Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld; Ye, Feihong; Asif, Rameez; Gross, Simon; Withford, Michael J.; Galili, Michael; Morioka, Toshio; Oxenløwe, Leif Katsuo

2016-12-01

Space division multiplexing using multicore fibers is becoming a more and more promising technology. In space-division multiplexing fiber network, the reconfigurable switch is one of the most critical components in network nodes. In this paper we for the first time demonstrate reconfigurable space-division multiplexing switching using silicon photonic integrated circuit, which is fabricated on a novel silicon-on-insulator platform with buried Al mirror. The silicon photonic integrated circuit is composed of a 7 × 7 switch and low loss grating coupler array based multicore fiber couplers. Thanks to the Al mirror, grating couplers with ultra-low coupling loss with optical multicore fibers is achieved. The lowest total insertion loss of the silicon integrated circuit is as low as 4.5 dB, with low crosstalk lower than -30 dB. Excellent performances in terms of low insertion loss and low crosstalk are obtained for the whole C-band. 1 Tb/s/core transmission over a 2-km 7-core fiber and space-division multiplexing switching is demonstrated successfully. Bit error rate performance below 10-9 is obtained for all spatial channels with low power penalty. The proposed design can be easily upgraded to reconfigurable optical add/drop multiplexer capable of switching several multicore fibers.
Reconfigurable SDM Switching Using Novel Silicon Photonic Integrated Circuit

PubMed Central

Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld; Ye, Feihong; Asif, Rameez; Gross, Simon; Withford, Michael J.; Galili, Michael; Morioka, Toshio; Oxenløwe, Leif Katsuo

2016-01-01

Space division multiplexing using multicore fibers is becoming a more and more promising technology. In space-division multiplexing fiber network, the reconfigurable switch is one of the most critical components in network nodes. In this paper we for the first time demonstrate reconfigurable space-division multiplexing switching using silicon photonic integrated circuit, which is fabricated on a novel silicon-on-insulator platform with buried Al mirror. The silicon photonic integrated circuit is composed of a 7 × 7 switch and low loss grating coupler array based multicore fiber couplers. Thanks to the Al mirror, grating couplers with ultra-low coupling loss with optical multicore fibers is achieved. The lowest total insertion loss of the silicon integrated circuit is as low as 4.5 dB, with low crosstalk lower than −30 dB. Excellent performances in terms of low insertion loss and low crosstalk are obtained for the whole C-band. 1 Tb/s/core transmission over a 2-km 7-core fiber and space-division multiplexing switching is demonstrated successfully. Bit error rate performance below 10−9 is obtained for all spatial channels with low power penalty. The proposed design can be easily upgraded to reconfigurable optical add/drop multiplexer capable of switching several multicore fibers. PMID:28000735
A multi-core fiber based interferometer for high temperature sensing

NASA Astrophysics Data System (ADS)

Zhou, Song; Huang, Bo; Shu, Xuewen

2017-04-01

In this paper, we have verified and implemented a Mach-Zehnder interferometer based on seven-core fiber for high temperature sensing application. This proposed structure is based on a multi-mode-multi-core-multi-mode fiber structure sandwiched by a single mode fiber. Between the single-mode and multi-core fiber, a 3 mm long multi-mode fiber is formed for lead-in and lead-out light. The basic operation principle of this device is the use of multi-core modes, single-mode and multi-mode interference coupling is also utilized. Experimental results indicate that this interferometer sensor is capable of accurate measurements of temperatures up to 800 °C, and the temperature sensitivity of the proposed sensor is as high as 170.2 pm/°C, which is much higher than the current existing MZI based temperature sensors (109 pm/°C). This type of sensor is promising for practical high temperature applications due to its advantages including high sensitivity, simple fabrication process, low cost and compactness.
Behavior-aware cache hierarchy optimization for low-power multi-core embedded systems

NASA Astrophysics Data System (ADS)

Zhao, Huatao; Luo, Xiao; Zhu, Chen; Watanabe, Takahiro; Zhu, Tianbo

2017-07-01

In modern embedded systems, the increasing number of cores requires efficient cache hierarchies to ensure data throughput, but such cache hierarchies are restricted by their tumid size and interference accesses which leads to both performance degradation and wasted energy. In this paper, we firstly propose a behavior-aware cache hierarchy (BACH) which can optimally allocate the multi-level cache resources to many cores and highly improved the efficiency of cache hierarchy, resulting in low energy consumption. The BACH takes full advantage of the explored application behaviors and runtime cache resource demands as the cache allocation bases, so that we can optimally configure the cache hierarchy to meet the runtime demand. The BACH was implemented on the GEM5 simulator. The experimental results show that energy consumption of a three-level cache hierarchy can be saved from 5.29% up to 27.94% compared with other key approaches while the performance of the multi-core system even has a slight improvement counting in hardware overhead.
A Parallel Framework with Block Matrices of a Discrete Fourier Transform for Vector-Valued Discrete-Time Signals.

PubMed

Soto-Quiros, Pablo

2015-01-01

This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework with a set of block matrix operations. These block matrix operations contribute to analysis, design, and implementation of parallel algorithms in multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is advantage to use multicore processors and a parallel computing environment to minimize the high execution time. Additionally, speedup increases when the number of logical processors and length of the signal increase.
MDTM: Optimizing Data Transfer using Multicore-Aware I/O Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Liang; Demar, Phil; Wu, Wenji

2017-05-09

Bulk data transfer is facing significant challenges in the coming era of big data. There are multiple performance bottlenecks along the end-to-end path from the source to destination storage system. The limitations of current generation data transfer tools themselves can have a significant impact on end-to-end data transfer rates. In this paper, we identify the issues that lead to underperformance of these tools, and present a new data transfer tool with an innovative I/O scheduler called MDTM. The MDTM scheduler exploits underlying multicore layouts to optimize throughput by reducing delay and contention for I/O reading and writing operations. With ourmore » evaluations, we show how MDTM successfully avoids NUMA-based congestion and significantly improves end-to-end data transfer rates across high-speed wide area networks.« less
MDTM: Optimizing Data Transfer using Multicore-Aware I/O Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Liang; Demar, Phil; Wu, Wenji

2017-01-01

Bulk data transfer is facing significant challenges in the coming era of big data. There are multiple performance bottlenecks along the end-to-end path from the source to destination storage system. The limitations of current generation data transfer tools themselves can have a significant impact on end-to-end data transfer rates. In this paper, we identify the issues that lead to underperformance of these tools, and present a new data transfer tool with an innovative I/O scheduler called MDTM. The MDTM scheduler exploits underlying multicore layouts to optimize throughput by reducing delay and contention for I/O reading and writing operations. With ourmore » evaluations, we show how MDTM successfully avoids NUMA-based congestion and significantly improves end-to-end data transfer rates across high-speed wide area networks.« less
Spectral efficiency in crosstalk-impaired multi-core fiber links

NASA Astrophysics Data System (ADS)

Luís, Ruben S.; Puttnam, Benjamin J.; Rademacher, Georg; Klaus, Werner; Agrell, Erik; Awaji, Yoshinari; Wada, Naoya

2018-02-01

We review the latest advances on ultra-high throughput transmission using crosstalk-limited single-mode multicore fibers and compare these with the theoretical spectral efficiency of such systems. We relate the crosstalkimposed spectral efficiency limits with fiber parameters, such as core diameter, core pitch, and trench design. Furthermore, we investigate the potential of techniques such as direction interleaving and high-order MIMO to improve the throughput or reach of these systems when using various modulation formats.
Experimental investigation of inter-core crosstalk tolerance of MIMO-OFDM/OQAM radio over multicore fiber system.

PubMed

He, Jiale; Li, Borui; Deng, Lei; Tang, Ming; Gan, Lin; Fu, Songnian; Shum, Perry Ping; Liu, Deming

2016-06-13

In this paper, the feasibility of space division multiplexing for optical wireless fronthaul systems is experimentally demonstrated by implementing high speed MIMO-OFDM/OQAM radio signals over 20km 7-core fiber and 0.4m wireless link. Moreover, the impact of optical inter-core crosstalk in multicore fibers on the proposed MIMO-OFDM/OQAM radio over fiber system is experimentally evaluated in both SISO and MIMO configurations for comparison. The experimental results show that the inter-core crosstalk tolerance of the proposed radio over fiber system can be relaxed to -10 dB by using the proposed MIMO-OFDM/OQAM processing. These results could guide high density multicore fiber design to support a large number of antenna modules and a higher density of radio-access points for potential applications in 5G cellular system.
WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code

NASA Astrophysics Data System (ADS)

Mendygral, P. J.; Radcliffe, N.; Kandalla, K.; Porter, D.; O'Neill, B. J.; Nolting, C.; Edmon, P.; Donnert, J. M. F.; Jones, T. W.

2017-02-01

We present a new code for astrophysical magnetohydrodynamics specifically designed and optimized for high performance and scaling on modern and future supercomputers. We describe a novel hybrid OpenMP/MPI programming model that emerged from a collaboration between Cray, Inc. and the University of Minnesota. This design utilizes MPI-RMA optimized for thread scaling, which allows the code to run extremely efficiently at very high thread counts ideal for the latest generation of multi-core and many-core architectures. Such performance characteristics are needed in the era of “exascale” computing. We describe and demonstrate our high-performance design in detail with the intent that it may be used as a model for other, future astrophysical codes intended for applications demanding exceptional performance.
A Review of High-Performance Computational Strategies for Modeling and Imaging of Electromagnetic Induction Data

NASA Astrophysics Data System (ADS)

Newman, Gregory A.

2014-01-01

Many geoscientific applications exploit electrostatic and electromagnetic fields to interrogate and map subsurface electrical resistivity—an important geophysical attribute for characterizing mineral, energy, and water resources. In complex three-dimensional geologies, where many of these resources remain to be found, resistivity mapping requires large-scale modeling and imaging capabilities, as well as the ability to treat significant data volumes, which can easily overwhelm single-core and modest multicore computing hardware. To treat such problems requires large-scale parallel computational resources, necessary for reducing the time to solution to a time frame acceptable to the exploration process. The recognition that significant parallel computing processes must be brought to bear on these problems gives rise to choices that must be made in parallel computing hardware and software. In this review, some of these choices are presented, along with the resulting trade-offs. We also discuss future trends in high-performance computing and the anticipated impact on electromagnetic (EM) geophysics. Topics discussed in this review article include a survey of parallel computing platforms, graphics processing units to multicore CPUs with a fast interconnect, along with effective parallel solvers and associated solver libraries effective for inductive EM modeling and imaging.
Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

NASA Astrophysics Data System (ADS)

Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

2017-11-01

Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages

NASA Astrophysics Data System (ADS)

Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel

2018-01-01

This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
The design of multi-core DSP parallel model based on message passing and multi-level pipeline

NASA Astrophysics Data System (ADS)

Niu, Jingyu; Hu, Jian; He, Wenjing; Meng, Fanrong; Li, Chuanrong

2017-10-01

Currently, the design of embedded signal processing system is often based on a specific application, but this idea is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on multi-core DSP platform is designed, and it is mainly suitable for the complex algorithms which are composed of different modules. This model combines the ideas of multi-level pipeline parallelism and message passing, and summarizes the advantages of the mainstream model of multi-core DSP (the Master-Slave model and the Data Flow model), so that it has better performance. This paper uses three-dimensional image generation algorithm to validate the efficiency of the proposed model by comparing with the effectiveness of the Master-Slave and the Data Flow model.
Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shadid, John Nicolas; Lin, Paul Tinphone

2009-01-01

This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling andmore » multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.« less
An Energy-Aware Runtime Management of Multi-Core Sensory Swarms.

PubMed

Kim, Sungchan; Yang, Hoeseok

2017-08-24

In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today's sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45%compared to the state-of-the-art multi-core energy management technique.
An Energy-Aware Runtime Management of Multi-Core Sensory Swarms

PubMed Central

Kim, Sungchan

2017-01-01

In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today’s sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45%compared to the state-of-the-art multi-core energy management technique. PMID:28837094
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

NASA Astrophysics Data System (ADS)

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

2014-06-01

This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 106 spatial cells and 1 × 1012 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40:74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.

WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mendygral, P. J.; Radcliffe, N.; Kandalla, K.

2017-02-01

We present a new code for astrophysical magnetohydrodynamics specifically designed and optimized for high performance and scaling on modern and future supercomputers. We describe a novel hybrid OpenMP/MPI programming model that emerged from a collaboration between Cray, Inc. and the University of Minnesota. This design utilizes MPI-RMA optimized for thread scaling, which allows the code to run extremely efficiently at very high thread counts ideal for the latest generation of multi-core and many-core architectures. Such performance characteristics are needed in the era of “exascale” computing. We describe and demonstrate our high-performance design in detail with the intent that it maymore » be used as a model for other, future astrophysical codes intended for applications demanding exceptional performance.« less
Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liao, C; Quinlan, D J; Willcock, J J

2008-12-12

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructuremore » which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-base computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.« less
Multicore Considerations for Legacy Flight Software Migration

NASA Technical Reports Server (NTRS)

Vines, Kenneth; Day, Len

2013-01-01

In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.
Performance analysis of distributed symmetric sparse matrix vector multiplication algorithm for multi-core architectures

DOE PAGES

Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; ...

2015-07-14

In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important featuresmore » of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.« less
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gosink, Luke; Wu, Kesheng; Bethel, E. Wes

2009-06-02

The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitionsmore » and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).« less
Evaluation of a Multicore-Optimized Implementation for Tomographic Reconstruction

PubMed Central

Agulleiro, Jose-Ignacio; Fernández, José Jesús

2012-01-01

Tomography allows elucidation of the three-dimensional structure of an object from a set of projection images. In life sciences, electron microscope tomography is providing invaluable information about the cell structure at a resolution of a few nanometres. Here, large images are required to combine wide fields of view with high resolution requirements. The computational complexity of the algorithms along with the large image size then turns tomographic reconstruction into a computationally demanding problem. Traditionally, high-performance computing techniques have been applied to cope with such demands on supercomputers, distributed systems and computer clusters. In the last few years, the trend has turned towards graphics processing units (GPUs). Here we present a detailed description and a thorough evaluation of an alternative approach that relies on exploitation of the power available in modern multicore computers. The combination of single-core code optimization, vector processing, multithreading and efficient disk I/O operations succeeds in providing fast tomographic reconstructions on standard computers. The approach turns out to be competitive with the fastest GPU-based solutions thus far. PMID:23139768
A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R

As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach ismore » based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor(on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library(when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.« less
Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis

PubMed Central

Teodoro, George; Kurc, Tahsin; Andrade, Guilherme; Kong, Jun; Ferreira, Renato; Saltz, Joel

2015-01-01

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core-MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the performance of operations with respect to the device used. The performances of operations with regular data access are comparable or sometimes better on a MIC than that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performance compared to classic strategies in hybrid configurations. PMID:28239253
Scalable and Power Efficient Data Analytics for Hybrid Exascale Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Choudhary, Alok; Samatova, Nagiza; Wu, Kesheng

This project developed a generic and optimized set of core data analytics functions. These functions organically consolidate a broad constellation of high performance analytical pipelines. As the architectures of emerging HPC systems become inherently heterogeneous, there is a need to design algorithms for data analysis kernels accelerated on hybrid multi-node, multi-core HPC architectures comprised of a mix of CPUs, GPUs, and SSDs. Furthermore, the power-aware trend drives the advances in our performance-energy tradeoff analysis framework which enables our data analysis kernels algorithms and software to be parameterized so that users can choose the right power-performance optimizations.
Recent progress in InP/polymer-based devices for telecom and data center applications

NASA Astrophysics Data System (ADS)

Kleinert, Moritz; Zhang, Ziyang; de Felipe, David; Zawadzki, Crispin; Maese Novo, Alejandro; Brinker, Walter; Möhrle, Martin; Keil, Norbert

2015-02-01

Recent progress on polymer-based photonic devices and hybrid photonic integration technology using InP-based active components is presented. High performance thermo-optic components, including compact polymer variable optical attenuators and switches are powerful tools to regulate and control the light flow in the optical backbone. Polymer arrayed waveguide gratings integrated with InP laser and detector arrays function as low-cost optical line terminals (OLTs) in the WDM-PON network. External cavity tunable lasers combined with C/L band thinfilm filter, on-chip U-groove and 45° mirrors construct a compact, bi-directional and color-less optical network unit (ONU). A tunable laser integrated with VOAs, TFEs and two 90° hybrids builds the optical front-end of a colorless, dual-polarization coherent receiver. Multicore polymer waveguides and multi-step 45°mirrors are demonstrated as bridging devices between the spatialdivision- multiplexing transmission technology using multi-core fibers and the conventional PLCbased photonic platforms, appealing to the fast development of dense 3D photonic integration.
MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.

PubMed

Gonzalez-Dominguez, Jorge; Martin, Maria J

2017-10-10

In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expenses of relatively long runtimes for large scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
Real-Time Spatio-Temporal Twice Whitening for MIMO Energy Detector

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humble, Travis S; Mitra, Pramita; Barhen, Jacob

2010-01-01

While many techniques exist for local spectrum sensing of a primary user, each represents a computationally demanding task to secondary user receivers. In software-defined radio, computational complexity lengthens the time for a cognitive radio to recognize changes in the transmission environment. This complexity is even more significant for spatially multiplexed receivers, e.g., in SIMO and MIMO, where the spatio-temporal data sets grow in size with the number of antennae. Limits on power and space for the processor hardware further constrain SDR performance. In this report, we discuss improvements in spatio-temporal twice whitening (STTW) for real-time local spectrum sensing by demonstratingmore » a form of STTW well suited for MIMO environments. We implement STTW on the Coherent Logix hx3100 processor, a multicore processor intended for low-power, high-throughput software-defined signal processing. These results demonstrate how coupling the novel capabilities of emerging multicore processors with algorithmic advances can enable real-time, software-defined processing of large spatio-temporal data sets.« less
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

NASA Technical Reports Server (NTRS)

Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

2015-01-01

This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
First experience with particle-in-cell plasma physics code on ARM-based HPC systems

NASA Astrophysics Data System (ADS)

Sáez, Xavier; Soba, Alejandro; Sánchez, Edilberto; Mantsinen, Mervi; Mateo, Sergi; Cela, José M.; Castejón, Francisco

2015-09-01

In this work, we will explore the feasibility of porting a Particle-in-cell code (EUTERPE) to an ARM multi-core platform from the Mont-Blanc project. The used prototype is based on a system-on-chip Samsung Exynos 5 with an integrated GPU. It is the first prototype that could be used for High-Performance Computing (HPC), since it supports double precision and parallel programming languages.
Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
Cross talk analysis in multicore optical fibers by supermode theory.

PubMed

Szostkiewicz, Lukasz; Napierala, Marek; Ziolowicz, Anna; Pytel, Anna; Tenderenda, Tadeusz; Nasilowski, Tomasz

2016-08-15

We discuss the theoretical aspects of core-to-core power transfer in multicore fibers relying on supermode theory. Based on a dual core fiber model, we investigate the consequences of this approach, such as the influence of initial excitation conditions on cross talk. Supermode interpretation of power coupling proves to be intuitive and thus may lead to new concepts of multicore fiber-based devices. As a conclusion, we propose a definition of a uniform cross talk parameter that describes multicore fiber design.
Concurrent and Accurate Short Read Mapping on Multicore Processors.

PubMed

Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

2015-01-01

We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (HPG Aligner SA is an open-source application. The software is available at http://www.opencb.org, exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), as well as leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA, on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity to state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
Performance evaluation of canny edge detection on a tiled multicore architecture

NASA Astrophysics Data System (ADS)

Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

2011-01-01

In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP,1 and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that for the same number of threads, programmer implemented, domain decomposition exhibits higher speed-ups than the compiler managed, loop-level parallelism implemented with OpenMP.
Performance evaluation of GPU parallelization, space-time adaptive algorithms, and their combination for simulating cardiac electrophysiology.

PubMed

Sachetto Oliveira, Rafael; Martins Rocha, Bernardo; Burgarelli, Denise; Meira, Wagner; Constantinides, Christakis; Weber Dos Santos, Rodrigo

2018-02-01

The use of computer models as a tool for the study and understanding of the complex phenomena of cardiac electrophysiology has attained increased importance nowadays. At the same time, the increased complexity of the biophysical processes translates into complex computational and mathematical models. To speed up cardiac simulations and to allow more precise and realistic uses, 2 different techniques have been traditionally exploited: parallel computing and sophisticated numerical methods. In this work, we combine a modern parallel computing technique based on multicore and graphics processing units (GPUs) and a sophisticated numerical method based on a new space-time adaptive algorithm. We evaluate each technique alone and in different combinations: multicore and GPU, multicore and GPU and space adaptivity, multicore and GPU and space adaptivity and time adaptivity. All the techniques and combinations were evaluated under different scenarios: 3D simulations on slabs, 3D simulations on a ventricular mouse mesh, ie, complex geometry, sinus-rhythm, and arrhythmic conditions. Our results suggest that multicore and GPU accelerate the simulations by an approximate factor of 33×, whereas the speedups attained by the space-time adaptive algorithms were approximately 48. Nevertheless, by combining all the techniques, we obtained speedups that ranged between 165 and 498. The tested methods were able to reduce the execution time of a simulation by more than 498× for a complex cellular model in a slab geometry and by 165× in a realistic heart geometry simulating spiral waves. The proposed methods will allow faster and more realistic simulations in a feasible time with no significant loss of accuracy. Copyright © 2017 John Wiley & Sons, Ltd.
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

PubMed Central

Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300

GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.

PubMed

Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.
Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU

NASA Astrophysics Data System (ADS)

Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang

2017-10-01

Imaging spectrometer can gain two-dimensional space image and one-dimensional spectrum at the same time, which shows high utility in color and spectral measurements, the true color image synthesis, military reconnaissance and so on. In order to realize the fast reconstructed processing of the Fourier transform imaging spectrometer data, the paper designed the optimization reconstructed algorithm with OpenMP parallel calculating technology, which was further used for the optimization process for the HyperSpectral Imager of `HJ-1' Chinese satellite. The results show that the method based on multi-core parallel computing technology can control the multi-core CPU hardware resources competently and significantly enhance the calculation of the spectrum reconstruction processing efficiency. If the technology is applied to more cores workstation in parallel computing, it will be possible to complete Fourier transform imaging spectrometer real-time data processing with a single computer.
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

PubMed Central

Manolakos, Elias S.

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

PubMed

Sharma, Anuj; Manolakos, Elias S

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
Computational multicore on two-layer 1D shallow water equations for erodible dambreak

NASA Astrophysics Data System (ADS)

Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.

2018-03-01

The simulation of erodible dambreak using two-layer shallow water equations and SCHR scheme are elaborated in this paper. The results show that the two-layer SWE model in a good agreement with the data experiment which is performed by Louvain-la-Neuve Université Catholique de Louvain. Moreover, the parallel algorithm with multicore architecture are given in the results. The results show that Computer I with processor Intel(R) Core(TM) i5-2500 CPU Quad-Core has the best performance to accelerate the computational time. Moreover, Computer III with processor AMD A6-5200 APU Quad-Core is observed has higher speedup and efficiency. The speedup and efficiency of Computer III with number of grids 3200 are 3.716050530 times and 92.9% respectively.
Extreme-scale Algorithms and Solver Resilience

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dongarra, Jack

A widening gap exists between the peak performance of high-performance computers and the performance achieved by complex applications running on these platforms. Over the next decade, extreme-scale systems will present major new challenges to algorithm development that could amplify this mismatch in such a way that it prevents the productive use of future DOE Leadership computers due to the following; Extreme levels of parallelism due to multicore processors; An increase in system fault rates requiring algorithms to be resilient beyond just checkpoint/restart; Complex memory hierarchies and costly data movement in both energy and performance; Heterogeneous system architectures (mixing CPUs, GPUs,more » etc.); and Conflicting goals of performance, resilience, and power requirements.« less
3D environment modeling and location tracking using off-the-shelf components

NASA Astrophysics Data System (ADS)

Luke, Robert H.

2016-05-01

The remarkable popularity of smartphones over the past decade has led to a technological race for dominance in market share. This has resulted in a flood of new processors and sensors that are inexpensive, low power and high performance. These sensors include accelerometers, gyroscope, barometers and most importantly cameras. This sensor suite, coupled with multicore processors, allows a new community of researchers to build small, high performance platforms for low cost. This paper describes a system using off-the-shelf components to perform position tracking as well as environment modeling. The system relies on tracking using stereo vision and inertial navigation to determine movement of the system as well as create a model of the environment sensed by the system.
High performance ultrasonic field simulation on complex geometries

NASA Astrophysics Data System (ADS)

Chouh, H.; Rougeron, G.; Chatillon, S.; Iehl, J. C.; Farrugia, J. P.; Ostromoukhov, V.

2016-02-01

Ultrasonic field simulation is a key ingredient for the design of new testing methods as well as a crucial step for NDT inspection simulation. As presented in a previous paper [1], CEA-LIST has worked on the acceleration of these simulations focusing on simple geometries (planar interfaces, isotropic materials). In this context, significant accelerations were achieved on multicore processors and GPUs (Graphics Processing Units), bringing the execution time of realistic computations in the 0.1 s range. In this paper, we present recent works that aim at similar performances on a wider range of configurations. We adapted the physical model used by the CIVA platform to design and implement a new algorithm providing a fast ultrasonic field simulation that yields nearly interactive results for complex cases. The improvements over the CIVA pencil-tracing method include adaptive strategies for pencil subdivisions to achieve a good refinement of the sensor geometry while keeping a reasonable number of ray-tracing operations. Also, interpolation of the times of flight was used to avoid time consuming computations in the impulse response reconstruction stage. To achieve the best performance, our algorithm runs on multi-core superscalar CPUs and uses high performance specialized libraries such as Intel Embree for ray-tracing, Intel MKL for signal processing and Intel TBB for parallelization. We validated the simulation results by comparing them to the ones produced by CIVA on identical test configurations including mono-element and multiple-element transducers, homogeneous, meshed 3D CAD specimens, isotropic and anisotropic materials and wave paths that can involve several interactions with interfaces. We show performance results on complete simulations that achieve computation times in the 1s range.
Fiber Bragg grating inscription in optical multicore fibers

NASA Astrophysics Data System (ADS)

Becker, Martin; Elsmann, Tino; Lorenz, Adrian; Spittel, Ron; Kobelke, Jens; Schuster, Kay; Rothhardt, Manfred; Latka, Ines; Dochow, Sebastian; Bartelt, Hartmut

2015-09-01

Fiber Bragg gratings as key components in telecommunication, fiber lasers, and sensing systems usually rely on the Bragg condition for single mode fibers. In special applications, such as in biophotonics and astrophysics, high light coupling efficiency is of great importance and therefore, multimode fibers are often preferred. The wavelength filtering effect of Bragg gratings in multimode fibers, however is spectrally blurred over a wide modal spectrum of the fiber. With a well-designed all solid multicore microstructured fiber a good light guiding efficiency in combination with narrow spectral filtering effect by Bragg gratings becomes possible.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

PubMed Central

Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.

2014-01-01

The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
Re-Form: FPGA-Powered True Codesign Flow for High-Performance Computing In The Post-Moore Era

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cappello, Franck; Yoshii, Kazutomo; Finkel, Hal

Multicore scaling will end soon because of practical power limits. Dark silicon is becoming a major issue even more than the end of Moore’s law. In the post-Moore era, the energy efficiency of computing will be a major concern. FPGAs could be a key to maximizing the energy efficiency. In this paper we address severe challenges in the adoption of FPGA in HPC and describe “Re-form,” an FPGA-powered codesign flow.
Multicore fibre photonic lanterns for precision radial velocity Science

NASA Astrophysics Data System (ADS)

Gris-Sánchez, Itandehui; Haynes, Dionne M.; Ehrlich, Katjana; Haynes, Roger; Birks, Tim A.

2018-04-01

Incomplete fibre scrambling and fibre modal noise can degrade high-precision spectroscopic applications (typically high spectral resolution and high signal to noise). For example, it can be the dominating error source for exoplanet finding spectrographs, limiting the maximum measurement precision possible with such facilities. This limitation is exacerbated in the next generation of infra-red based systems, as the number of modes supported by the fibre scales inversely with the wavelength squared and more modes typically equates to better scrambling. Substantial effort has been made by major research groups in this area to improve the fibre link performance by employing non-circular fibres, double scramblers, fibre shakers, and fibre stretchers. We present an original design of a multicore fibre (MCF) terminated with multimode photonic lantern ports. It is designed to act as a relay fibre with the coupling efficiency of a multimode fibre (MMF), modal stability similar to a single-mode fibre and low loss in a wide range of wavelengths (380 nm to 860 nm). It provides phase and amplitude scrambling to achieve a stable near field and far-field output illumination pattern despite input coupling variations, and low modal noise for increased stability for high signal-to-noise applications such as precision radial velocity (PRV) science. Preliminary results are presented for a 511-core MCF and compared with current state of the art octagonal fibre.
Multi-cored vortices support function of slotted wing tips of birds in gliding and flapping flight

PubMed Central

2017-01-01

Slotted wing tips of birds are commonly considered an adaptation to improve soaring performance, despite their presence in species that neither soar nor glide. We used particle image velocimetry to measure the airflow around the slotted wing tip of a jackdaw (Corvus monedula) as well as in its wake during unrestrained flight in a wind tunnel. The separated primary feathers produce individual wakes, confirming a multi-slotted function, in both gliding and flapping flight. The resulting multi-cored wingtip vortex represents a spreading of vorticity, which has previously been suggested as indicative of increased aerodynamic efficiency. Considering benefits of the slotted wing tips that are specific to flapping flight combined with the wide phylogenetic occurrence of this configuration, we propose the hypothesis that slotted wings evolved initially to improve performance in powered flight. PMID:28539482
A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform

PubMed Central

Yang, Lin; Gong, Leiguang; Zhang, Hong; Nosher, John L.; Foran, David J.

2013-01-01

Point matching is crucial for many computer vision applications. Establishing the correspondence between a large number of data points is a computationally intensive process. Some point matching related applications, such as medical image registration, require real time or near real time performance if applied to critical clinical applications like image assisted surgery. In this paper, we report a new multicore platform based parallel algorithm for fast point matching in the context of landmark based medical image registration. We introduced a non-regular data partition algorithm which utilizes the K-means clustering algorithm to group the landmarks based on the number of available processing cores, which optimize the memory usage and data transfer. We have tested our method using the IBM Cell Broadband Engine (Cell/B.E.) platform. The results demonstrated a significant speed up over its sequential implementation. The proposed data partition and parallelization algorithm, though tested only on one multicore platform, is generic by its design. Therefore the parallel algorithm can be extended to other computing platforms, as well as other point matching related applications. PMID:24308014
Influence of fibre design and curvature on crosstalk in multi-core fibre

NASA Astrophysics Data System (ADS)

Egorova, O. N.; Astapovich, M. S.; Melnikov, L. A.; Salganskii, M. Yu; Mishkin, V. P.; Nishchev, K. N.; Semjonov, S. L.; Dianov, E. M.

2016-03-01

We have studied the influence of cross-sectional structure and bends on optical cross-talk in a multicore fibre. A reduced refractive index layer produced between the cores of such fibre with a small centre-to-centre spacing between neighbouring cores (27 μm) reduces optical cross-talk by 20 dB. The cross-talk level achieved, 30 dB per kilometre of the length of the multicore fibre, is acceptable for a number of applications where relatively small lengths of fibre are needed. Moreover, a significant decrease in optical cross-talk has been ensured by reducing the winding diameter of multicore fibres with identical cores.
Geocomputation over Hybrid Computer Architecture and Systems: Prior Works and On-going Initiatives at UARK

NASA Astrophysics Data System (ADS)

Shi, X.

2015-12-01

As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced infrastructure remains unexplored in this domain. Within this presentation, our prior and on-going initiatives will be summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs and MICs, to accelerate geocomputation in different applications.
Optimization of the coherence function estimation for multi-core central processing unit

NASA Astrophysics Data System (ADS)

Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

2017-02-01

The paper considers use of parallel processing on multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. Coherence function along with other methods of spectral analysis is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for the function evaluation for signals represented with digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, their results are assessed for multi-core processors from different manufacturers. Thus, speeding-up of the parallel execution with respect to sequential execution was studied and results are presented for Intel Core i7-4720HQ и AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with acoustic correlation method.
Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

DTIC Science & Technology

2013-09-30

STM32 NXP LPC series No Proprietary Microchip PIC32/DSPIC No > 500 mW; < 5 W ARM Cortex TI OMAP TI Sitara Broadcom BCM2835 Varies FPGA...1 DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Application of Advanced Multi-Core Processor Technologies...state-of-the-art information processing architectures. OBJECTIVES Next-generation processor architectures (multi-core, multi-threaded) hold the
Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schmalz, Mark S

2011-07-24

Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation {und G}more » for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G {yields} {und G}, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.« less
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.

Influence of fibre design and curvature on crosstalk in multi-core fibre

DOE Office of Scientific and Technical Information (OSTI.GOV)

Egorova, O N; Astapovich, M S; Semjonov, S L

2016-03-31

We have studied the influence of cross-sectional structure and bends on optical cross-talk in a multicore fibre. A reduced refractive index layer produced between the cores of such fibre with a small centre-to-centre spacing between neighbouring cores (27 μm) reduces optical cross-talk by 20 dB. The cross-talk level achieved, 30 dB per kilometre of the length of the multicore fibre, is acceptable for a number of applications where relatively small lengths of fibre are needed. Moreover, a significant decrease in optical cross-talk has been ensured by reducing the winding diameter of multicore fibres with identical cores. (fiber optics)
Parallel k-means++ for Multiple Shared-Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mackey, Patrick S.; Lewis, Robert R.

2016-09-22

In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varyingmore » data sizes.« less
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
1001 Ways to run AutoDock Vina for virtual screening

NASA Astrophysics Data System (ADS)

Jaghoori, Mohammad Mahdi; Bleijlevens, Boris; Olabarriaga, Silvia D.

2016-03-01

Large-scale computing technologies have enabled high-throughput virtual screening involving thousands to millions of drug candidates. It is not trivial, however, for biochemical scientists to evaluate the technical alternatives and their implications for running such large experiments. Besides experience with the molecular docking tool itself, the scientist needs to learn how to run it on high-performance computing (HPC) infrastructures, and understand the impact of the choices made. Here, we review such considerations for a specific tool, AutoDock Vina, and use experimental data to illustrate the following points: (1) an additional level of parallelization increases virtual screening throughput on a multi-core machine; (2) capturing of the random seed is not enough (though necessary) for reproducibility on heterogeneous distributed computing systems; (3) the overall time spent on the screening of a ligand library can be improved by analysis of factors affecting execution time per ligand, including number of active torsions, heavy atoms and exhaustiveness. We also illustrate differences among four common HPC infrastructures: grid, Hadoop, small cluster and multi-core (virtual machine on the cloud). Our analysis shows that these platforms are suitable for screening experiments of different sizes. These considerations can guide scientists when choosing the best computing platform and set-up for their future large virtual screening experiments.
1001 Ways to run AutoDock Vina for virtual screening.

PubMed

Jaghoori, Mohammad Mahdi; Bleijlevens, Boris; Olabarriaga, Silvia D

2016-03-01

Large-scale computing technologies have enabled high-throughput virtual screening involving thousands to millions of drug candidates. It is not trivial, however, for biochemical scientists to evaluate the technical alternatives and their implications for running such large experiments. Besides experience with the molecular docking tool itself, the scientist needs to learn how to run it on high-performance computing (HPC) infrastructures, and understand the impact of the choices made. Here, we review such considerations for a specific tool, AutoDock Vina, and use experimental data to illustrate the following points: (1) an additional level of parallelization increases virtual screening throughput on a multi-core machine; (2) capturing of the random seed is not enough (though necessary) for reproducibility on heterogeneous distributed computing systems; (3) the overall time spent on the screening of a ligand library can be improved by analysis of factors affecting execution time per ligand, including number of active torsions, heavy atoms and exhaustiveness. We also illustrate differences among four common HPC infrastructures: grid, Hadoop, small cluster and multi-core (virtual machine on the cloud). Our analysis shows that these platforms are suitable for screening experiments of different sizes. These considerations can guide scientists when choosing the best computing platform and set-up for their future large virtual screening experiments.
Multicore Programming Challenges

NASA Astrophysics Data System (ADS)

Perrone, Michael

The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.
Connectivity: Performance Portable Algorithms for graph connectivity v. 0.1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Slota, George; Rajamanickam, Sivasankaran; Madduri, Kamesh

Graphs occur in several places in real world from road networks, social networks and scientific simulations. Connectivity is a graph analysis software to graph connectivity in modern architectures like multicore CPUs, Xeon Phi and GPUs.
Multi-cored vortices support function of slotted wing tips of birds in gliding and flapping flight.

PubMed

KleinHeerenbrink, Marco; Johansson, L Christoffer; Hedenström, Anders

2017-05-01

Slotted wing tips of birds are commonly considered an adaptation to improve soaring performance, despite their presence in species that neither soar nor glide. We used particle image velocimetry to measure the airflow around the slotted wing tip of a jackdaw ( Corvus monedula ) as well as in its wake during unrestrained flight in a wind tunnel. The separated primary feathers produce individual wakes, confirming a multi-slotted function, in both gliding and flapping flight. The resulting multi-cored wingtip vortex represents a spreading of vorticity, which has previously been suggested as indicative of increased aerodynamic efficiency. Considering benefits of the slotted wing tips that are specific to flapping flight combined with the wide phylogenetic occurrence of this configuration, we propose the hypothesis that slotted wings evolved initially to improve performance in powered flight. © 2017 The Author(s).
The parallel algorithm for the 2D discrete wavelet transform

NASA Astrophysics Data System (ADS)

Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

2018-04-01

The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
Architected Lattices with High Stiffness and Toughness via Multicore-Shell 3D Printing.

PubMed

Mueller, Jochen; Raney, Jordan R; Shea, Kristina; Lewis, Jennifer A

2018-03-01

The ability to create architected materials that possess both high stiffness and toughness remains an elusive goal, since these properties are often mutually exclusive. Natural materials, such as bone, overcome such limitations by combining different toughening mechanisms across multiple length scales. Here, a new method for creating architected lattices composed of core-shell struts that are both stiff and tough is reported. Specifically, these lattices contain orthotropic struts with flexible epoxy core-brittle epoxy shell motifs in the absence and presence of an elastomeric silicone interfacial layer, which are fabricated by a multicore-shell, 3D printing technique. It is found that architected lattices produced with a flexible core-elastomeric interface-brittle shell motif exhibit both high stiffness and toughness. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
A Programming Model Performance Study Using the NAS Parallel Benchmarks

DOE PAGES

Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...

2010-01-01

Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an Infiniband cluster. Our results show that in general the threemore » programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve equal performance to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. Also we compare performance differences between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.« less
Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

DTIC Science & Technology

2014-04-01

Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey...multicore PDSP platforms. The GPU- based capabilities of TDIF are currently oriented towards NVIDIA GPUs, based on the Compute Unified Device Architecture...CUDA) programming language [ NVIDIA 2007], which can be viewed as an extension of C. The multicore PDSP capabilities currently in TDIF are oriented
Reducing Response Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms

DTIC Science & Technology

2016-01-01

synchronous parallel tasks on multicore platforms. In 25th ECRTS, 2013. [10] U. Devi. Soft Real - Time Scheduling on Multiprocessors. PhD thesis...report, Washington University in St Louis, 2014. [18] C. Liu and J. Anderson. Supporting soft real - time DAG-based sys- tems on multiprocessors with...analysis for DAG-based real - time task systems im- plemented on heterogeneous multicore platforms. The spe- cific analysis problem that is considered was
Thread mapping using system-level model for shared memory multicores

NASA Astrophysics Data System (ADS)

Mitra, Reshmi

Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that the users can explore the design space within reasonable machine hours, without thorough understanding on how the code interacts with the platform. Response time is related to the cycles per instructions retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on Markov Chain Model (MCM) and Model Tree (MT) for system-level steady state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with, the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady state CPI results with 12 different MSs is less than 1%. The total run time of model is of the order of minutes, whereas the actual application execution time is in terms of days.
Benchmarking GPU and CPU codes for Heisenberg spin glass over-relaxation

NASA Astrophysics Data System (ADS)

Bernaschi, M.; Parisi, G.; Parisi, L.

2011-06-01

We present a set of possible implementations for Graphics Processing Units (GPU) of the Over-relaxation technique applied to the 3D Heisenberg spin glass model. The results show that a carefully tuned code can achieve more than 100 GFlops/s of sustained performance and update a single spin in about 0.6 nanoseconds. A multi-hit technique that exploits the GPU shared memory further reduces this time. Such results are compared with those obtained by means of a highly-tuned vector-parallel code on latest generation multi-core CPUs.
A multicore optical fiber for distributed sensing

NASA Astrophysics Data System (ADS)

Sun, Xiaoguang; Li, Jie; Burgess, David T.; Hines, Mike; Zhu, Beyuan

2014-06-01

With advancements in optical fiber technology, the incorporation of multiple sensing functionalities within a single fiber structure opens the possibility to deploy dielectric, fully distributed, long-length optical sensors in an extremely small cross section. To illustrate the concept, we designed and manufactured a multicore optical fiber with three graded-index (GI) multimode (MM) cores and one single mode (SM) core. The fiber was coated with both a silicone primary layer and an ETFE buffer for high temperature applications. The fiber properties such as geometry, crosstalk and attenuation are described. A method for coupling the signal from the individual cores into separate optical fibers is also presented.
New multicore low mode noise scrambling fiber for applications in high-resolution spectroscopy

NASA Astrophysics Data System (ADS)

Haynes, Dionne M.; Gris-Sanchez, Itandehui; Ehrlich, Katjana; Birks, Tim A.; Giannone, Domenico; Haynes, Roger

2014-07-01

We present a new type of multicore fiber (MCF) and photonic lantern that consists of 511 individual cores designed to operate over a broadband visible wavelength range (380-860nm). It combines the coupling efficiency of a multimode fiber with modal stability intrinsic to a single mode fibre. It is designed to provide phase and amplitude scrambling to achieve a stable near field and far field illumination pattern during input coupling variations; it also has low modal noise for increased photometric stability. Preliminary results are presented for the new MCF as well as current state of the art octagonal fiber for comparison.
Nonlinear combining and compression in multicore fibers

DOE PAGES

Chekhovskoy, I. S.; Rubenchik, A. M.; Shtyrina, O. V.; ...

2016-10-25

In this paper, we demonstrate numerically light-pulse combining and pulse compression using wave-collapse (self-focusing) energy-localization dynamics in a continuous-discrete nonlinear system, as implemented in a multicore fiber (MCF) using one-dimensional (1D) and 2D core distribution designs. Large-scale numerical simulations were performed to determine the conditions of the most efficient coherent combining and compression of pulses injected into the considered MCFs. We demonstrate the possibility of combining in a single core 90% of the total energy of pulses initially injected into all cores of a 7-core MCF with a hexagonal lattice. Finally, a pulse compression factor of about 720 can bemore » obtained with a 19-core ring MCF.« less
SiN-assisted polarization-insensitive multicore fiber to silicon photonics interface

NASA Astrophysics Data System (ADS)

Poulopoulos, Giannis N.; Kalavrouziotis, Dimitrios; Mitchell, Paul; Macdonald, John R.; Bakopoulos, Paraskevas; Avramopoulos, Hercules

2015-06-01

We demonstrate a polarization-insensitive coupler interfacing multicore-fiber (MCF) to silicon waveguides. It comprises a 3D glass fanout transforming the circular MCF core-arrangement to linear and performing initial tapering, followed by a Spot-Size-Converter on the silicon chip. Glass waveguides are formed of multiple overlapped modification elements and appropriate offsetting thereof yields tapers with symmetric cross-section. The Spot-Size-Converter is an inverselytapered silicon waveguide with a tapered polymer overcladding where light is initially coupled, whereas phase-matching gradually shifts it towards the silicon core. Co-design of the glass fanout and Spot-Size-Converter obtains theoretical loss below 1dB for the overall Si-to-MCF transition in both polarizations.
Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multi-core architectures.

PubMed

Stamatakis, Alexandros; Ott, Michael

2008-12-27

The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on 'gappy' multi-gene alignments. By 'gappy' we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAXML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aliaga, José I., E-mail: aliaga@uji.es; Alonso, Pedro; Badía, José M.

We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousandsmore » degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.« less
Diderot: a Domain-Specific Language for Portable Parallel Scientific Visualization and Image Analysis.

PubMed

Kindlmann, Gordon; Chiw, Charisee; Seltzer, Nicholas; Samuels, Lamont; Reppy, John

2016-01-01

Many algorithms for scientific visualization and image analysis are rooted in the world of continuous scalar, vector, and tensor fields, but are programmed in low-level languages and libraries that obscure their mathematical foundations. Diderot is a parallel domain-specific language that is designed to bridge this semantic gap by providing the programmer with a high-level, mathematical programming notation that allows direct expression of mathematical concepts in code. Furthermore, Diderot provides parallel performance that takes advantage of modern multicore processors and GPUs. The high-level notation allows a concise and natural expression of the algorithms and the parallelism allows efficient execution on real-world datasets.
mdtmFTP and its evaluation on ESNET SDN testbed

DOE PAGES

Zhang, Liang; Wu, Wenji; DeMar, Phil; ...

2017-04-21

In this paper, to address the high-performance challenges of data transfer in the big data era, we are developing and implementing mdtmFTP: a high-performance data transfer tool for big data. mdtmFTP has four salient features. First, it adopts an I/O centric architecture to execute data transfer tasks. Second, it more efficiently utilizes the underlying multicore platform through optimized thread scheduling. Third, it implements a large virtual file mechanism to address the lots-of-small-files (LOSF) problem. In conclusion, mdtmFTP integrates multiple optimization mechanisms, including–zero copy, asynchronous I/O, pipelining, batch processing, and pre-allocated buffer pools–to enhance performance. mdtmFTP has been extensively tested andmore » evaluated within the ESNET 100G testbed. Evaluations show that mdtmFTP can achieve higher performance than existing data transfer tools, such as GridFTP, FDT, and BBCP.« less
Exploiting Vector and Multicore Parallelsim for Recursive, Data- and Task-Parallel Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal

Modern hardware contains parallel execution resources that are well-suited for data-parallelism-vector units-and task parallelism-multicores. However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently as multicores. Our framework allows us to define schedulers that can dynamically select between executing task- blocks on vector units or multicores. We show that thesemore » schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel pro- grams into task block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.« less
Options for Parallelizing a Planning and Scheduling Algorithm

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.

2011-01-01

Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.
Composition and Realization of Source-to-Sink High-Performance Flows: File Systems, Storage, Hosts, LAN and WAN

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu, Chase Qishi

A number of Department of Energy (DOE) science applications, involving exascale computing systems and large experimental facilities, are expected to generate large volumes of data, in the range of petabytes to exabytes, which will be transported over wide-area networks for the purpose of storage, visualization, and analysis. To support such capabilities, significant progress has been made in various components including the deployment of 100 Gbps networks with future 1 Tbps bandwidth, increases in end-host capabilities with multiple cores and buses, capacity improvements in large disk arrays, and deployment of parallel file systems such as Lustre and GPFS. High-performance source-to-sink datamore » flows must be composed of these component systems, which requires significant optimizations of the storage-to-host data and execution paths to match the edge and long-haul network connections. In particular, end systems are currently supported by 10-40 Gbps Network Interface Cards (NIC) and 8-32 Gbps storage Host Channel Adapters (HCAs), which carry the individual flows that collectively must reach network speeds of 100 Gbps and higher. Indeed, such data flows must be synthesized using multicore, multibus hosts connected to high-performance storage systems on one side and to the network on the other side. Current experimental results show that the constituent flows must be optimally composed and preserved from storage systems, across the hosts and the networks with minimal interference. Furthermore, such a capability must be made available transparently to the science users without placing undue demands on them to account for the details of underlying systems and networks. And, this task is expected to become even more complex in the future due to the increasing sophistication of hosts, storage systems, and networks that constitute the high-performance flows. The objectives of this proposal are to (1) develop and test the component technologies and their synthesis methods to achieve source-to-sink high-performance flows, and (2) develop tools that provide these capabilities through simple interfaces to users and applications. In terms of the former, we propose to develop (1) optimization methods that align and transition multiple storage flows to multiple network flows on multicore, multibus hosts; and (2) edge and long-haul network path realization and maintenance using advanced provisioning methods including OSCARS and OpenFlow. We also propose synthesis methods that combine these individual technologies to compose high-performance flows using a collection of constituent storage-network flows, and realize them across the storage and local network connections as well as long-haul connections. We propose to develop automated user tools that profile the hosts, storage systems, and network connections; compose the source-to-sink complex flows; and set up and maintain the needed network connections. These solutions will be tested using (1) 100 Gbps connection(s) between Oak Ridge National Laboratory (ORNL) and Argonne National Laboratory (ANL) with storage systems supported by Lustre and GPFS file systems with an asymmetric connection to University of Memphis (UM); (2) ORNL testbed with multicore and multibus hosts, switches with OpenFlow capabilities, and network emulators; and (3) 100 Gbps connections from ESnet and their Openflow testbed, and other experimental connections. This proposal brings together the expertise and facilities of the two national laboratories, ORNL and ANL, and UM. It also represents a collaboration between DOE and the Department of Defense (DOD) projects at ORNL by sharing technical expertise and personnel costs, and leveraging the existing DOD Extreme Scale Systems Center (ESSC) facilities at ORNL.« less
Spiking neural networks on high performance computer clusters

NASA Astrophysics Data System (ADS)

Chen, Chong; Taha, Tarek M.

2011-09-01

In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster utilized consists of 352 dualcore AMD Opterons, the Cell cluster consists of 320 Sony Playstation 3s, while the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, the GPGPU is more expensive in terms of neuron/s throughput. If the cost of GPGPUs go down in the future, this platform will become very cost effective for these models.
Performance comparison of a fiber optic communication system based on optical OFDM and an optical OFDM-MIMO with Alamouti code by using numerical simulations

NASA Astrophysics Data System (ADS)

Serpa-Imbett, C. M.; Marín-Alfonso, J.; Gómez-Santamaría, C.; Betancur-Agudelo, L.; Amaya-Fernández, F.

2013-12-01

Space division multiplexing in multicore fibers is one of the most promise technologies in order to support transmissions of next-generation peta-to-exaflop-scale supercomputers and mega data centers, owing to advantages in terms of costs and space saving of the new optical fibers with multiple cores. Additionally, multicore fibers allow photonic signal processing in optical communication systems, taking advantage of the mode coupling phenomena. In this work, we numerically have simulated an optical MIMO-OFDM (multiple-input multiple-output orthogonal frequency division multiplexing) by using the coded Alamouti to be transmitted through a twin-core fiber with low coupling. Furthermore, an optical OFDM is transmitted through a core of a singlemode fiber, using pilot-aided channel estimation. We compare the transmission performance in the twin-core fiber and in the singlemode fiber taking into account numerical results of the bit-error rate, considering linear propagation, and Gaussian noise through an optical fiber link. We carry out an optical fiber transmission of OFDM frames using 8 PSK and 16 QAM, with bit rates values of 130 Gb/s and 170 Gb/s, respectively. We obtain a penalty around 4 dB for the 8 PSK transmissions, after 100 km of linear fiber optic propagation for both singlemode and twin core fiber. We obtain a penalty around 6 dB for the 16 QAM transmissions, with linear propagation after 100 km of optical fiber. The transmission in a two-core fiber by using Alamouti coded OFDM-MIMO exhibits a better performance, offering a good alternative in the mitigation of fiber impairments, allowing to expand Alamouti coded in multichannel systems spatially multiplexed in multicore fibers.
Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

NASA Astrophysics Data System (ADS)

Greynolds, Alan W.

2013-09-01

Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.
Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores

NASA Astrophysics Data System (ADS)

Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei

We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores

PubMed Central

Kim, Youngmin; Lee, Chan-Gun

2017-01-01

In wireless sensor networks (WSNs), sensor nodes are deployed for collecting and analyzing data. These nodes use limited energy batteries for easy deployment and low cost. The use of limited energy batteries is closely related to the lifetime of the sensor nodes when using wireless sensor networks. Efficient-energy management is important to extending the lifetime of the sensor nodes. Most effort for improving power efficiency in tiny sensor nodes has focused mainly on reducing the power consumed during data transmission. However, recent emergence of sensor nodes equipped with multi-cores strongly requires attention to be given to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy efficient scheduling method for sensor nodes supporting a uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of a uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, processor selection for a scheduling and mapping method between the tasks and processors is proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods. PMID:29240695
Fault Mitigation Schemes for Future Spaceflight Multicore Processors

NASA Technical Reports Server (NTRS)

Some, Rafi; Gostelow, Kim P.; Lai, John; Reder, Leonard; Alexander, James; Clement, Brad

2012-01-01

The goal of this work is to achieve fail-operational and graceful-degradation behavior in realistic flight mission scenarios, of multicore processors such as Mars Entry-Descent-Landing (EDL) and Primitive Body proximity operations.
Comparing an FPGA to a Cell for an Image Processing Application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ngo, Hau; Broussard, Randy P.; Ives, Robert W.

2010-12-01

Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to discover the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, Iris Recognition Systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism

ERIC Educational Resources Information Center

Agarwal, Mayank

2009-01-01

The shift of the microprocessor industry towards multicore architectures has placed a huge burden on the programmers by requiring explicit parallelization for performance. Implicit Parallelization is an alternative that could ease the burden on programmers by parallelizing applications "under the covers" while maintaining sequential semantics…
Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4

DOE Office of Scientific and Technical Information (OSTI.GOV)

Computational Research Division, Lawrence Berkeley National Laboratory; NERSC, Lawrence Berkeley National Laboratory; Computer Science Department, University of California, Berkeley

2009-05-04

We apply auto-tuning to a hybrid MPI-pthreads lattice Boltzmann computation running on the Cray XT4 at National Energy Research Scientific Computing Center (NERSC). Previous work showed that multicore-specific auto-tuning can improve the performance of lattice Boltzmann magnetohydrodynamics (LBMHD) by a factor of 4x when running on dual- and quad-core Opteron dual-socket SMPs. We extend these studies to the distributed memory arena via a hybrid MPI/pthreads implementation. In addition to conventional auto-tuning at the local SMP node, we tune at the message-passing level to determine the optimal aspect ratio as well as the correct balance between MPI tasks and threads permore » MPI task. Our study presents a detailed performance analysis when moving along an isocurve of constant hardware usage: fixed total memory, total cores, and total nodes. Overall, our work points to approaches for improving intra- and inter-node efficiency on large-scale multicore systems for demanding scientific applications.« less
High-accuracy fiber-optic shape sensing

NASA Astrophysics Data System (ADS)

Duncan, Roger G.; Froggatt, Mark E.; Kreger, Stephen T.; Seeley, Ryan J.; Gifford, Dawn K.; Sang, Alexander K.; Wolfe, Matthew S.

2007-04-01

We describe the results of a study of the performance characteristics of a monolithic fiber-optic shape sensor array. Distributed strain measurements in a multi-core optical fiber interrogated with the optical frequency domain reflectometry technique are used to deduce the shape of the optical fiber; referencing to a coordinate system yields position information. Two sensing techniques are discussed herein: the first employing fiber Bragg gratings and the second employing the intrinsic Rayleigh backscatter of the optical fiber. We have measured shape and position under a variety of circumstances and report the accuracy and precision of these measurements. A discussion of error sources is included.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.

PubMed

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-07-08

Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.
Tablet—next generation sequence assembly visualization

PubMed Central

Milne, Iain; Bayer, Micha; Cardle, Linda; Shaw, Paul; Stephen, Gordon; Wright, Frank; Marshall, David

2010-01-01

Summary: Tablet is a lightweight, high-performance graphical viewer for next-generation sequence assemblies and alignments. Supporting a range of input assembly formats, Tablet provides high-quality visualizations showing data in packed or stacked views, allowing instant access and navigation to any region of interest, and whole contig overviews and data summaries. Tablet is both multi-core aware and memory efficient, allowing it to handle assemblies containing millions of reads, even on a 32-bit desktop machine. Availability: Tablet is freely available for Microsoft Windows, Apple Mac OS X, Linux and Solaris. Fully bundled installers can be downloaded from http://bioinf.scri.ac.uk/tablet in 32- and 64-bit versions. Contact: tablet@scri.ac.uk PMID:19965881
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K

2010-01-01

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Messagemore » Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.« less
Multiplexed fibre optic sensing in the distal lung (Conference Presentation)

NASA Astrophysics Data System (ADS)

Choudhary, Tushar R.; Tanner, Michael G.; Megia-Fernandez, Alicia; Harrington, Kerrianne; Wood, Harry A.; Chankeshwara, Sunay; Zhu, Patricia; Choudhury, Debaditya; Yu, Fei; Thomson, Robert R.; Duncan, Rory R.; Dhaliwal, Kevin; Bradley, Mark

2017-02-01

We present a toolkit for a multiplexed pH and oxygen sensing probe in the distal lung using multicore fibres. Measuring physiological relevant parameters like pH and oxygen is of significant importance in understanding changes associated with disease pathology. We present here, a single multicore fibre based pH and oxygen sensing probe which can be used with a standard bronchoscope to perform in vivo measurements in the distal lung. The multiplexed probe consists of fluorescent pH sensors (fluorescein based) and oxygen sensors (Palladium porphyrin complex based) covalently bonded to silica microspheres (10 µm) loaded on the distal facet of a 19 core (10 µm core diameter) multicore fibre (total diameter of 150 µm excluding coating). Pits are formed by selectively etching the cores using hydrofluoric acid, multiplexing is achieved through the self-location of individual probes on differing cores. This architecture can be expanded to include probes for further parameters. Robust measurements are demonstrated of self-referencing fluorophores, not limited by photobleaching, with short (100ms) measurement times at low ( 10µW) illumination powers. We have performed on bench calibration and tests of in vitro tissue models and in an ovine whole lung model to validate our sensors. The pH sensor is demonstrated in the physiologically relevant range of pH 5 to pH 8.5 and with an accuracy of ± 0.05 pH units. The oxygen sensor is demonstrated in gas mixtures downwards from 20% oxygen and in liquid saturated with 20% oxygen mixtures ( 8mg/L) down to full depletion (0mg/L) with 0.5mg/L accuracy.

Multi-Threaded DNA Tag/Anti-Tag Library Generator for Multi-Core Platforms

DTIC Science & Technology

2009-05-01

base pair) Watson ‐ Crick strand pairs that bind perfectly within pairs, but poorly across pairs. A variety of DNA strand hybridization metrics...AFRL-RI-RS-TR-2009-131 Final Technical Report May 2009 MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE PLATFORMS...TYPE Final 3. DATES COVERED (From - To) Jun 08 – Feb 09 4. TITLE AND SUBTITLE MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE
Core-to-core uniformity improvement in multi-core fiber Bragg gratings

NASA Astrophysics Data System (ADS)

Lindley, Emma; Min, Seong-Sik; Leon-Saval, Sergio; Cvetojevic, Nick; Jovanovic, Nemanja; Bland-Hawthorn, Joss; Lawrence, Jon; Gris-Sanchez, Itandehui; Birks, Tim; Haynes, Roger; Haynes, Dionne

2014-07-01

Multi-core fiber Bragg gratings (MCFBGs) will be a valuable tool not only in communications but also various astronomical, sensing and industry applications. In this paper we address some of the technical challenges of fabricating effective multi-core gratings by simulating improvements to the writing method. These methods allow a system designed for inscribing single-core fibers to cope with MCFBG fabrication with only minor, passive changes to the writing process. Using a capillary tube that was polished on one side, the field entering the fiber was flattened which improved the coverage and uniformity of all cores.
Towards large dynamic range and ultrahigh measurement resolution in distributed fiber sensing based on multicore fiber.

PubMed

Dang, Yunli; Zhao, Zhiyong; Tang, Ming; Zhao, Can; Gan, Lin; Fu, Songnian; Liu, Tongqing; Tong, Weijun; Shum, Perry Ping; Liu, Deming

2017-08-21

Featuring a dependence of Brillouin frequency shift (BFS) on temperature and strain changes over a wide range, Brillouin distributed optical fiber sensors are however essentially subjected to the relatively poor temperature/strain measurement resolution. On the other hand, phase-sensitive optical time-domain reflectometry (Φ-OTDR) offers ultrahigh temperature/strain measurement resolution, but the available frequency scanning range is normally narrow thereby severely restricts its measurement dynamic range. In order to achieve large dynamic range and high measurement resolution simultaneously, we propose to employ both the Brillouin optical time domain analysis (BOTDA) and Φ-OTDR through space-division multiplexed (SDM) configuration based on the multicore fiber (MCF), in which the two sensors are spatially separately implemented in the central core and a side core, respectively. As a proof of concept, the temperature sensing has been performed for validation with 2.5 m spatial resolution over 1.565 km MCF. Large temperature range (10 °C) has been measured by BOTDA and the 0.1 °C small temperature variation is successfully identified by Φ-OTDR with ~0.001 °C resolution. Moreover, the temperature changing process has been recorded by continuously performing the measurement of Φ-OTDR with 80 s frequency scanning period, showing about 0.02 °C temperature spacing at the monitored profile. The proposed system enables the capability to see finer and/or farther upon requirement in distributed optical fiber sensing.
Rapid Onboard Data Product Generation with Multicore Processors and FPGA

NASA Astrophysics Data System (ADS)

Mandl, D.; Sohlberg, R. A.; Cappelaere, P. G.; Frye, S. W.; Ly, V.; Handy, M.; Ambrosia, V. G.; Sullivan, D. V.; Bland, G.; Pastor, E.; Crago, S.; Flatley, C.; Shah, N.; Bronston, J.; Creech, T.

2012-12-01

The Intelligent Payload Module (IPM) is an experimental testbed with multicore processors and Field Programmable Gate Array (FPGA). This effort is being funded by the NASA Earth Science Technology Office as part of an Advanced Information Systems Technology (AIST) 2011 research grant to investigate the use of high performance onboard processing to create an onboard data processing pipeline that can rapidly process a subset of onboard imaging spectrometer data (1) through radiance to reflectance conversion (2) atmospheric correction (3) geolocation and co-registration and (4) level 2 data product generation. The requirements are driven by the mission concept for the HyspIRI NASA Decadal mission, although other NASA Decadal missions could use the same concept. The system is being set up to make use of the same ground and flight software being used by other satellites at NASA/GSFC. Furthermore, a Web Coverage Processing Service (WCPS) is installed as part of the flight software which enables a user on the ground to specify the desired algorithm to run onboard against the data in realtime. Benchmark demonstrations are being run and will be run through the three year effort on various platforms including a helicopter and various airplane platforms with various instruments to demonstrate various configurations that would be compatible with the HyspIRI mission and other similar missions. This presentation will lay out the demonstrations conducted to date along with any benchmark performance metrics and future demonstration efforts and objectives.Initial IPM Test Box
An MPI-1 Compliant Thread-Based Implementation

NASA Astrophysics Data System (ADS)

Díaz Martín, J. C.; Rico Gallego, J. A.; Álvarez Llorente, J. M.; Perogil Duque, J. F.

This work presents AzequiaMPI, the first full compliant implementation of the MPI-1 standard where the MPI node is a thread. Performance comparisons with MPICH2-Nemesis show that thread-based implementations exploit adequately the multicore architectures under oversubscription, what could make MPI competitive with OpenMP-like solutions.
Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments

DTIC Science & Technology

2013-05-01

consolidated at the sender side. At the receiver side, the messages are deconsolidated and delivered to the appropriate thread. This approach bears some...Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. Panda . Performance comparison of mpi implementations over infiniband, myrinet and quadrics
Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

NASA Astrophysics Data System (ADS)

Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

2013-03-01

The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
Genetic mapping of 15 human X chromosomal forensic short tandem repeat (STR) loci by means of multi-core parallelization.

PubMed

Diegoli, Toni Marie; Rohde, Heinrich; Borowski, Stefan; Krawczak, Michael; Coble, Michael D; Nothnagel, Michael

2016-11-01

Typing of X chromosomal short tandem repeat (X STR) markers has become a standard element of human forensic genetic analysis. Joint consideration of many X STR markers at a time increases their discriminatory power but, owing to physical linkage, requires inter-marker recombination rates to be accurately known. We estimated the recombination rates between 15 well established X STR markers using genotype data from 158 families (1041 individuals) and following a previously proposed likelihood-based approach that allows for single-step mutations. To meet the computational requirements of this family-based type of analysis, we modified a previous implementation so as to allow multi-core parallelization on a high-performance computing system. While we obtained recombination rate estimates larger than zero for all but one pair of adjacent markers within the four previously proposed linkage groups, none of the three X STR pairs defining the junctions of these groups yielded a recombination rate estimate of 0.50. Corroborating previous studies, our results therefore argue against a simple model of independent X chromosomal linkage groups. Moreover, the refined recombination fraction estimates obtained in our study will facilitate the appropriate joint consideration of all 15 investigated markers in forensic analysis. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
RTEMS SMP and MTAPI for Efficient Multi-Core Space Applications on LEON3/LEON4 Processors

NASA Astrophysics Data System (ADS)

Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco

2015-09-01

This paper presents the final result of an European Space Agency (ESA) activity aimed at improving the software support for LEON processors used in SMP configurations. One of the benefits of using a multicore system in a SMP configuration is that in many instances it is possible to better utilize the available processing resources by load balancing between cores. This however comes with the cost of having to synchronize operations between cores, leading to increased complexity. While in an AMP system one can use multiple instances of operating systems that are only uni-processor capable, a SMP system requires the operating system to be written to support multicore systems. In this activity we have improved and extended the SMP support of the RTEMS real-time operating system and ensured that it fully supports the multicore capable LEON processors. The targeted hardware in the activity has been the GR712RC, a dual-core core LEON3FT processor, and the functional prototype of ESA's Next Generation Multiprocessor (NGMP), a quad core LEON4 processor. The final version of the NGMP is now available as a product under the name GR740. An implementation of the Multicore Task Management API (MTAPI) has been developed as part of this activity to aid in the parallelization of applications for RTEMS SMP. It allows for simplified development of parallel applications using the task-based programming model. An existing space application, the Gaia Video Processing Unit, has been ported to RTEMS SMP using the MTAPI implementation to demonstrate the feasibility and usefulness of multicore processors for space payload software. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.
High performance in silico virtual drug screening on many-core processors.

PubMed

McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-05-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
High performance in silico virtual drug screening on many-core processors

PubMed Central

Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-01-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727
Node Resource Manager: A Distributed Computing Software Framework Used for Solving Geophysical Problems

NASA Astrophysics Data System (ADS)

Lawry, B. J.; Encarnacao, A.; Hipp, J. R.; Chang, M.; Young, C. J.

2011-12-01

With the rapid growth of multi-core computing hardware, it is now possible for scientific researchers to run complex, computationally intensive software on affordable, in-house commodity hardware. Multi-core CPUs (Central Processing Unit) and GPUs (Graphics Processing Unit) are now commonplace in desktops and servers. Developers today have access to extremely powerful hardware that enables the execution of software that could previously only be run on expensive, massively-parallel systems. It is no longer cost-prohibitive for an institution to build a parallel computing cluster consisting of commodity multi-core servers. In recent years, our research team has developed a distributed, multi-core computing system and used it to construct global 3D earth models using seismic tomography. Traditionally, computational limitations forced certain assumptions and shortcuts in the calculation of tomographic models; however, with the recent rapid growth in computational hardware including faster CPU's, increased RAM, and the development of multi-core computers, we are now able to perform seismic tomography, 3D ray tracing and seismic event location using distributed parallel algorithms running on commodity hardware, thereby eliminating the need for many of these shortcuts. We describe Node Resource Manager (NRM), a system we developed that leverages the capabilities of a parallel computing cluster. NRM is a software-based parallel computing management framework that works in tandem with the Java Parallel Processing Framework (JPPF, http://www.jppf.org/), a third party library that provides a flexible and innovative way to take advantage of modern multi-core hardware. NRM enables multiple applications to use and share a common set of networked computers, regardless of their hardware platform or operating system. Using NRM, algorithms can be parallelized to run on multiple processing cores of a distributed computing cluster of servers and desktops, which results in a dramatic speedup in execution time. NRM is sufficiently generic to support applications in any domain, as long as the application is parallelizable (i.e., can be subdivided into multiple individual processing tasks). At present, NRM has been effective in decreasing the overall runtime of several algorithms: 1) the generation of a global 3D model of the compressional velocity distribution in the Earth using tomographic inversion, 2) the calculation of the model resolution matrix, model covariance matrix, and travel time uncertainty for the aforementioned velocity model, and 3) the correlation of waveforms with archival data on a massive scale for seismic event detection. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors

PubMed Central

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-01-01

Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction. PMID:27399722
Writing Bragg Gratings in Multicore Fibers.

PubMed

Lindley, Emma Y; Min, Seong-Sik; Leon-Saval, Sergio G; Cvetojevic, Nick; Lawrence, Jon; Ellis, Simon C; Bland-Hawthorn, Joss

2016-04-20

Fiber Bragg gratings in multicore fibers can be used as compact and robust filters in astronomical and other research and commercial applications. Strong suppression at a single wavelength requires that all cores have matching transmission profiles. These gratings cannot be inscribed using the same method as for single-core fibers because the curved surface of the cladding acts as a lens, focusing the incoming UV laser beam and causing variations in exposure between cores. Therefore we use an additional optical element to ensure that the beam shape does not change while passing through the cross-section of the multicore fiber. This consists of a glass capillary tube which has been polished flat on one side, which is then placed over the section of the fiber to be inscribed. The laser beam enters the fiber through the flat surface of the capillary tube and hence maintains its original dimensions. This paper demonstrates the improvements in core-to-core uniformity for a 7-core fiber using this method. The technique can be generalized to larger multicore fibers.
Writing Bragg Gratings in Multicore Fibers

PubMed Central

Lindley, Emma Y.; Min, Seong-sik; Leon-Saval, Sergio G.; Cvetojevic, Nick; Lawrence, Jon; Ellis, Simon C.; Bland-Hawthorn, Joss

2016-01-01

Fiber Bragg gratings in multicore fibers can be used as compact and robust filters in astronomical and other research and commercial applications. Strong suppression at a single wavelength requires that all cores have matching transmission profiles. These gratings cannot be inscribed using the same method as for single-core fibers because the curved surface of the cladding acts as a lens, focusing the incoming UV laser beam and causing variations in exposure between cores. Therefore we use an additional optical element to ensure that the beam shape does not change while passing through the cross-section of the multicore fiber. This consists of a glass capillary tube which has been polished flat on one side, which is then placed over the section of the fiber to be inscribed. The laser beam enters the fiber through the flat surface of the capillary tube and hence maintains its original dimensions. This paper demonstrates the improvements in core-to-core uniformity for a 7-core fiber using this method. The technique can be generalized to larger multicore fibers. PMID:27167576
Single-step generation of metal-plasma polymer multicore@shell nanoparticles from the gas phase.

PubMed

Solař, Pavel; Polonskyi, Oleksandr; Olbricht, Ansgar; Hinz, Alexander; Shelemin, Artem; Kylián, Ondřej; Choukourov, Andrei; Faupel, Franz; Biederman, Hynek

2017-08-17

Nanoparticles composed of multiple silver cores and a plasma polymer shell (multicore@shell) were prepared in a single step with a gas aggregation cluster source operating with Ar/hexamethyldisiloxane mixtures and optionally oxygen. The size distribution of the metal inclusions as well as the chemical composition and the thickness of the shells were found to be controlled by the composition of the working gas mixture. Shell matrices ranging from organosilicon plasma polymer to nearly stoichiometric SiO 2 were obtained. The method allows facile fabrication of multicore@shell nanoparticles with tailored functional properties, as demonstrated here with the optical response.
Experimental demonstration of large capacity WSDM optical access network with multicore fibers and advanced modulation formats.

PubMed

Li, Borui; Feng, Zhenhua; Tang, Ming; Xu, Zhilin; Fu, Songnian; Wu, Qiong; Deng, Lei; Tong, Weijun; Liu, Shuang; Shum, Perry Ping

2015-05-04

Towards the next generation optical access network supporting large capacity data transmission to enormous number of users covering a wider area, we proposed a hybrid wavelength-space division multiplexing (WSDM) optical access network architecture utilizing multicore fibers with advanced modulation formats. As a proof of concept, we experimentally demonstrated a WSDM optical access network with duplex transmission using our developed and fabricated multicore (7-core) fibers with 58.7km distance. As a cost-effective modulation scheme for access network, the optical OFDM-QPSK signal has been intensity modulated on the downstream transmission in the optical line terminal (OLT) and it was directly detected in the optical network unit (ONU) after MCF transmission. 10 wavelengths with 25GHz channel spacing from an optical comb generator are employed and each wavelength is loaded with 5Gb/s OFDM-QPSK signal. After amplification, power splitting, and fan-in multiplexer, 10-wavelength downstream signal was injected into six outer layer cores simultaneously and the aggregation downstream capacity reaches 300 Gb/s. -16 dBm sensitivity has been achieved for 3.8 × 10^-3 bit error ratio (BER) with 7% Forward Error Correction (FEC) limit for all wavelengths in every core. Upstream signal from ONU side has also been generated and the bidirectional transmission in the same core causes negligible performance degradation to the downstream signal. As a universal platform for wired/wireless data access, our proposed architecture provides additional dimension for high speed mobile signal transmission and we hence demonstrated an upstream delivery of 20Gb/s per wavelength with QPSK modulation formats using the inner core of MCF emulating a mobile backhaul service. The IQ modulated data was coherently detected in the OLT side. -19 dBm sensitivity has been achieved under the FEC limit and more than 18 dB power budget is guaranteed.
On Designing Multicore-Aware Simulators for Systems Biology Endowed with OnLine Statistics

PubMed Central

Calcagno, Cristina; Coppo, Mario

2014-01-01

The paper arguments are on enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support the bioinformatics scientists .In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories turning into big data that should be analysed by statistic and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed. PMID:25050327
On designing multicore-aware simulators for systems biology endowed with OnLine statistics.

PubMed

Aldinucci, Marco; Calcagno, Cristina; Coppo, Mario; Damiani, Ferruccio; Drocco, Maurizio; Sciacca, Eva; Spinella, Salvatore; Torquati, Massimo; Troina, Angelo

2014-01-01

The paper arguments are on enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support the bioinformatics scientists .In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories turning into big data that should be analysed by statistic and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.
Ultrasound phase rotation beamforming on multi-core DSP.

PubMed

Ma, Jieming; Karadayi, Kerem; Ali, Murtaza; Kim, Yongmin

2014-01-01

Phase rotation beamforming (PRBF) is a commonly-used digital receive beamforming technique. However, due to its high computational requirement, it has traditionally been supported by hardwired architectures, e.g., application-specific integrated circuits (ASICs) or more recently field-programmable gate arrays (FPGAs). In this study, we investigated the feasibility of supporting software-based PRBF on a multi-core DSP. To alleviate the high computing requirement, the analog front-end (AFE) chips integrating quadrature demodulation in addition to analog-to-digital conversion were defined and used. With these new AFE chips, only delay alignment and phase rotation need to be performed by DSP, substantially reducing the computational load. We implemented the delay alignment and phase rotation modules on a Texas Instruments C6678 DSP with 8 cores. We found it takes 200 μs to beamform 2048 samples from 64 channels using 2 cores. With 4 cores, 20 million samples can be beamformed in one second. Therefore, ADC frequencies up to 40 MHz with 2:1 decimation in AFE chips or up to 20 MHz with no decimation can be supported as long as the ADC-to-DSP I/O requirement can be met. The remaining 4 cores can work on back-end processing tasks and applications, e.g., color Doppler or ultrasound elastography. One DSP being able to handle both beamforming and back-end processing could lead to low-power and low-cost ultrasound machines, benefiting ultrasound imaging in general, particularly portable ultrasound machines. Copyright © 2013 Elsevier B.V. All rights reserved.

Study of Solid State Drives performance in PROOF distributed analysis system

NASA Astrophysics Data System (ADS)

Panitkin, S. Y.; Ernst, M.; Petkus, R.; Rind, O.; Wenaus, T.

2010-04-01

Solid State Drives (SSD) is a promising storage technology for High Energy Physics parallel analysis farms. Its combination of low random access time and relatively high read speed is very well suited for situations where multiple jobs concurrently access data located on the same drive. It also has lower energy consumption and higher vibration tolerance than Hard Disk Drive (HDD) which makes it an attractive choice in many applications raging from personal laptops to large analysis farms. The Parallel ROOT Facility - PROOF is a distributed analysis system which allows to exploit inherent event level parallelism of high energy physics data. PROOF is especially efficient together with distributed local storage systems like Xrootd, when data are distributed over computing nodes. In such an architecture the local disk subsystem I/O performance becomes a critical factor, especially when computing nodes use multi-core CPUs. We will discuss our experience with SSDs in PROOF environment. We will compare performance of HDD with SSD in I/O intensive analysis scenarios. In particular we will discuss PROOF system performance scaling with a number of simultaneously running analysis jobs.
Tiled architecture of a CNN-mostly IP system

NASA Astrophysics Data System (ADS)

Spaanenburg, Lambert; Malki, Suleyman

2009-05-01

Multi-core architectures have been popularized with the advent of the IBM CELL. On a finer grain the problems in scheduling multi-cores have already existed in the tiled architectures, such as the EPIC and Da Vinci. It is not easy to evaluate the performance of a schedule on such architecture as historical data are not available. One solution is to compile algorithms for which an optimal schedule is known by analysis. A typical example is an algorithm that is already defined in terms of many collaborating simple nodes, such as a Cellular Neural Network (CNN). A simple node with a local register stack together with a 'rotating wheel' internal communication mechanism has been proposed. Though the basic CNN allows for a tiled implementation of a tiled algorithm on a tiled structure, a practical CNN system will have to disturb this regularity by the additional need for arithmetical and logical operations. Arithmetic operations are needed for instance to accommodate for low-level image processing, while logical operations are needed to fork and merge different data streams without use of the external memory. It is found that the 'rotating wheel' internal communication mechanism still handles such mechanisms without the need for global control. Overall the CNN system provides for a practical network size as implemented on a FPGA, can be easily used as embedded IP and provides a clear benchmark for a multi-core compiler.
Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.

PubMed

Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard

2015-02-01

Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
Scaling GDL for Multi-cores to Process Planck HFI Beams Monte Carlo on HPC

NASA Astrophysics Data System (ADS)

Coulais, A.; Schellens, M.; Duvert, G.; Park, J.; Arabas, S.; Erard, S.; Roudier, G.; Hivon, E.; Mottet, S.; Laurent, B.; Pinter, M.; Kasradze, N.; Ayad, M.

2014-05-01

After reviewing the majors progress done in GDL -now in 0.9.4- on performance and plotting capabilities since ADASS XXI paper (Coulais et al. 2012), we detail how a large code for Planck HFI beams Monte Carlo was successfully transposed from IDL to GDL on HPC.
An embedded multi-core parallel model for real-time stereo imaging

NASA Astrophysics Data System (ADS)

He, Wenjing; Hu, Jian; Niu, Jingyu; Li, Chuanrong; Liu, Guangyu

2018-04-01

The real-time processing based on embedded system will enhance the application capability of stereo imaging for LiDAR and hyperspectral sensor. The task partitioning and scheduling strategies for embedded multiprocessor system starts relatively late, compared with that for PC computer. In this paper, aimed at embedded multi-core processing platform, a parallel model for stereo imaging is studied and verified. After analyzing the computing amount, throughout capacity and buffering requirements, a two-stage pipeline parallel model based on message transmission is established. This model can be applied to fast stereo imaging for airborne sensors with various characteristics. To demonstrate the feasibility and effectiveness of the parallel model, a parallel software was designed using test flight data, based on the 8-core DSP processor TMS320C6678. The results indicate that the design performed well in workload distribution and had a speed-up ratio up to 6.4.
Parallel transformation of K-SVD solar image denoising algorithm

NASA Astrophysics Data System (ADS)

Liang, Youwen; Tian, Yu; Li, Mei

2017-02-01

The images obtained by observing the sun through a large telescope always suffered with noise due to the low SNR. K-SVD denoising algorithm can effectively remove Gauss white noise. Training dictionaries for sparse representations is a time consuming task, due to the large size of the data involved and to the complexity of the training algorithms. In this paper, an OpenMP parallel programming language is proposed to transform the serial algorithm to the parallel version. Data parallelism model is used to transform the algorithm. Not one atom but multiple atoms updated simultaneously is the biggest change. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. Speedup of the program is 13.563 in condition of using 16 cores. This parallel version can fully utilize the multi-core CPU hardware resources, greatly reduce running time and easily to transplant in multi-core platform.
Parallelization of the preconditioned IDR solver for modern multicore computer systems

NASA Astrophysics Data System (ADS)

Bessonov, O. A.; Fedoseyev, A. I.

2012-10-01

This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
Scrambled coherent superposition for enhanced optical fiber communication in the nonlinear transmission regime.

PubMed

Liu, Xiang; Chandrasekhar, S; Winzer, P J; Chraplyvy, A R; Tkach, R W; Zhu, B; Taunay, T F; Fishteyn, M; DiGiovanni, D J

2012-08-13

Coherent superposition of light waves has long been used in various fields of science, and recent advances in digital coherent detection and space-division multiplexing have enabled the coherent superposition of information-carrying optical signals to achieve better communication fidelity on amplified-spontaneous-noise limited communication links. However, fiber nonlinearity introduces highly correlated distortions on identical signals and diminishes the benefit of coherent superposition in nonlinear transmission regime. Here we experimentally demonstrate that through coordinated scrambling of signal constellations at the transmitter, together with appropriate unscrambling at the receiver, the full benefit of coherent superposition is retained in the nonlinear transmission regime of a space-diversity fiber link based on an innovatively engineered multi-core fiber. This scrambled coherent superposition may provide the flexibility of trading communication capacity for performance in future optical fiber networks, and may open new possibilities in high-performance and secure optical communications.
Crystal MD: The massively parallel molecular dynamics software for metal with BCC structure

NASA Astrophysics Data System (ADS)

Hu, Changjun; Bai, He; He, Xinfu; Zhang, Boyao; Nie, Ningming; Wang, Xianmeng; Ren, Yingwen

2017-02-01

Material irradiation effect is one of the most important keys to use nuclear power. However, the lack of high-throughput irradiation facility and knowledge of evolution process, lead to little understanding of the addressed issues. With the help of high-performance computing, we could make a further understanding of micro-level-material. In this paper, a new data structure is proposed for the massively parallel simulation of the evolution of metal materials under irradiation environment. Based on the proposed data structure, we developed the new molecular dynamics software named Crystal MD. The simulation with Crystal MD achieved over 90% parallel efficiency in test cases, and it takes more than 25% less memory on multi-core clusters than LAMMPS and IMD, which are two popular molecular dynamics simulation software. Using Crystal MD, a two trillion particles simulation has been performed on Tianhe-2 cluster.
High performance 3D adaptive filtering for DSP based portable medical imaging systems

NASA Astrophysics Data System (ADS)

Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark

2015-03-01

Portable medical imaging devices have proven valuable for emergency medical services both in the field and hospital environments and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. Despite their constraints on power, size and cost, portable imaging devices must still deliver high quality images. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but is computationally very demanding and hence often cannot be run with sufficient performance on a portable platform. In recent years, advanced multicore digital signal processors (DSP) have been developed that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms on a portable platform. In this study, the performance of a 3D adaptive filtering algorithm on a DSP is investigated. The performance is assessed by filtering a volume of size 512x256x128 voxels sampled at a pace of 10 MVoxels/sec with an Ultrasound 3D probe. Relative performance and power is addressed between a reference PC (Quad Core CPU) and a TMS320C6678 DSP from Texas Instruments.
Wireless Interconnects for Intra-chip & Inter-chip Transmission

NASA Astrophysics Data System (ADS)

Narde, Rounak Singh

With the emergence of Internet of Things and information revolution, the demand of high performance computing systems is increasing. The copper interconnects inside the computing chips have evolved into a sophisticated network of interconnects known as Network on Chip (NoC) comprising of routers, switches, repeaters, just like computer networks. When network on chip is implemented on a large scale like in Multicore Multichip (MCMC) systems for High Performance Computing (HPC) systems, length of interconnects increases and so are the problems like power dissipation, interconnect delays, clock synchronization and electrical noise. In this thesis, wireless interconnects are chosen as the substitute for wired copper interconnects. Wireless interconnects offer easy integration with CMOS fabrication and chip packaging. Using wireless interconnects working at unlicensed mm-wave band (57-64GHz), high data rate of Gbps can be achieved. This thesis presents study of transmission between zigzag antennas as wireless interconnects for Multichip multicores (MCMC) systems and 3D IC. For MCMC systems, a four-chips 16-cores model is analyzed with only four wireless interconnects in three configurations with different antenna orientations and locations. Return loss and transmission coefficients are simulated in ANSYS HFSS. Moreover, wireless interconnects are designed, fabricated and tested on a 6'' silicon wafer with resistivity of 55O-cm using a basic standard CMOS process. Wireless interconnect are designed to work at 30GHz using ANSYS HFSS. The fabricated antennas are resonating around 20GHz with a return loss of less than -10dB. The transmission coefficients between antenna pair within a 20mm x 20mm silicon die is found to be varying between -45dB to -55dB. Furthermore, wireless interconnect approach is extended for 3D IC. Wireless interconnects are implemented as zigzag antenna. This thesis extends the work of analyzing the wireless interconnects in 3D IC with different configurations of antenna orientations and coolants. The return loss and transmission coefficients are simulated using ANSYS HFSS.
Power splitting of 1 × 16 in multicore photonic crystal fibers

NASA Astrophysics Data System (ADS)

Malka, Dror; Peled, Aaron

2017-09-01

A novel concept of 1 × 16 power splitter based on a variable multicore photonic crystal fiber (PCF) structure is described. Numerical simulations showed how the optical signal can be split in a PCF structure having dimensions of 60 μm × 60 μm × 3.582 mm. The coupled mode analysis and beam propagation method (BPM) was used for analyzing the multicore PCF based 1 × 16 splitter. The input optical signal at a wavelength of 1.55 μm inserted into the central core was divided into sixteen output cores, each with a 6.25% of the total power. The full width half maximum (FWHM) bandwidth found for each core was 100 nm.
Considerations for Future Climate Data Stewardship

NASA Astrophysics Data System (ADS)

Halem, M.; Nguyen, P. T.; Chapman, D. R.

2009-12-01

In this talk, we will describe the lessons learned based on processing and generating a decade of gridded AIRS and MODIS IR sounding data. We describe the challenges faced in accessing and sharing very large data sets, maintaining data provenance under evolving technologies, obtaining access to legacy calibration data and the permanent preservation of Earth science data records for on demand services. These lessons suggest a new approach to data stewardship will be required for the next decade of hyper spectral instruments combined with cloud resolving models. It will not be sufficient for stewards of future data centers to just provide the public with access to archived data but our experience indicates that data needs to reside close to computers with ultra large disc farms and tens of thousands of processors to deliver complex services on demand over very high speed networks much like the offerings of search engines today. Over the first decade of the 21st century, petabyte data records were acquired from the AIRS instrument on Aqua and the MODIS instrument on Aqua and Terra. NOAA data centers also maintain petabytes of operational IR sounders collected over the past four decades. The UMBC Multicore Computational Center (MC2) developed a Service Oriented Atmospheric Radiance gridding system (SOAR) to allow users to select IR sounding instruments from multiple archives and choose space-time- spectral periods of Level 1B data to download, grid, visualize and analyze on demand. Providing this service requires high data rate bandwidth access to the on line disks at Goddard. After 10 years, cost effective disk storage technology finally caught up with the MODIS data volume making it possible for Level 1B MODIS data to be available on line. However, 10Ge fiber optic networks to access large volumes of data are still not available from CSFC to serve the broader community. Data transfer rates are well below 10MB/s limiting their usefulness for climate studies. During this decade, processor performance hit a power wall leading computer vendors to design multicore processor chips. High performance computer systems obtained petaflop performance by clustering tens of thousands of multicore processor chips. Thus, power consumption and autonomic recovery from processor and disc failures have become major cost and technical considerations for future data archives. To address these new architecture requirements, a transparent parallel programming paradigm, the Hadoop MapReduce cloud computing system, became available as an open S/W system. In addition, the Hadoop File System and manages the distribution of data to these processors as well as backs up the processing in the event of any processor or disc failure. However, to employ this paradigm, the data needs to be stored on the computer system. We conclude this talk with a climate data preservation approach that addresses the scalability crisis to exabyte data requirements for the next decade based on projections of processor, disc data density and bandwidth doubling rates.
High Performance Data Transfer for Distributed Data Intensive Sciences

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fang, Chin; Cottrell, R 'Les' A.; Hanushevsky, Andrew B.

We report on the development of ZX software providing high performance data transfer and encryption. The design scales in: computation power, network interfaces, and IOPS while carefully balancing the available resources. Two U.S. patent-pending algorithms help tackle data sets containing lots of small files and very large files, and provide insensitivity to network latency. It has a cluster-oriented architecture, using peer-to-peer technologies to ease deployment, operation, usage, and resource discovery. Its unique optimizations enable effective use of flash memory. Using a pair of existing data transfer nodes at SLAC and NERSC, we compared its performance to that of bbcp andmore » GridFTP and determined that they were comparable. With a proof of concept created using two four-node clusters with multiple distributed multi-core CPUs, network interfaces and flash memory, we achieved 155Gbps memory-to-memory over a 2x100Gbps link aggregated channel and 70Gbps file-to-file with encryption over a 5000 mile 100Gbps link.« less
Exploiting MIC architectures for the simulation of channeling of charged particles in crystals

NASA Astrophysics Data System (ADS)

Bagli, Enrico; Karpusenko, Vadim

2016-08-01

Coherent effects of ultra-relativistic particles in crystals is an area of science under development. DYNECHARM + + is a toolkit for the simulation of coherent interactions between high-energy charged particles and complex crystal structures. The particle trajectory in a crystal is computed through numerical integration of the equation of motion. The code was revised and improved in order to exploit parallelization on multi-cores and vectorization of single instructions on multiple data. An Intel Xeon Phi card was adopted for the performance measurements. The computation time was proved to scale linearly as a function of the number of physical and virtual cores. By enabling the auto-vectorization flag of the compiler a three time speedup was obtained. The performances of the card were compared to the Dual Xeon ones.
Implications of Multi-Core Architectures on the Development of Multiple Independent Levels of Security (MILS) Compliant Systems

DTIC Science & Technology

2012-10-01

REPORT 3. DATES COVERED (From - To) MAR 2010 – APR 2012 4 . TITLE AND SUBTITLE IMPLICATIONS OF MULT-CORE ARCHITECTURES ON THE DEVELOPMENT OF...Framework for Multicore Information Flow Analysis ...................................... 23 4 4.1 A Hypothetical Reference Architecture... 4 Figure 2: Pentium II Block Diagram
"Photonic lantern" spectral filters in multi-core Fiber.

PubMed

Birks, T A; Mangan, B J; Díez, A; Cruz, J L; Murphy, D F

2012-06-18

Fiber Bragg gratings are written across all 120 single-mode cores of a multi-core optical Fiber. The Fiber is interfaced to multimode ports by tapering it within a depressed-index glass jacket. The result is a compact multimode "photonic lantern" filter with astrophotonic applications. The tapered structure is also an effective mode scrambler.
Low-power, transparent optical network interface for high bandwidth off-chip interconnects.

PubMed

Liboiron-Ladouceur, Odile; Wang, Howard; Garg, Ajay S; Bergman, Keren

2009-04-13

The recent emergence of multicore architectures and chip multiprocessors (CMPs) has accelerated the bandwidth requirements in high-performance processors for both on-chip and off-chip interconnects. For next generation computing clusters, the delivery of scalable power efficient off-chip communications to each compute node has emerged as a key bottleneck to realizing the full computational performance of these systems. The power dissipation is dominated by the off-chip interface and the necessity to drive high-speed signals over long distances. We present a scalable photonic network interface approach that fully exploits the bandwidth capacity offered by optical interconnects while offering significant power savings over traditional E/O and O/E approaches. The power-efficient interface optically aggregates electronic serial data streams into a multiple WDM channel packet structure at time-of-flight latencies. We demonstrate a scalable optical network interface with 70% improvement in power efficiency for a complete end-to-end PCI Express data transfer.
Secure and Resilient Functional Modeling for Navy Cyber-Physical Systems

DTIC Science & Technology

2017-05-24

Functional Modeling Compiler (SCCT) FM Compiler and Key Performance Indicators (KPI) May 2018 Pending. Model Management Backbone (SCCT) MMB Demonstration...implement the agent- based distributed runtime. - KPIs for single/multicore controllers and temporal/spatial domains. - Integration of the model management ...Distributed Runtime (UCI) Not started. Model Management Backbone (SCCT) Not started. Siemens Corporation Corporate Technology Unrestricted
Scalable, High-performance 3D Imaging Software Platform: System Architecture and Application to Virtual Colonoscopy

PubMed Central

Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli; Brett, Bevin

2013-01-01

One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. In this work, we have developed a software platform that is designed to support high-performance 3D medical image processing for a wide range of applications using increasingly available and affordable commodity computing systems: multi-core, clusters, and cloud computing systems. To achieve scalable, high-performance computing, our platform (1) employs size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D image processing algorithms; (2) supports task scheduling for efficient load distribution and balancing; and (3) consists of a layered parallel software libraries that allow a wide range of medical applications to share the same functionalities. We evaluated the performance of our platform by applying it to an electronic cleansing system in virtual colonoscopy, with initial experimental results showing a 10 times performance improvement on an 8-core workstation over the original sequential implementation of the system. PMID:23366803

High-performance computing for airborne applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Quinn, Heather M; Manuzzato, Andrea; Fairbanks, Tom

2010-06-28

Recently, there has been attempts to move common satellite tasks to unmanned aerial vehicles (UAVs). UAVs are significantly cheaper to buy than satellites and easier to deploy on an as-needed basis. The more benign radiation environment also allows for an aggressive adoption of state-of-the-art commercial computational devices, which increases the amount of data that can be collected. There are a number of commercial computing devices currently available that are well-suited to high-performance computing. These devices range from specialized computational devices, such as field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), to traditional computing platforms, such as microprocessors. Even thoughmore » the radiation environment is relatively benign, these devices could be susceptible to single-event effects. In this paper, we will present radiation data for high-performance computing devices in a accelerated neutron environment. These devices include a multi-core digital signal processor, two field-programmable gate arrays, and a microprocessor. From these results, we found that all of these devices are suitable for many airplane environments without reliability problems.« less
Ultrathin endoscopes based on multicore fibers and adaptive optics: a status review and perspectives.

PubMed

Andresen, Esben Ravn; Sivankutty, Siddharth; Tsvirkun, Viktor; Bouwmans, Géraud; Rigneault, Hervé

2016-12-01

We take stock of the progress that has been made into developing ultrathin endoscopes assisted by wave front shaping. We focus our review on multicore fiber-based lensless endoscopes intended for multiphoton imaging applications. We put the work into perspective by comparing with alternative approaches and by outlining the challenges that lie ahead.
Amplification and noise properties of an erbium-doped multicore fiber amplifier.

PubMed

Abedin, K S; Taunay, T F; Fishteyn, M; Yan, M F; Zhu, B; Fini, J M; Monberg, E M; Dimarcello, F V; Wisk, P W

2011-08-15

A multicore erbium-doped fiber (MC-EDF) amplifier for simultaneous amplification in the 7-cores has been developed, and the gain and noise properties of individual cores have been studied. The pump and signal radiation were coupled to individual cores of MC-EDF using two tapered fiber bundled (TFB) couplers with low insertion loss. For a pump power of 146 mW, the average gain achieved in the MC-EDF fiber was 30 dB, and noise figure was less than 4 dB. The net useful gain from the multicore-amplifier, after taking into consideration of all the passive losses, was about 23-27 dB. Pump induced ASE noise transfer between the neighboring channel was negligible. © 2011 Optical Society of America
Encapsulating model complexity and landscape-scale analyses of state-and-transition simulation models: an application of ecoinformatics and juniper encroachment in sagebrush steppe ecosystems

USGS Publications Warehouse

O'Donnell, Michael

2015-01-01

State-and-transition simulation modeling relies on knowledge of vegetation composition and structure (states) that describe community conditions, mechanistic feedbacks such as fire that can affect vegetation establishment, and ecological processes that drive community conditions as well as the transitions between these states. However, as the need for modeling larger and more complex landscapes increase, a more advanced awareness of computing resources becomes essential. The objectives of this study include identifying challenges of executing state-and-transition simulation models, identifying common bottlenecks of computing resources, developing a workflow and software that enable parallel processing of Monte Carlo simulations, and identifying the advantages and disadvantages of different computing resources. To address these objectives, this study used the ApexRMS® SyncroSim software and embarrassingly parallel tasks of Monte Carlo simulations on a single multicore computer and on distributed computing systems. The results demonstrated that state-and-transition simulation models scale best in distributed computing environments, such as high-throughput and high-performance computing, because these environments disseminate the workloads across many compute nodes, thereby supporting analysis of larger landscapes, higher spatial resolution vegetation products, and more complex models. Using a case study and five different computing environments, the top result (high-throughput computing versus serial computations) indicated an approximate 96.6% decrease of computing time. With a single, multicore compute node (bottom result), the computing time indicated an 81.8% decrease relative to using serial computations. These results provide insight into the tradeoffs of using different computing resources when research necessitates advanced integration of ecoinformatics incorporating large and complicated data inputs and models. - See more at: http://aimspress.com/aimses/ch/reader/view_abstract.aspx?file_no=Environ2015030&flag=1#sthash.p1XKDtF8.dpuf
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

NASA Astrophysics Data System (ADS)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.

2018-01-01

We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
802.11ac WLAN MIMO radio-over-fiber distributed antenna system for in-building networks based on multicore fiber

NASA Astrophysics Data System (ADS)

Morant, Maria; Llorente, Roberto

2017-01-01

In this work we propose and evaluate experimentally the performance of IEEE 802.11ac WLAN standard signals in radio-over-fiber (RoF) distributed-antenna systems based on multicore fiber (MCF) for in-building WLAN connectivity. The RoF performance of WLAN signals with different bandwidth is investigated considering up to IEEE 802.11ac maximum of 160 MHz per user. We evaluate experimentally the performance of WLAN signals employing different modulation and coding schemes achieving bitrates from 78 Mbps to 1404 Mbps per user in distances up to 300 m in a 4-core MCF. The performance of the wireless standard multiple-input multiple-output (MIMO) processing algorithms included in WLAN signals applied to the RoF transmission in MCF optical systems is also evaluated. The impact on the quality of the signal from one of the cores in the MIMO processing is investigated and compared with the results achieved with single-input single-output (SISO) transmission in each core. We measured the error vector magnitude (EVM) and the OFDM data burst information of the received WLAN signals after RoF transmission for different distributed-antenna systems with uni- and bi-directional MCF communication. Finally, we compare the received EVM of a single-antenna system (SISO arrangement) with WLAN systems using two antennas (2×2 MIMO) and four antennas (4×4 MIMO).
Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Sourouri, Mohammed; Birger Raknes, Espen

2017-04-01

In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI) the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite its architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory, which also require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but require the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (number of cores and tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh based interconnect. In contrary to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most accelerators, KNL sports a memory subsystem consisting of low-level caches and 16GB of high-bandwidth MCDRAM memory. For capacity computing, up to 400GB of conventional DDR4 memory is provided. Such a strict hierarchical memory layout means that data locality is imperative if the true potential of this product is to be harnessed. In this work, we study a series of optimizations specifically targeting KNL for our EWE based application to reduce the time-to-solution time for the following 3D model sizes in grid points: 1283, 2563 and 5123. We compare the results with an optimized version for multi-core CPUs running on a dual-socket Xeon E5 2680v3 system using OpenMP. Our initial naive implementation on the KNL is roughly 20% faster than the multi-core version, but by using only one thread per core and careful memory placement using the memkind library, we could achieve higher speedups. Additionally, by using the MCDRAM as cache for problem sizes that are smaller than 16 GB further performance improvements were unlocked. Depending on the problem size, our overall results indicate that the KNL based system is approximately 2.2x faster than the 24-core Xeon E5 2680v3 system, with only modest changes to the code.
A Real-Time Linux for Multicore Platforms

DTIC Science & Technology

2013-12-20

under ARO support) to obtain a fully-functional OS for supporting real-time workloads on multicore platforms. This system, called LITMUS -RT...to be specified as plugin components. LITMUS -RT is open-source software (available at The views, opinions and/or findings contained in this report... LITMUS -RT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems), allows different multiprocessor real-time scheduling and
Nonlinear Light Dynamics in Multi-Core Structures

DTIC Science & Technology

2017-02-27

be generated in continuous- discrete optical media such as multi-core optical fiber or waveguide arrays; localisation dynamics in a continuous... discrete nonlinear system. Detailed theoretical analysis is presented of the existence and stability of the discrete -continuous light bullets using a very...and pulse compression using wave collapse (self-focusing) energy localisation dynamics in a continuous- discrete nonlinear system, as implemented in a
Myalgia as the revealing symptom of multicore disease and fibre type disproportion myopathy

PubMed Central

Sobreira*, C; Marques, W; Barreira, A

2003-01-01

Objective: To report the occurrence of myalgia as the revealing symptom of multicore disease and fibre type disproportion myopathy. Methods: The clinical cases of three patients with fibre type disproportion myopathy and one with multicore disease are described. Skeletal muscle biopsies were processed for routine histological and histochemical studies. Results: The clinical picture was unusual in that the symptoms were of late onset and the predominant complaint was muscle pain exacerbated by exercise. Muscle weakness was found in only a single patient, the mother of a patient with fibre type disproportion myopathy. Physical examination was unremarkable in the other patients. Muscle biopsies from patients 1 and 2 contained type I fibres that were considerably smaller than the type II fibres, supporting the diagnosis of fibre type disproportion myopathy. Skeletal muscle of patient 4 showed multiple areas, predominantly but not exclusively in the type I fibres, from which oxidative enzyme activities were absent, as seen in multicore disease. Conclusions: Muscle pain was the main clinical manifestation in our patients. Recognition of the broader clinical expression of these myopathies is important for prognostic reasons and for genetic counselling of the family members. PMID:12933945
Efficient parallel linear scaling construction of the density matrix for Born-Oppenheimer molecular dynamics.

PubMed

Mniszewski, S M; Cawkwell, M J; Wall, M E; Mohd-Yusof, J; Bock, N; Germann, T C; Niklasson, A M N

2015-10-13

We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [ Niklasson, A. M. N. Phys. Rev. B 2002 , 66 , 155115 ] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) architecture are also presented. The accuracy and stability of the algorithm are illustrated with long duration Born-Oppenheimer molecular dynamics simulations of 1000 water molecules and a 303 atom Trp cage protein solvated by 2682 water molecules.
A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems

NASA Astrophysics Data System (ADS)

McClure, J. E.; Prins, J. F.; Miller, C. T.

2014-07-01

Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide speedup of 2.6 × as compared to multi-core CPU solution and 1.8 × compared to GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
Novel magnetic multicore nanoparticles designed for MPI and other biomedical applications: From synthesis to first in vivo studies

PubMed Central

Taupitz, Matthias; Ariza de Schellenberger, Angela; Kosch, Olaf; Eberbeck, Dietmar; Wagner, Susanne; Trahms, Lutz; Hamm, Bernd; Schnorr, Jörg

2018-01-01

Synthesis of novel magnetic multicore particles (MCP) in the nano range, involves alkaline precipitation of iron(II) chloride in the presence of atmospheric oxygen. This step yields green rust, which is oxidized to obtain magnetic nanoparticles, which probably consist of a magnetite/maghemite mixed-phase. Final growth and annealing at 90°C in the presence of a large excess of carboxymethyl dextran gives MCP very promising magnetic properties for magnetic particle imaging (MPI), an emerging medical imaging modality, and magnetic resonance imaging (MRI). The magnetic nanoparticles are biocompatible and thus potential candidates for future biomedical applications such as cardiovascular imaging, sentinel lymph node mapping in cancer patients, and stem cell tracking. The new MCP that we introduce here have three times higher magnetic particle spectroscopy performance at lower and middle harmonics and five times higher MPS signal strength at higher harmonics compared with Resovist®. In addition, the new MCP have also an improved in vivo MPI performance compared to Resovist®, and we here report the first in vivo MPI investigation of this new generation of magnetic nanoparticles. PMID:29300729
DKIST Adaptive Optics System: Simulation Results

NASA Astrophysics Data System (ADS)

Marino, Jose; Schmidt, Dirk

2016-05-01

The 4 m class Daniel K. Inouye Solar Telescope (DKIST), currently under construction, will be equipped with an ultra high order solar adaptive optics (AO) system. The requirements and capabilities of such a solar AO system are beyond those of any other solar AO system currently in operation. We must rely on solar AO simulations to estimate and quantify its performance.We present performance estimation results of the DKIST AO system obtained with a new solar AO simulation tool. This simulation tool is a flexible and fast end-to-end solar AO simulator which produces accurate solar AO simulations while taking advantage of current multi-core computer technology. It relies on full imaging simulations of the extended field Shack-Hartmann wavefront sensor (WFS), which directly includes important secondary effects such as field dependent distortions and varying contrast of the WFS sub-aperture images.
Batched matrix computations on hardware accelerators based on GPUs

DOE PAGES

Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; ...

2015-02-09

Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common, one-sidedmore » factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. Finally, the tested system featured two sockets of Intel Sandy Bridge CPUs and we compared with a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU.« less
Optimizing Irregular Applications for Energy and Performance on the Tilera Many-core Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chavarría-Miranda, Daniel; Panyala, Ajay R.; Halappanavar, Mahantesh

Optimizing applications simultaneously for energy and performance is a complex problem. High performance, parallel, irregular applications are notoriously hard to optimize due to their data-dependent memory accesses, lack of structured locality and complex data structures and code patterns. Irregular kernels are growing in importance in applications such as machine learning, graph analytics and combinatorial scientific computing. Performance- and energy-efficient implementation of these kernels on modern, energy efficient, multicore and many-core platforms is therefore an important and challenging problem. We present results from optimizing two irregular applications { the Louvain method for community detection (Grappolo), and high-performance conjugate gradient (HPCCG) {more » on the Tilera many-core system. We have significantly extended MIT's OpenTuner auto-tuning framework to conduct a detailed study of platform-independent and platform-specific optimizations to improve performance as well as reduce total energy consumption. We explore the optimization design space along three dimensions: memory layout schemes, compiler-based code transformations, and optimization of parallel loop schedules. Using auto-tuning, we demonstrate whole node energy savings of up to 41% relative to a baseline instantiation, and up to 31% relative to manually optimized variants.« less
Multiplexed single-mode wavelength-to-time mapping of multimode light

PubMed Central

Chandrasekharan, Harikumar K; Izdebski, Frauke; Gris-Sánchez, Itandehui; Krstajić, Nikola; Walker, Richard; Bridle, Helen L.; Dalgarno, Paul A.; MacPherson, William N.; Henderson, Robert K.; Birks, Tim A.; Thomson, Robert R.

2017-01-01

When an optical pulse propagates along an optical fibre, different wavelengths travel at different group velocities. As a result, wavelength information is converted into arrival-time information, a process known as wavelength-to-time mapping. This phenomenon is most cleanly observed using a single-mode fibre transmission line, where spatial mode dispersion is not present, but the use of such fibres restricts possible applications. Here we demonstrate that photonic lanterns based on tapered single-mode multicore fibres provide an efficient way to couple multimode light to an array of single-photon avalanche detectors, each of which has its own time-to-digital converter for time-correlated single-photon counting. Exploiting this capability, we demonstrate the multiplexed single-mode wavelength-to-time mapping of multimode light using a multicore fibre photonic lantern with 121 single-mode cores, coupled to 121 detectors on a 32 × 32 detector array. This work paves the way to efficient multimode wavelength-to-time mapping systems with the spectral performance of single-mode systems. PMID:28120822
ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers

PubMed Central

Besnier, Francois; Glover, Kevin A.

2013-01-01

This software package provides an R-based framework to make use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to those users of STRUCTURE dealing with numerous and repeated data analyses, and who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also consists of additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as it is currently the case in STRUCTURE. The package consists of two main functions: MPI_structure() and parallel_structure() as well as an example data file. We compared the performance in computing time for this example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in terms of computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/. PMID:23923012
Parallel k-means++

DOE Office of Scientific and Technical Information (OSTI.GOV)

A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique.more » We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.« less
All-fiber orbital angular momentum mode generation and transmission system

NASA Astrophysics Data System (ADS)

Heng, Xiaobo; Gan, Jiulin; Zhang, Zhishen; Qian, Qi; Xu, Shanhui; Yang, Zhongmin

2017-11-01

We proposed and demonstrated an all-fiber system for generating and transmitting orbital angular momentum (OAM) mode light. A specially designed multi-core fiber (MCF) was used to endow with guide modes different phase change and two tapered transition regions were used for providing low-loss interfaces between different fiber structures. By arranging the refractive index distribution among the multi-cores and controlling the length of MCF, which essentially change the phase difference between the neighboring cores, OAM modes with different topological charge l can be generated selectively. Through two tapered transition regions, the non-OAM mode light can be effectively injected into the MCF and the generated OAM mode light can be easily launched into OAM mode supporting fiber for long distance and high purity transmission. Such an all-fiber OAM mode generation and transmission system owns the merits of flexibility, compactness, portability, and would have practical application value in OAM optical fiber communication systems.

Document Image Parsing and Understanding using Neuromorphic Architecture

DTIC Science & Technology

2015-03-01

processing speed at different layers. In the pattern matching layer, the computing power of multicore processors is explored to reduce the processing...developed to reduce the processing speed at different layers. In the pattern matching layer, the computing power of multicore processors is explored... cortex where the complex data is reduced to abstract representations. The abstract representation is compared to stored patterns in massively parallel
Photonic Crystal Fibers

DTIC Science & Technology

2005-12-01

passive and active versions of each fiber designed under this task. Crystal Fibre shall provide characteristics of the fiber fabricated to include core...passive version of multicore fiber iteration 2. 15. SUBJECT TERMS EOARD, Laser physics, Fibre Lasers, Photonic Crystal, Multicore, Fiber Laser 16...9 00* 0 " CRYSTAL FIBRE INT ODUCTION This report describes the photonic crystal fibers developed under agreement No FA8655-o5-a- 3046. All
Multi-Core Processors: An Enabling Technology for Embedded Distributed Model-Based Control (Postprint)

DTIC Science & Technology

2008-07-01

generation of process partitioning, a thread pipelining becomes possible. In this paper we briefly summarize the requirements and trends for FADEC based... FADEC environment, presenting a hypothetical realization of an example application. Finally we discuss the application of Time-Triggered...based control applications of the future. 15. SUBJECT TERMS Gas turbine, FADEC , Multi-core processing technology, disturbed based control
FOS: A Factored Operating Systems for High Assurance and Scalability on Multicores

DTIC Science & Technology

2012-08-01

computing. It builds on previous work in distributed and microkernel OSes by factoring services out of the kernel, and then further distributing each...2 3.0 Methods, Assumptions, and Procedures (System Design) .................................................. 4 3.1 Microkernel ...cooperating servers. We term such a service a fleet. Figure 2 shows the high-level architecture of fos. A small microkernel runs on every core
Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.

In this paper, we study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver on performance. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism andmore » made it thread-friendly for high thread count. We also developed a bounds-and-bottlenecks performance model of the solver which we used to guide us through the optimization effort, and also carried out performance tuning in the solver’s large parameter space. Finally, as a result, significant speedups were obtained on both machines.« less
GoCxx: a tool to easily leverage C++ legacy code for multicore-friendly Go libraries and frameworks

NASA Astrophysics Data System (ADS)

Binet, Sébastien

2012-12-01

Current HENP libraries and frameworks were written before multicore systems became widely deployed and used. From this environment, a ‘single-thread’ processing model naturally emerged but the implicit assumptions it encouraged are greatly impairing our abilities to scale in a multicore/manycore world. Writing scalable code in C++ for multicore architectures, while doable, is no panacea. Sure, C++11 will improve on the current situation (by standardizing on std::thread, introducing lambda functions and defining a memory model) but it will do so at the price of complicating further an already quite sophisticated language. This level of sophistication has probably already strongly motivated analysis groups to migrate to CPython, hoping for its current limitations with respect to multicore scalability to be either lifted (Grand Interpreter Lock removal) or for the advent of a new Python VM better tailored for this kind of environment (PyPy, Jython, …) Could HENP migrate to a language with none of the deficiencies of C++ (build time, deployment, low level tools for concurrency) and with the fast turn-around time, simplicity and ease of coding of Python? This paper will try to make the case for Go - a young open source language with built-in facilities to easily express and expose concurrency - being such a language. We introduce GoCxx, a tool leveraging gcc-xml's output to automatize the tedious work of creating Go wrappers for foreign languages, a critical task for any language wishing to leverage legacy and field-tested code. We will conclude with the first results of applying GoCxx to real C++ code.
Real-time 3D adaptive filtering for portable imaging systems

NASA Astrophysics Data System (ADS)

Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark

2015-03-01

Portable imaging devices have proven valuable for emergency medical services both in the field and hospital environments and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but is computationally very demanding and hence often not able to run with sufficient performance on a portable platform. In recent years, advanced multicore DSPs have been introduced that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms like 3D adaptive filtering, improving the image quality of portable medical imaging devices. In this study, the performance of a 3D adaptive filtering algorithm on a digital signal processor (DSP) is investigated. The performance is assessed by filtering a volume of size 512x256x128 voxels sampled at a pace of 10 MVoxels/sec.
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

PubMed Central

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-01-01

This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, considering the power consumption of most WSN applications have the characteristic of data dependent behavior, we introduce branches handling mechanism into the solution as well. The experimental result shows that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), and that it can make a system do more valuable works and make more than 99.9% use of the power budget. PMID:28208730
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC.

PubMed

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-02-08

This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, considering the power consumption of most WSN applications have the characteristic of data dependent behavior, we introduce branches handling mechanism into the solution as well. The experimental result shows that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), and that it can make a system do more valuable works and make more than 99.9% use of the power budget.
Efficient Aho-Corasick String Matching on Emerging Multicore Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Villa, Oreste; Secchi, Simone

String matching algorithms are critical to several scientific fields. Beside text processing and databases, emerging applications such as DNA protein sequence analysis, data mining, information security software, antivirus, ma- chine learning, all exploit string matching algorithms [3]. All these applica- tions usually process large quantity of textual data, require high performance and/or predictable execution times. Among all the string matching algorithms, one of the most studied, especially for text processing and security applica- tions, is the Aho-Corasick algorithm. 1 2 Book title goes here Aho-Corasick is an exact, multi-pattern string matching algorithm which performs the search in a time linearlymore » proportional to the length of the input text independently from pattern set size. However, depending on the imple- mentation, when the number of patterns increase, the memory occupation may raise drastically. In turn, this can lead to significant variability in the performance, due to the memory access times and the caching effects. This is a significant concern for many mission critical applications and modern high performance architectures. For example, security applications such as Network Intrusion Detection Systems (NIDS), must be able to scan network traffic against very large dictionaries in real time. Modern Ethernet links reach up to 10 Gbps, and malicious threats are already well over 1 million, and expo- nentially growing [28]. When performing the search, a NIDS should not slow down the network, or let network packets pass unchecked. Nevertheless, on the current state-of-the-art cache based processors, there may be a large per- formance variability when dealing with big dictionaries and inputs that have different frequencies of matching patterns. In particular, when few patterns are matched and they are all in the cache, the procedure is fast. Instead, when they are not in the cache, often because many patterns are matched and the caches are continuously thrashed, they should be retrieved from the system memory and the procedure is slowed down by the increased latency. Efficient implementations of string matching algorithms have been the fo- cus of several works, targeting Field Programmable Gate Arrays [4, 25, 15, 5], highly multi-threaded solutions like the Cray XMT [34], multicore proces- sors [19] or heterogeneous processors like the Cell Broadband Engine [35, 22]. Recently, several researchers have also started to investigate the use Graphic Processing Units (GPUs) for string matching algorithms in security applica- tions [20, 10, 32, 33]. Most of these approaches mainly focus on reaching high peak performance, or try to optimize the memory occupation, rather than looking at performance stability. However, hardware solutions supports only small dictionary sizes due to lack of memory and are difficult to customize, while platforms such as the Cell/B.E. are very complex to program.« less
Rapid Calculation of Max-Min Fair Rates for Multi-Commodity Flows in Fat-Tree Networks

DOE PAGES

Mollah, Md Atiqul; Yuan, Xin; Pakin, Scott; ...

2017-08-29

Max-min fairness is often used in the performance modeling of interconnection networks. Existing methods to compute max-min fair rates for multi-commodity flows have high complexity and are computationally infeasible for large networks. In this paper, we show that by considering topological features, this problem can be solved efficiently for the fat-tree topology that is widely used in data centers and high performance compute clusters. Several efficient new algorithms are developed for this problem, including a parallel algorithm that can take advantage of multi-core and shared-memory architectures. Using these algorithms, we demonstrate that it is possible to find the max-min fairmore » rate allocation for multi-commodity flows in fat-tree networks that support tens of thousands of nodes. We evaluate the run-time performance of the proposed algorithms and show improvement in orders of magnitude over the previously best known method. Finally, we further demonstrate a new application of max-min fair rate allocation that is only computationally feasible using our new algorithms.« less
Rapid Calculation of Max-Min Fair Rates for Multi-Commodity Flows in Fat-Tree Networks

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mollah, Md Atiqul; Yuan, Xin; Pakin, Scott

Max-min fairness is often used in the performance modeling of interconnection networks. Existing methods to compute max-min fair rates for multi-commodity flows have high complexity and are computationally infeasible for large networks. In this paper, we show that by considering topological features, this problem can be solved efficiently for the fat-tree topology that is widely used in data centers and high performance compute clusters. Several efficient new algorithms are developed for this problem, including a parallel algorithm that can take advantage of multi-core and shared-memory architectures. Using these algorithms, we demonstrate that it is possible to find the max-min fairmore » rate allocation for multi-commodity flows in fat-tree networks that support tens of thousands of nodes. We evaluate the run-time performance of the proposed algorithms and show improvement in orders of magnitude over the previously best known method. Finally, we further demonstrate a new application of max-min fair rate allocation that is only computationally feasible using our new algorithms.« less
Multicore Architectures for Multiple Independent Levels of Security Applications

DTIC Science & Technology

2012-09-01

to bolster the MILS effort. However, current MILS operating systems are not designed for multi-core platforms. They do not have the hardware support...current MILS operating systems are not designed for multi‐core platforms. They do not have the hardware support to ensure that the separation...the availability of information at different security classification levels while increasing the overall security of the computing system . Due to the
Polytopol computing for multi-core and distributed systems

NASA Astrophysics Data System (ADS)

Spaanenburg, Henk; Spaanenburg, Lambert; Ranefors, Johan

2009-05-01

Multi-core computing provides new challenges to software engineering. The paper addresses such issues in the general setting of polytopol computing, that takes multi-core problems in such widely differing areas as ambient intelligence sensor networks and cloud computing into account. It argues that the essence lies in a suitable allocation of free moving tasks. Where hardware is ubiquitous and pervasive, the network is virtualized into a connection of software snippets judiciously injected to such hardware that a system function looks as one again. The concept of polytopol computing provides a further formalization in terms of the partitioning of labor between collector and sensor nodes. Collectors provide functions such as a knowledge integrator, awareness collector, situation displayer/reporter, communicator of clues and an inquiry-interface provider. Sensors provide functions such as anomaly detection (only communicating singularities, not continuous observation), they are generally powered or self-powered, amorphous (not on a grid) with generation-and-attrition, field re-programmable, and sensor plug-and-play-able. Together the collector and the sensor are part of the skeleton injector mechanism, added to every node, and give the network the ability to organize itself into some of many topologies. Finally we will discuss a number of applications and indicate how a multi-core architecture supports the security aspects of the skeleton injector.
Active Job Monitoring in Pilots

NASA Astrophysics Data System (ADS)

Kuehn, Eileen; Fischer, Max; Giffels, Manuel; Jung, Christopher; Petzold, Andreas

2015-12-01

Recent developments in high energy physics (HEP) including multi-core jobs and multi-core pilots require data centres to gain a deep understanding of the system to monitor, design, and upgrade computing clusters. Networking is a critical component. Especially the increased usage of data federations, for example in diskless computing centres or as a fallback solution, relies on WAN connectivity and availability. The specific demands of different experiments and communities, but also the need for identification of misbehaving batch jobs, requires an active monitoring. Existing monitoring tools are not capable of measuring fine-grained information at batch job level. This complicates network-aware scheduling and optimisations. In addition, pilots add another layer of abstraction. They behave like batch systems themselves by managing and executing payloads of jobs internally. The number of real jobs being executed is unknown, as the original batch system has no access to internal information about the scheduling process inside the pilots. Therefore, the comparability of jobs and pilots for predicting run-time behaviour or network performance cannot be ensured. Hence, identifying the actual payload is important. At the GridKa Tier 1 centre a specific tool is in use that allows the monitoring of network traffic information at batch job level. This contribution presents the current monitoring approach and discusses recent efforts and importance to identify pilots and their substructures inside the batch system. It will also show how to determine monitoring data of specific jobs from identified pilots. Finally, the approach is evaluated.
Multicore fiber beamforming network for broadband satellite communications

NASA Astrophysics Data System (ADS)

Zainullin, Airat; Vidal, Borja; Macho, Andres; Llorente, Roberto

2017-02-01

Multi-core fiber (MCF) has been one of the main innovations in fiber optics in the last decade. Reported work on MCF has been focused on increasing the transmission capacity of optical communication links by exploiting space-division multiplexing. Additionally, MCF presents a strong potential in optical beamforming networks. The use of MCF can increase the compactness of the broadband antenna array controller. This is of utmost importance in platforms where size and weight are critical parameters such as communications satellites and airplanes. Here, an optical beamforming architecture that exploits the space-division capacity of MCF to implement compact optical beamforming networks is proposed, being a new application field for MCF. The experimental demonstration of this system using a 4-core MCF that controls a four-element antenna array is reported. An analysis of the impact of MCF on the performance of antenna arrays is presented. The analysis indicates that the main limitation comes from the relatively high insertion loss in the MCF fan-in and fan-out devices, which leads to angle dependent losses which can be mitigated by using fixed optical attenuators or a photonic lantern to reduce MCF insertion loss. The crosstalk requirements are also experimentally evaluated for the proposed MCF-based architecture. The potential signal impairment in the beamforming network is analytically evaluated, being of special importance when MCF with a large number of cores is considered. Finally, the optimization of the proposed MCF-based beamforming network is addressed targeting the scalability to large arrays.
Multicore-shell nanofiber architecture of polyimide/polyvinylidene fluoride blend for thermal and long-term stability of lithium ion battery separator.

PubMed

Park, Sejoon; Son, Chung Woo; Lee, Sungho; Kim, Dong Young; Park, Cheolmin; Eom, Kwang Sup; Fuller, Thomas F; Joh, Han-Ik; Jo, Seong Mu

2016-11-11

Li-ion battery, separator, multicoreshell structure, thermal stability, long-term stability. A nanofibrous membrane with multiple cores of polyimide (PI) in the shell of polyvinylidene fluoride (PVdF) was prepared using a facile one-pot electrospinning technique with a single nozzle. Unique multicore-shell (MCS) structure of the electrospun composite fibers was obtained, which resulted from electrospinning a phase-separated polymer composite solution. Multiple PI core fibrils with high molecular orientation were well-embedded across the cross-section and contributed remarkable thermal stabilities to the MCS membrane. Thus, no outbreaks were found in its dimension and ionic resistance up to 200 and 250 °C, respectively. Moreover, the MCS membrane (at ~200 °C), as a lithium ion battery (LIB) separator, showed superior thermal and electrochemical stabilities compared with a widely used commercial separator (~120 °C). The average capacity decay rate of LIB for 500 cycles was calculated to be approximately 0.030 mAh/g/cycle. This value demonstrated exceptional long-term stability compared with commercial LIBs and with two other types (single core-shell and co-electrospun separators incorporating with functionalized TiO 2 ) of PI/PVdF composite separators. The proper architecture and synergy effects of multiple PI nanofibrils as a thermally stable polymer in the PVdF shell as electrolyte compatible polymers are responsible for the superior thermal performance and long-term stability of the LIB.
Multicore-shell nanofiber architecture of polyimide/polyvinylidene fluoride blend for thermal and long-term stability of lithium ion battery separator

PubMed Central

Park, Sejoon; Son, Chung Woo; Lee, Sungho; Kim, Dong Young; Park, Cheolmin; Eom, Kwang Sup; Fuller, Thomas F.; Joh, Han-Ik; Jo, Seong Mu

2016-01-01

Li-ion battery, separator, multicoreshell structure, thermal stability, long-term stability. A nanofibrous membrane with multiple cores of polyimide (PI) in the shell of polyvinylidene fluoride (PVdF) was prepared using a facile one-pot electrospinning technique with a single nozzle. Unique multicore-shell (MCS) structure of the electrospun composite fibers was obtained, which resulted from electrospinning a phase-separated polymer composite solution. Multiple PI core fibrils with high molecular orientation were well-embedded across the cross-section and contributed remarkable thermal stabilities to the MCS membrane. Thus, no outbreaks were found in its dimension and ionic resistance up to 200 and 250 °C, respectively. Moreover, the MCS membrane (at ~200 °C), as a lithium ion battery (LIB) separator, showed superior thermal and electrochemical stabilities compared with a widely used commercial separator (~120 °C). The average capacity decay rate of LIB for 500 cycles was calculated to be approximately 0.030 mAh/g/cycle. This value demonstrated exceptional long-term stability compared with commercial LIBs and with two other types (single core-shell and co-electrospun separators incorporating with functionalized TiO2) of PI/PVdF composite separators. The proper architecture and synergy effects of multiple PI nanofibrils as a thermally stable polymer in the PVdF shell as electrolyte compatible polymers are responsible for the superior thermal performance and long-term stability of the LIB. PMID:27833132
Multicore fibre technology: the road to multimode photonics

NASA Astrophysics Data System (ADS)

Bland-Hawthorn, J.; Min, Seong-Sik; Lindley, Emma; Leon-Saval, Sergio; Ellis, Simon; Lawrence, Jon; Beyrand, Nicolas; Roth, Martin; Löhmannsröben, Hans-Gerd; Veilleux, Sylvain

2016-07-01

For the past forty years, optical fibres have found widespread use in ground-based and space-based instruments. In most applications, these fibres are used in conjunction with conventional optics to transport light. But photonics offers a huge range of optical manipulations beyond light transport that were rarely exploited before 2001. The fundamental obstacle to the broader use of photonics is the difficulty of achieving photonic action in a multimode fibre. The first step towards a general solution was the invention of the photonic lantern1 in 2004 and the delivery of high-efficiency devices (< 1 dB loss) five years on2. Multicore fibres (MCF), used in conjunction with lanterns, are now enabling an even bigger leap towards multimode photonics. Until recently, the single-moded cores in MCFs were not sufficiently uniform to achieve telecom (SMF-28) performance. Now that high-quality MCFs have been realized, we turn our attention to printing complex functions (e.g. Bragg gratings for OH suppression) into their N cores. Our first work in this direction used a Mach-Zehnder interferometer (near-field phase mask) but this approach was only adequate for N=7 MCFs as measured by the grating uniformity3. We have now built a Sagnac interferometer that gives a three-fold increase in the depth of field sufficient to print across N >= 127 cores. We achieved first light this year with our 500mW Sabre FRED laser. These are sophisticated and complex interferometers. We report on our progress to date and summarize our first-year goals which include multimode OH suppression fibres for the Anglo-Australian Telescope/PRAXIS instrument and the Discovery Channel Telescope/MOHSIS instrument under development at the University of Maryland.
Quantum key distribution in multicore fibre for secure radio access networks

NASA Astrophysics Data System (ADS)

Llorente, Roberto; Provot, Antoine; Morant, Maria

2018-01-01

Broadband access in optical domain usually focuses in providing a pervasive cost-effective high bitrate communication in a given area. Nowadays, it is of utmost interest also to be able to provide a secure communication to the costumers in the area. Wireless access networks rely on optical domain for both fronthaul and backhaul of the radio access network (C-RAN). Multicore fiber (MCF) has been proposed as a promising candidate for the optical media of choice in nextgeneration wireless. The capacity demand of next-generation 5G networks makes interesting the use of high-capacity optical solutions as space-division multiplexing of different signals over MCF media. This work addresses secure MCF communication supporting C-RAN architectures. The paper proposes the use of one core in the MCF to transport securely an optical quantum key encoding altogether with end-to-end wireless signal transmitted in the remaining cores in radio-over-fiber (RoF). The RoF wireless signals are suitable for radio access fronthaul and backhaul. The theoretical principle and simulation analysis of quantum key distribution (QKD) are presented in this paper. The potential impact of optical RoF transmission crosstalk impairments is assessed experimentally considering different cellular signals on the remaining optical cores in the MCF. The experimental results report fronthaul performance over a four-core optical fiber with RoF transmission of full-standard CDMA signals providing 3.5G services in one core, HSPA+ signals providing 3.9G services in the second core and 3GPP LTEAdvanced signals providing 4G services in the third core, considering that the QKD signal is allocated in the fourth core.

Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems

PubMed Central

Wang, Kaibo; Huai, Yin; Lee, Rubao; Wang, Fusheng; Zhang, Xiaodong; Saltz, Joel H.

2012-01-01

As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge amount of segmented micro-anatomic objects demands extremely data- and compute-intensive operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory since their implementations of spatial operations cannot fully utilize the power of modern parallel hardware. In this paper, we provide a customized software solution that exploits GPUs and multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. Our solution consists of an efficient GPU algorithm and a pipelined system framework with task migration support. Extensive experiments with real-world data sets demonstrate the effectiveness of our solution, which improves the performance of spatial cross-comparison by over 18 times compared with a parallelized spatial database approach. PMID:23355955
Runtime Performance and Virtual Network Control Alternatives in VM-Based High-Fidelity Network Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoginath, Srikanth B; Perumalla, Kalyan S; Henz, Brian J

2012-01-01

In prior work (Yoginath and Perumalla, 2011; Yoginath, Perumalla and Henz, 2012), the motivation, challenges and issues were articulated in favor of virtual time ordering of Virtual Machines (VMs) in network simulations hosted on multi-core machines. Two major components in the overall virtualization challenge are (1) virtual timeline establishment and scheduling of VMs, and (2) virtualization of inter-VM communication. Here, we extend prior work by presenting scaling results for the first component, with experiment results on up to 128 VMs scheduled in virtual time order on a single 12-core host. We also explore the solution space of design alternatives formore » the second component, and present performance results from a multi-threaded, multi-queue implementation of inter-VM network control for synchronized execution with VM scheduling, incorporated in our NetWarp simulation system.« less
Discrimination of Temperature and Strain in Brillouin Optical Time Domain Analysis Using a Multicore Optical Fiber

PubMed Central

Zaghloul, Mohamed A. S.; Wang, Mohan; Milione, Giovanni; Li, Ming-Jun; Li, Shenping; Huang, Yue-Kai; Wang, Ting; Chen, Kevin P.

2018-01-01

Brillouin optical time domain analysis is the sensing of temperature and strain changes along an optical fiber by measuring the frequency shift changes of Brillouin backscattering. Because frequency shift changes are a linear combination of temperature and strain changes, their discrimination is a challenge. Here, a multicore optical fiber that has two cores is fabricated. The differences between the cores’ temperature and strain coefficients are such that temperature (strain) changes can be discriminated with error amplification factors of 4.57 °C/MHz (69.11 μϵ/MHz), which is 2.63 (3.67) times lower than previously demonstrated. As proof of principle, using the multicore optical fiber and a commercial Brillouin optical time domain analyzer, the temperature (strain) changes of a thermally expanding metal cylinder are discriminated with an error of 0.24% (3.7%). PMID:29649148
Optimization of multicore-shell Fe3O4-SiO2 magnetic nanocomposites synthesis and retention in cellulose pulp

NASA Astrophysics Data System (ADS)

Buteica, Dan; Borbath, Istvan; Nicolae, Ionel Valentin; Turcu, Rodica; Marinica, Oana; Socoliuc, Vlad

2017-12-01

The use of magnetite nanoparticles to produce magnetic paper has a severe effect on the color of the paper, which is worth searching means to alleviate. Multicore-shell Fe3O4-SiO2 magnetic nanocomposites were synthesized. The nanocomposite powder was dispersed in cellulose pulp and paper was produced by dehydration on a Rapid Kothen machine. The nanocomposite retention efficiency was investigated in correlation with nanocomposite shell thickness, the resinous vs. deciduous fiber content of the cellulose pulp, the long and short fibers' grinding degree, the cationic starch and polymeric retention agent content of the pulp. The whiteness and magnetization was measured for all paper samples. It was proved that the use of multi-core shell magnetic nanocomposites leads to weaker paper coloring. This effect is enhanced by increasing the polymeric retention agent content of the pulp, in spite of higher composite content.
Discrimination of Temperature and Strain in Brillouin Optical Time Domain Analysis Using a Multicore Optical Fiber.

PubMed

Zaghloul, Mohamed A S; Wang, Mohan; Milione, Giovanni; Li, Ming-Jun; Li, Shenping; Huang, Yue-Kai; Wang, Ting; Chen, Kevin P

2018-04-12

Brillouin optical time domain analysis is the sensing of temperature and strain changes along an optical fiber by measuring the frequency shift changes of Brillouin backscattering. Because frequency shift changes are a linear combination of temperature and strain changes, their discrimination is a challenge. Here, a multicore optical fiber that has two cores is fabricated. The differences between the cores' temperature and strain coefficients are such that temperature (strain) changes can be discriminated with error amplification factors of 4.57 °C/MHz (69.11 μ ϵ /MHz), which is 2.63 (3.67) times lower than previously demonstrated. As proof of principle, using the multicore optical fiber and a commercial Brillouin optical time domain analyzer, the temperature (strain) changes of a thermally expanding metal cylinder are discriminated with an error of 0.24% (3.7%).
BilKristal 2.0: A tool for pattern information extraction from crystal structures

NASA Astrophysics Data System (ADS)

Okuyan, Erhan; Güdükbay, Uğur

2014-01-01

We present a revised version of the BilKristal tool of Okuyan et al. (2007). We converted the development environment into Microsoft Visual Studio 2005 in order to resolve compatibility issues. We added multi-core CPU support and improvements are made to graphics functions in order to improve performance. Discovered bugs are fixed and exporting functionality to a material visualization tool is added.
Mobile Thread Task Manager

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.

2013-01-01

The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance-load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTB architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
LIBS data analysis using a predictor-corrector based digital signal processor algorithm

NASA Astrophysics Data System (ADS)

Sanders, Alex; Griffin, Steven T.; Robinson, Aaron

2012-06-01

There are many accepted sensor technologies for generating spectra for material classification. Once the spectra are generated, communication bandwidth limitations favor local material classification with its attendant reduction in data transfer rates and power consumption. Transferring sensor technologies such as Cavity Ring-Down Spectroscopy (CRDS) and Laser Induced Breakdown Spectroscopy (LIBS) require effective material classifiers. A result of recent efforts has been emphasis on Partial Least Squares - Discriminant Analysis (PLS-DA) and Principle Component Analysis (PCA). Implementation of these via general purpose computers is difficult in small portable sensor configurations. This paper addresses the creation of a low mass, low power, robust hardware spectra classifier for a limited set of predetermined materials in an atmospheric matrix. Crucial to this is the incorporation of PCA or PLS-DA classifiers into a predictor-corrector style implementation. The system configuration guarantees rapid convergence. Software running on multi-core Digital Signal Processor (DSPs) simulates a stream-lined plasma physics model estimator, reducing Analog-to-Digital (ADC) power requirements. This paper presents the results of a predictorcorrector model implemented on a low power multi-core DSP to perform substance classification. This configuration emphasizes the hardware system and software design via a predictor corrector model that simultaneously decreases the sample rate while performing the classification.
QR-decomposition based SENSE reconstruction using parallel architecture.

PubMed

Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad

2018-04-01

Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advance MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of GPU based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU based SENSE reconstruction is evaluated against single and multicore CPU using openMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE PAGES

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; ...

2017-09-14

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

DOE PAGES

Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ...

2017-06-01

As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM T by using themore » compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.« less
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel

As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM T by using themore » compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.« less
Implementing Molecular Dynamics on Hybrid High Performance Computers - Particle-Particle Particle-Mesh

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, W Michael; Kohlmeyer, Axel; Plimpton, Steven J

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. In this paper, we present a continuation of previous work implementing algorithms for using accelerators into the LAMMPS molecular dynamics software for distributed memory parallel hybrid machines. In our previous work, we focused on acceleration for short-range models with anmore » approach intended to harness the processing power of both the accelerator and (multi-core) CPUs. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle-particle particle-mesh method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster. We provide a performance comparison of the same kernels compiled with both CUDA and OpenCL. We discuss limitations to parallel efficiency and future directions for improving performance on hybrid or heterogeneous computers.« less
Benchmarking the ATLAS software through the Kit Validation engine

NASA Astrophysics Data System (ADS)

De Salvo, Alessandro; Brasolin, Franco

2010-04-01

The measurement of the experiment software performance is a very important metric in order to choose the most effective resources to be used and to discover the bottlenecks of the code implementation. In this work we present the benchmark techniques used to measure the ATLAS software performance through the ATLAS offline testing engine Kit Validation and the online portal Global Kit Validation. The performance measurements, the data collection, the online analysis and display of the results will be presented. The results of the measurement on different platforms and architectures will be shown, giving a full report on the CPU power and memory consumption of the Monte Carlo generation, simulation, digitization and reconstruction of the most CPU-intensive channels. The impact of the multi-core computing on the ATLAS software performance will also be presented, comparing the behavior of different architectures when increasing the number of concurrent processes. The benchmark techniques described in this paper have been used in the HEPiX group since the beginning of 2008 to help defining the performance metrics for the High Energy Physics applications, based on the real experiment software.
A 60 GOPS/W, -1.8 V to 0.9 V body bias ULP cluster in 28 nm UTBB FD-SOI technology

NASA Astrophysics Data System (ADS)

Rossi, Davide; Pullini, Antonio; Loi, Igor; Gautschi, Michael; Gürkaynak, Frank K.; Bartolini, Andrea; Flatresse, Philippe; Benini, Luca

2016-03-01

Ultra-low power operation and extreme energy efficiency are strong requirements for a number of high-growth application areas, such as E-health, Internet of Things, and wearable Human-Computer Interfaces. A promising approach to achieve up to one order of magnitude of improvement in energy efficiency over current generation of integrated circuits is near-threshold computing. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all performance-constrained applications. Thread-level parallelism over multiple cores can be used to overcome the performance degradation at low voltage. Moreover, enabling the processors to operate on-demand and over a wide supply voltage and body bias ranges allows to achieve the best possible energy efficiency while satisfying a large spectrum of computational demands. In this work we present the first ever implementation of a 4-core cluster fabricated using conventional-well 28 nm UTBB FD-SOI technology. The multi-core architecture we present in this work is able to operate on a wide range of supply voltages starting from 0.44 V to 1.2 V. In addition, the architecture allows a wide range of body bias to be applied from -1.8 V to 0.9 V. The peak energy efficiency 60 GOPS/W is achieved at 0.5 V supply voltage and 0.5 V forward body bias. Thanks to the extended body bias range of conventional-well FD-SOI technology, high energy efficiency can be guaranteed for a wide range of process and environmental conditions. We demonstrate the ability to compensate for up to 99.7% of chips for process variation with only ±0.2 V of body biasing, and compensate temperature variation in the range -40 °C to 120 °C exploiting -1.1 V to 0.8 V body biasing. When compared to leading-edge near-threshold RISC processors optimized for extremely low power applications, the multi-core architecture we propose has 144× more performance at comparable energy efficiency levels. Even when compared to other low-power processors with comparable performance, including those implemented in 28 nm technology, our platform provides 1.4× to 3.7× better energy efficiency.
An acquisition system for CMOS imagers with a genuine 10 Gbit/s bandwidth

NASA Astrophysics Data System (ADS)

Guérin, C.; Mahroug, J.; Tromeur, W.; Houles, J.; Calabria, P.; Barbier, R.

2012-12-01

This paper presents a high data throughput acquisition system for pixel detector readout such as CMOS imagers. This CMOS acquisition board offers a genuine 10 Gbit/s bandwidth to the workstation and can provide an on-line and continuous high frame rate imaging capability. On-line processing can be implemented either on the Data Acquisition Board or on the multi-cores workstation depending on the complexity of the algorithms. The different parts composing the acquisition board have been designed to be used first with a single-photon detector called LUSIPHER (800×800 pixels), developed in our laboratory for scientific applications ranging from nano-photonics to adaptive optics. The architecture of the acquisition board is presented and the performances achieved by the produced boards are described. The future developments (hardware and software) concerning the on-line implementation of algorithms dedicated to single-photon imaging are tackled.
QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids

NASA Astrophysics Data System (ADS)

Kim, Jeongnim; Baczewski, Andrew D.; Beaudet, Todd D.; Benali, Anouar; Chandler Bennett, M.; Berrill, Mark A.; Blunt, Nick S.; Josué Landinez Borda, Edgar; Casula, Michele; Ceperley, David M.; Chiesa, Simone; Clark, Bryan K.; Clay, Raymond C., III; Delaney, Kris T.; Dewing, Mark; Esler, Kenneth P.; Hao, Hongxia; Heinonen, Olle; Kent, Paul R. C.; Krogel, Jaron T.; Kylänpää, Ilkka; Li, Ying Wai; Lopez, M. Graham; Luo, Ye; Malone, Fionn D.; Martin, Richard M.; Mathuriya, Amrita; McMinis, Jeremy; Melton, Cody A.; Mitas, Lubos; Morales, Miguel A.; Neuscamman, Eric; Parker, William D.; Pineda Flores, Sergio D.; Romero, Nichols A.; Rubenstein, Brenda M.; Shea, Jacqueline A. R.; Shin, Hyeondeok; Shulenburger, Luke; Tillack, Andreas F.; Townsend, Joshua P.; Tubman, Norm M.; Van Der Goetz, Brett; Vincent, Jordan E.; ChangMo Yang, D.; Yang, Yubo; Zhang, Shuai; Zhao, Luning

2018-05-01

QMCPACK is an open source quantum Monte Carlo package for ab initio electronic structure calculations. It supports calculations of metallic and insulating solids, molecules, atoms, and some model Hamiltonians. Implemented real space quantum Monte Carlo algorithms include variational, diffusion, and reptation Monte Carlo. QMCPACK uses Slater–Jastrow type trial wavefunctions in conjunction with a sophisticated optimizer capable of optimizing tens of thousands of parameters. The orbital space auxiliary-field quantum Monte Carlo method is also implemented, enabling cross validation between different highly accurate methods. The code is specifically optimized for calculations with large numbers of electrons on the latest high performance computing architectures, including multicore central processing unit and graphical processing unit systems. We detail the program’s capabilities, outline its structure, and give examples of its use in current research calculations. The package is available at http://qmcpack.org.
QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids.

PubMed

Kim, Jeongnim; Baczewski, Andrew T; Beaudet, Todd D; Benali, Anouar; Bennett, M Chandler; Berrill, Mark A; Blunt, Nick S; Borda, Edgar Josué Landinez; Casula, Michele; Ceperley, David M; Chiesa, Simone; Clark, Bryan K; Clay, Raymond C; Delaney, Kris T; Dewing, Mark; Esler, Kenneth P; Hao, Hongxia; Heinonen, Olle; Kent, Paul R C; Krogel, Jaron T; Kylänpää, Ilkka; Li, Ying Wai; Lopez, M Graham; Luo, Ye; Malone, Fionn D; Martin, Richard M; Mathuriya, Amrita; McMinis, Jeremy; Melton, Cody A; Mitas, Lubos; Morales, Miguel A; Neuscamman, Eric; Parker, William D; Pineda Flores, Sergio D; Romero, Nichols A; Rubenstein, Brenda M; Shea, Jacqueline A R; Shin, Hyeondeok; Shulenburger, Luke; Tillack, Andreas F; Townsend, Joshua P; Tubman, Norm M; Van Der Goetz, Brett; Vincent, Jordan E; Yang, D ChangMo; Yang, Yubo; Zhang, Shuai; Zhao, Luning

2018-05-16

QMCPACK is an open source quantum Monte Carlo package for ab initio electronic structure calculations. It supports calculations of metallic and insulating solids, molecules, atoms, and some model Hamiltonians. Implemented real space quantum Monte Carlo algorithms include variational, diffusion, and reptation Monte Carlo. QMCPACK uses Slater-Jastrow type trial wavefunctions in conjunction with a sophisticated optimizer capable of optimizing tens of thousands of parameters. The orbital space auxiliary-field quantum Monte Carlo method is also implemented, enabling cross validation between different highly accurate methods. The code is specifically optimized for calculations with large numbers of electrons on the latest high performance computing architectures, including multicore central processing unit and graphical processing unit systems. We detail the program's capabilities, outline its structure, and give examples of its use in current research calculations. The package is available at http://qmcpack.org.
Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing

PubMed Central

Kang, Mikyung; Kang, Dong-In; Crago, Stephen P.; Park, Gyung-Leen; Lee, Junghoon

2011-01-01

Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter optimizing its computing configuration based on the analyzed data. PMID:22163811

Real-Time Agent-Based Modeling Simulation with in-situ Visualization of Complex Biological Systems: A Case Study on Vocal Fold Inflammation and Healing.

PubMed

Seekhao, Nuttiiya; Shung, Caroline; JaJa, Joseph; Mongeau, Luc; Li-Jessen, Nicole Y K

2016-05-01

We present an efficient and scalable scheme for implementing agent-based modeling (ABM) simulation with In Situ visualization of large complex systems on heterogeneous computing platforms. The scheme is designed to make optimal use of the resources available on a heterogeneous platform consisting of a multicore CPU and a GPU, resulting in minimal to no resource idle time. Furthermore, the scheme was implemented under a client-server paradigm that enables remote users to visualize and analyze simulation data as it is being generated at each time step of the model. Performance of a simulation case study of vocal fold inflammation and wound healing with 3.8 million agents shows 35× and 7× speedup in execution time over single-core and multi-core CPU respectively. Each iteration of the model took less than 200 ms to simulate, visualize and send the results to the client. This enables users to monitor the simulation in real-time and modify its course as needed.
Design and development of a run-time monitor for multi-core architectures in cloud computing.

PubMed

Kang, Mikyung; Kang, Dong-In; Crago, Stephen P; Park, Gyung-Leen; Lee, Junghoon

2011-01-01

Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter optimizing its computing configuration based on the analyzed data.
Magnetophoretic separation ICP-MS immunoassay using Cs-doped multicore magnetic nanoparticles for the determination of salmonella typhimurium.

PubMed

Jeong, Arong; Lim, H B

2018-02-01

In this work, a magnetophoretic separation ICP-MS immunoassay using newly synthesized multicore magnetic nanoparticles (MMNPs) was developed for the determination of salmonella typhimurium (typhi). The uniqueness of this method was the use of MMNPs doped with Cs for both separation and detection, which enable us to achieve fast analysis, high sensitivity, and good reliability. For demonstration, heat-killed typhi in a phosphate buffer solution was determined by ICP-MS after the MMNP-typhi reaction product was separated from unreacted MMNPs in a micropipette tip filled with 25% polyethylene glycol through magnetophoretic separation. The calibration curve obtained by plotting 133 Cs intensity vs. the number of synthetic standard, showed a coefficient of determination (R 2 ) of 0.94 with a limit of detection (LOD) of 102 cells/mL without cell culturing. Excellent recoveries, between 98-100%, were obtained from four replicates and compared with a sandwich-type ICP-MS immunoassay for further confirmation. Copyright © 2017 Elsevier B.V. All rights reserved.
Contamination of arctic Fjord sediments by Pb-Zn mining at Maarmorilik in central West Greenland.

PubMed

Perner, K; Leipe, Th; Dellwig, O; Kuijpers, A; Mikkelsen, N; Andersen, T J; Harff, J

2010-07-01

This study focuses on heavy metal contamination of arctic sediments from a small Fjord system adjacent to the Pb-Zn "Black Angel" mine (West Greenland) to investigate the temporal and spatial development of contamination and to provide baseline levels before the mines re-opening in January 2009. For this purpose we collected multi-cores along a transect from Affarlikassaa Fjord, which received high amounts of tailings from 1973 to 1990, to the mouth of Qaumarujuk Fjord. Along with radiochemical dating by (210)Pb and (137)Cs, geochemical analyses of heavy metals (e.g. As, Cd, Hg, Pb, and Zn) were carried out. Maximum contents were found at 12 cm depth in Affarlikassaa. After 17 years the mine last closed, specific local hydrographic conditions continue to disperse heavy metal enriched material derived from the Affarlikassaa into Qaumarujuk. Total Hg profiles from multi-cores along the transect clearly illustrate this transport and spatial distribution pattern of the contaminated material. Copyright 2010 Elsevier Ltd. All rights reserved.
Non-proximity resonant tunneling in multi-core photonic band gap fibers: An efficient mechanism for engineering highly-selective ultra-narrow band pass splitters

NASA Astrophysics Data System (ADS)

Florous, Nikolaos J.; Saitoh, Kunimasa; Murao, Tadashi; Koshiba, Masanori; Skorobogatiy, Maksim

2006-05-01

The objective of the present investigation is to demonstrate the possibility of designing compact ultra-narrow band-pass filters based on the phenomenon of non-proximity resonant tunneling in multi-core photonic band gap fibers (PBGFs). The proposed PBGF consists of three identical air-cores separated by two defected air-holes which act as highly-selective resonators. With a fine adjustment of the design parameters associated with the resonant-air-holes, phase matching at two distinct wavelengths can be achieved, thus enabling very narrow-band resonant directional coupling between the input and the two output cores. The validation of the proposed design is ensured with an accurate PBGF analysis based on finite element modal and beam propagation algorithms. Typical characteristics of the proposed device over a single polarization are: reasonable short coupling length of 2.7 mm, dual bandpass transmission response at wavelengths of 1.339 and 1.357 μm, with corresponding full width at half maximum bandwidths of 1.2 nm and 1.1 nm respectively, and a relatively high transmission of 95% at the exact resonance wavelengths. The proposed ultra-narrow band-pass filter can be employed in various applications such as all-fiber bandpass/bandstop filtering and resonant sensors.
Non-proximity resonant tunneling in multi-core photonic band gap fibers: An efficient mechanism for engineering highly-selective ultra-narrow band pass splitters.

PubMed

Florous, Nikolaos J; Saitoh, Kunimasa; Murao, Tadashi; Koshiba, Masanori; Skorobogatiy, Maksim

2006-05-29

The objective of the present investigation is to demonstrate the possibility of designing compact ultra-narrow band-pass filters based on the phenomenon of non-proximity resonant tunneling in multi-core photonic band gap fibers (PBGFs). The proposed PBGF consists of three identical air-cores separated by two defected air-holes which act as highly-selective resonators. With a fine adjustment of the design parameters associated with the resonant-air-holes, phase matching at two distinct wavelengths can be achieved, thus enabling very narrow-band resonant directional coupling between the input and the two output cores. The validation of the proposed design is ensured with an accurate PBGF analysis based on finite element modal and beam propagation algorithms. Typical characteristics of the proposed device over a single polarization are: reasonable short coupling length of 2.7 mm, dual bandpass transmission response at wavelengths of 1.339 and 1.357 mum, with corresponding full width at half maximum bandwidths of 1.2 nm and 1.1 nm respectively, and a relatively high transmission of 95% at the exact resonance wavelengths. The proposed ultra-narrow band-pass filter can be employed in various applications such as all-fiber bandpass/bandstop filtering and resonant sensors.
Self-powered information measuring wireless networks using the distribution of tasks within multicore processors

NASA Astrophysics Data System (ADS)

Zhuravska, Iryna M.; Koretska, Oleksandra O.; Musiyenko, Maksym P.; Surtel, Wojciech; Assembay, Azat; Kovalev, Vladimir; Tleshova, Akmaral

2017-08-01

The article contains basic approaches to develop the self-powered information measuring wireless networks (SPIM-WN) using the distribution of tasks within multicore processors critical applying based on the interaction of movable components - as in the direction of data transmission as wireless transfer of energy coming from polymetric sensors. Base mathematic model of scheduling tasks within multiprocessor systems was modernized to schedule and allocate tasks between cores of one-crystal computer (SoC) to increase energy efficiency SPIM-WN objects.
A multicore compound glass optical fiber for neutron imaging

NASA Astrophysics Data System (ADS)

Moore, Michael; Zhang, Xiaodong; Feng, Xian; Brambilla, Gilberto; Hayward, Jason

2017-04-01

Optical fibers have been successfully utilized for point sensors targeting physical quantities (stress, strain, rotation, acceleration), chemical compounds (humidity, oil, nitrates, alcohols, DNA) or radiation fields (X-rays, β particles, γ-rays). Similarly, bundles of fibers have been extremely successful in imaging visible wavelengths for medical endoscopy and industrial boroscopy. This work presents the progress in the fabrication and experimental evaluation of multicore fiber as neutron scattering instrumentation designed to detect and image neutrons with micron level spatial resolution.
Fully-elastic multi-granular network with space/frequency/time switching using multi-core fibres and programmable optical nodes.

PubMed

Amaya, N; Irfan, M; Zervas, G; Nejabati, R; Simeonidou, D; Sakaguchi, J; Klaus, W; Puttnam, B J; Miyazawa, T; Awaji, Y; Wada, N; Henning, I

2013-04-08

We present the first elastic, space division multiplexing, and multi-granular network based on two 7-core MCF links and four programmable optical nodes able to switch traffic utilising the space, frequency and time dimensions with over 6000-fold bandwidth granularity. Results show good end-to-end performance on all channels with power penalties between 0.75 dB and 3.7 dB.
Optimal Configuration and Deployment of Software on Multi-Core Processing Architectures

DTIC Science & Technology

2008-07-01

between the event generating threads and the collector thread is implemented through semaphores . The Perseus data logger is designed to minimize the...performance counters (through the PAPI API) and opens up access to the shared memory logger through a semaphore and Remote Procedure Call (RPC) buffer... synchronization events. Using this rich data, the TMAM is able to output all of the information necessary to identify precisely which pairs of thread
Cyber-Physical Multi-Core Optimization for Resource and Cache Effects (C2ORES)

DTIC Science & Technology

2014-03-01

DoD-sponsored ATAACK mobile cloud testbed funded through the DURIP program, which is deployed at Virginia Tech and Vanderbilt University to conduct...0.9.2. Jug was configured to use a filesystem (network file system (nfs)) backend for locking and task synchronization. 4.1.7.2 Experiment 1...and performance-aware virtual machine placement technique that is realized as cloud infrastructure middleware. The key contributions of iPlace include
High-performance dynamic quantum clustering on graphics processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wittek, Peter, E-mail: peterwittek@acm.org

2013-01-15

Clustering methods in machine learning may benefit from borrowing metaphors from physics. Dynamic quantum clustering associates a Gaussian wave packet with the multidimensional data points and regards them as eigenfunctions of the Schroedinger equation. The clustering structure emerges by letting the system evolve and the visual nature of the algorithm has been shown to be useful in a range of applications. Furthermore, the method only uses matrix operations, which readily lend themselves to parallelization. In this paper, we develop an implementation on graphics hardware and investigate how this approach can accelerate the computations. We achieve a speedup of up tomore » two magnitudes over a multicore CPU implementation, which proves that quantum-like methods and acceleration by graphics processing units have a great relevance to machine learning.« less
A New Generation of Real-Time Systems in the JET Tokamak

NASA Astrophysics Data System (ADS)

Alves, Diogo; Neto, Andre C.; Valcarcel, Daniel F.; Felton, Robert; Lopez, Juan M.; Barbalace, Antonio; Boncagni, Luca; Card, Peter; De Tommasi, Gianmaria; Goodyear, Alex; Jachmich, Stefan; Lomas, Peter J.; Maviglia, Francesco; McCullen, Paul; Murari, Andrea; Rainford, Mark; Reux, Cedric; Rimini, Fernanda; Sartori, Filippo; Stephen, Adam V.; Vega, Jesus; Vitelli, Riccardo; Zabeo, Luca; Zastrow, Klaus-Dieter

2014-04-01

Recently, a new recipe for developing and deploying real-time systems has become increasingly adopted in the JET tokamak. Powered by the advent of x86 multi-core technology and the reliability of JET's well established Real-Time Data Network (RTDN) to handle all real-time I/O, an official Linux vanilla kernel has been demonstrated to be able to provide real-time performance to user-space applications that are required to meet stringent timing constraints. In particular, a careful rearrangement of the Interrupt ReQuests' (IRQs) affinities together with the kernel's CPU isolation mechanism allows one to obtain either soft or hard real-time behavior depending on the synchronization mechanism adopted. Finally, the Multithreaded Application Real-Time executor (MARTe) framework is used for building applications particularly optimised for exploring multi-core architectures. In the past year, four new systems based on this philosophy have been installed and are now part of JET's routine operation. The focus of the present work is on the configuration aspects that enable these new systems' real-time capability. Details are given about the common real-time configuration of these systems, followed by a brief description of each system together with results regarding their real-time performance. A cycle time jitter analysis of a user-space MARTe based application synchronizing over a network is also presented. The goal is to compare its deterministic performance while running on a vanilla and on a Messaging Real time Grid (MRG) Linux kernel.
Group delay spread analysis of coupled-multicore fibers: A comparison between weak and tight bending conditions

NASA Astrophysics Data System (ADS)

Fujisawa, Takeshi; Saitoh, Kunimasa

2017-06-01

Group delay spread of coupled three-core fiber is investigated based on coupled-wave theory. The differences between supermode and discrete core mode models are thoroughly investigated to reveal applicability of both models for specific fiber bending condition. A macrobending with random twisting is taken into account for random modal mixing in the fiber. It is found that for weakly bent condition, both supermode and discrete core mode models are applicable. On the other hand, for strongly bent condition, the discrete core mode model should be used to account for increased differential modal group delay for the fiber without twisting and short correlation length, which were experimentally observed recently. Results presented in this paper indicate the discrete core mode model is superior to the supermode model for the analysis of coupled-multicore fibers for various bent condition. Also, for estimating GDS of coupled-multicore fiber, it is critically important to take into account the fiber bending condition.
Neural networks within multi-core optic fibers

PubMed Central

Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

2016-01-01

Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks. PMID:27383911
Neural networks within multi-core optic fibers.

PubMed

Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

2016-07-07

Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.
Multicore and GPU algorithms for Nussinov RNA folding

PubMed Central

2014-01-01

Background One segment of a RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Results We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. Conclusions Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. PMID:25082539
Hierarchical Parallelization of Gene Differential Association Analysis

PubMed Central

2011-01-01

Background Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. Conclusions The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels. PMID:21936916
Hierarchical parallelization of gene differential association analysis.

PubMed

Needham, Mark; Hu, Rui; Dwarkadas, Sandhya; Qiu, Xing

2011-09-21

Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.
Highly scalable parallel processing of extracellular recordings of Multielectrode Arrays.

PubMed

Gehring, Tiago V; Vasilaki, Eleni; Giugliano, Michele

2015-01-01

Technological advances of Multielectrode Arrays (MEAs) used for multisite, parallel electrophysiological recordings, lead to an ever increasing amount of raw data being generated. Arrays with hundreds up to a few thousands of electrodes are slowly seeing widespread use and the expectation is that more sophisticated arrays will become available in the near future. In order to process the large data volumes resulting from MEA recordings there is a pressing need for new software tools able to process many data channels in parallel. Here we present a new tool for processing MEA data recordings that makes use of new programming paradigms and recent technology developments to unleash the power of modern highly parallel hardware, such as multi-core CPUs with vector instruction sets or GPGPUs. Our tool builds on and complements existing MEA data analysis packages. It shows high scalability and can be used to speed up some performance critical pre-processing steps such as data filtering and spike detection, helping to make the analysis of larger data sets tractable.

Verification of Electromagnetic Physics Models for Parallel Computing Architectures in the GeantV Project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Amadio, G.; et al.

An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physicsmore » models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.« less
Orthorectification by Using Gpgpu Method

NASA Astrophysics Data System (ADS)

Sahin, H.; Kulur, S.

2012-07-01

Thanks to the nature of the graphics processing, the newly released products offer highly parallel processing units with high-memory bandwidth and computational power of more than teraflops per second. The modern GPUs are not only powerful graphic engines but also they are high level parallel programmable processors with very fast computing capabilities and high-memory bandwidth speed compared to central processing units (CPU). Data-parallel computations can be shortly described as mapping data elements to parallel processing threads. The rapid development of GPUs programmability and capabilities attracted the attentions of researchers dealing with complex problems which need high level calculations. This interest has revealed the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". The graphic processors are powerful hardware which is really cheap and affordable. So the graphic processors became an alternative to computer processors. The graphic chips which were standard application hardware have been transformed into modern, powerful and programmable processors to meet the overall needs. Especially in recent years, the phenomenon of the usage of graphics processing units in general purpose computation has led the researchers and developers to this point. The biggest problem is that the graphics processing units use different programming models unlike current programming methods. Therefore, an efficient GPU programming requires re-coding of the current program algorithm by considering the limitations and the structure of the graphics hardware. Currently, multi-core processors can not be programmed by using traditional programming methods. Event procedure programming method can not be used for programming the multi-core processors. GPUs are especially effective in finding solution for repetition of the computing steps for many data elements when high accuracy is needed. Thus, it provides the computing process more quickly and accurately. Compared to the GPUs, CPUs which perform just one computing in a time according to the flow control are slower in performance. This structure can be evaluated for various applications of computer technology. In this study covers how general purpose parallel programming and computational power of the GPUs can be used in photogrammetric applications especially direct georeferencing. The direct georeferencing algorithm is coded by using GPGPU method and CUDA (Compute Unified Device Architecture) programming language. Results provided by this method were compared with the traditional CPU programming. In the other application the projective rectification is coded by using GPGPU method and CUDA programming language. Sample images of various sizes, as compared to the results of the program were evaluated. GPGPU method can be used especially in repetition of same computations on highly dense data, thus finding the solution quickly.
THC-MP: High performance numerical simulation of reactive transport and multiphase flow in porous media

NASA Astrophysics Data System (ADS)

Wei, Xiaohui; Li, Weishan; Tian, Hailong; Li, Hongliang; Xu, Haixiao; Xu, Tianfu

2015-07-01

The numerical simulation of multiphase flow and reactive transport in the porous media on complex subsurface problem is a computationally intensive application. To meet the increasingly computational requirements, this paper presents a parallel computing method and architecture. Derived from TOUGHREACT that is a well-established code for simulating subsurface multi-phase flow and reactive transport problems, we developed a high performance computing THC-MP based on massive parallel computer, which extends greatly on the computational capability for the original code. The domain decomposition method was applied to the coupled numerical computing procedure in the THC-MP. We designed the distributed data structure, implemented the data initialization and exchange between the computing nodes and the core solving module using the hybrid parallel iterative and direct solver. Numerical accuracy of the THC-MP was verified through a CO2 injection-induced reactive transport problem by comparing the results obtained from the parallel computing and sequential computing (original code). Execution efficiency and code scalability were examined through field scale carbon sequestration applications on the multicore cluster. The results demonstrate successfully the enhanced performance using the THC-MP on parallel computing facilities.
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

PubMed

Cheung, Kit; Schultz, Simon R; Luk, Wayne

2015-01-01

NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors

PubMed Central

Cheung, Kit; Schultz, Simon R.; Luk, Wayne

2016-01-01

NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation. PMID:26834542
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws

NASA Technical Reports Server (NTRS)

Cooke, Daniel; Rushton, Nelson

2013-01-01

With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for highend computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language that is, a programming language that is closer to a human s way of thinking than to a machine s. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequen tial/singlecore code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify- Produce (CSP) and Normalize-Trans - pose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less costly than development of comparable parallel code. Moreover, SequenceL not only automatically parallelizes the code, but since it is based on CSP-NT, it is provably race free, thus eliminating the largest quality challenge the parallelized software developer faces.
MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware.

PubMed

Lommen, Arjen; Kools, Harrie J

2012-08-01

A new, multi-threaded version of the GC-MS and LC-MS data processing software, metAlign, has been developed which is able to utilize multiple cores on one PC. This new version was tested using three different multi-core PCs with different operating systems. The performance of noise reduction, baseline correction and peak-picking was 8-19 fold faster compared to the previous version on a single core machine from 2008. The alignment was 5-10 fold faster. Factors influencing the performance enhancement are discussed. Our observations show that performance scales with the increase in processor core numbers we currently see in consumer PC hardware development.
Few Mode Multicore Photonic Lantern Multiplexer

DTIC Science & Technology

2016-01-01

2015, Valencia (2015). [6] S. G. Leon-Saval, T. A. Birks, J. Bland- Hawthorn , and M. Englund, “Multimode fiber devices with single-mode performance...Opt. Lett. 30, 2545–2547 (2005). [7] D. Noordegraaf, P. M. W. Skovgaard, M. D. Nielsen, and J. Bland- Hawthorn , “Efficient multi-mode to single mode...coupling in a photonic lantern,” Opt. Express 17, 1988–1994 (2009). [8] S. G. Leon-Saval, A. Argyros, and J. Bland- Hawthorn , “Photonic lanterns: a
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
Comparison of multihardware parallel implementations for a phase unwrapping algorithm

NASA Astrophysics Data System (ADS)

Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

2018-04-01

Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
Scaling Support Vector Machines On Modern HPC Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2015-02-01

We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.
Photogrammetric Verification of Fiber Optic Shape Sensors on Flexible Aerospace Structures

NASA Technical Reports Server (NTRS)

Moore, Jason P.; Rogge, Matthew D.; Jones, Thomas W.

2012-01-01

Multi-core fiber (MCF) optic shape sensing offers the possibility of providing in-flight shape measurements of highly flexible aerospace structures and control surfaces for such purposes as gust load alleviation, flutter suppression, general flight control and structural health monitoring. Photogrammetric measurements of surface mounted MCF shape sensing cable can be used to quantify the MCF installation path and verify measurement methods.
Research of real-time video processing system based on 6678 multi-core DSP

NASA Astrophysics Data System (ADS)

Li, Xiangzhen; Xie, Xiaodan; Yin, Xiaoqiang

2017-10-01

In the information age, the rapid development in the direction of intelligent video processing, complex algorithm proposed the powerful challenge on the performance of the processor. In this article, through the FPGA + TMS320C6678 frame structure, the image to fog, merge into an organic whole, to stabilize the image enhancement, its good real-time, superior performance, break through the traditional function of video processing system is simple, the product defects such as single, solved the video application in security monitoring, video, etc. Can give full play to the video monitoring effectiveness, improve enterprise economic benefits.
Improvement of Speckle Contrast Image Processing by an Efficient Algorithm.

PubMed

Steimers, A; Farnung, W; Kohl-Bareis, M

2016-01-01

We demonstrate an efficient algorithm for the temporal and spatial based calculation of speckle contrast for the imaging of blood flow by laser speckle contrast analysis (LASCA). It reduces the numerical complexity of necessary calculations, facilitates a multi-core and many-core implementation of the speckle analysis and enables an independence of temporal or spatial resolution and SNR. The new algorithm was evaluated for both spatial and temporal based analysis of speckle patterns with different image sizes and amounts of recruited pixels as sequential, multi-core and many-core code.
Traditional Tracking with Kalman Filter on Parallel Architectures

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; MacNeill, Ian; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2015-05-01

Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The most common track finding techniques in use today are however those based on the Kalman Filter. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. We report the results of our investigations into the potential and limitations of these algorithms on the new parallel hardware.
Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

PubMed

Kalantzis, Georgios; Tachibana, Hidenobu

2014-01-01

For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Aeroacoustic Codes for Rotor Harmonic and BVI Noise. CAMRAD.Mod1/HIRES: Methodology and Users' Manual

NASA Technical Reports Server (NTRS)

Boyd, D. Douglas, Jr.; Brooks, Thomas F.; Burley, Casey L.; Jolly, J. Ralph, Jr.

1998-01-01

This document details the methodology and use of the CAMRAD.Mod1/HIRES codes, which were developed at NASA Langley Research Center for the prediction of helicopter harmonic and Blade-Vortex Interaction (BVI) noise. CANMAD.Mod1 is a substantially modified version of the performance/trim/wake code CANMAD. High resolution blade loading is determined in post-processing by HIRES and an associated indicial aerodynamics code. Extensive capabilities of importance to noise prediction accuracy are documented, including a new multi-core tip vortex roll-up wake model, higher harmonic and individual blade control, tunnel and fuselage correction input, diagnostic blade motion input, and interfaces for acoustic and CFD aerodynamics codes. Modifications and new code capabilities are documented with examples. A users' job preparation guide and listings of variables and namelists are given.
A History-based Estimation for LHCb job requirements

NASA Astrophysics Data System (ADS)

Rauschmayr, Nathalie

2015-12-01

The main goal of a Workload Management System (WMS) is to find and allocate resources for the given tasks. The more and better job information the WMS receives, the easier will be to accomplish its task, which directly translates into higher utilization of resources. Traditionally, the information associated with each job, like expected runtime, is defined beforehand by the Production Manager in best case and fixed arbitrary values by default. In the case of LHCb's Workload Management System no mechanisms are provided which automate the estimation of job requirements. As a result, much more CPU time is normally requested than actually needed. Particularly, in the context of multicore jobs this presents a major problem, since single- and multicore jobs shall share the same resources. Consequently, grid sites need to rely on estimations given by the VOs in order to not decrease the utilization of their worker nodes when making multicore job slots available. The main reason for going to multicore jobs is the reduction of the overall memory footprint. Therefore, it also needs to be studied how memory consumption of jobs can be estimated. A detailed workload analysis of past LHCb jobs is presented. It includes a study of job features and their correlation with runtime and memory consumption. Following the features, a supervised learning algorithm is developed based on a history based prediction. The aim is to learn over time how jobs’ runtime and memory evolve influenced due to changes in experiment conditions and software versions. It will be shown that estimation can be notably improved if experiment conditions are taken into account.
High-throughput Bayesian Network Learning using Heterogeneous Multicore Computers

PubMed Central

Linderman, Michael D.; Athalye, Vivek; Meng, Teresa H.; Asadi, Narges Bani; Bruggner, Robert; Nolan, Garry P.

2017-01-01

Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian Networks (BNs), and computationally learned from experimental data. However, learning the structure of Bayesian Networks (BNs) is an NP-hard problem that, even with fast heuristics, is too time consuming for large, clinically important networks (20–50 nodes). In this paper, we present a novel graphics processing unit (GPU)-accelerated implementation of a Monte Carlo Markov Chain-based algorithm for learning BNs that is up to 7.5-fold faster than current general-purpose processor (GPP)-based implementations. The GPU-based implementation is just one of several implementations within the larger application, each optimized for a different input or machine configuration. We describe the methodology we use to build an extensible application, assembled from these variants, that can target a broad range of heterogeneous systems, e.g., GPUs, multicore GPPs. Specifically we show how we use the Merge programming model to efficiently integrate, test and intelligently select among the different potential implementations. PMID:28819655
Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dongarra, Jack J.; Tomov, Stanimire

2014-03-24

The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid Multicore+GPU-based systems, using all the processing power that future high-end systems can make available within given energy constraints. Our efforts at the University of Tennessee achieved the goals set in all of the five areas identified in the proposal: 1. Communication optimal algorithms; 2. Autotuning for GPU and hybrid processors; 3. Scheduling and memory management techniques for heterogeneity and scale; 4. Fault tolerance and robustness for large scale systems; 5. Building energymore » efficiency into software foundations. The University of Tennessee’s main contributions, as proposed, were the research and software development of new algorithms for hybrid multi/many-core CPUs and GPUs, as related to two-sided factorizations and complete eigenproblem solvers, hybrid BLAS, and energy efficiency for dense, as well as sparse, operations. Furthermore, as proposed, we investigated and experimented with various techniques targeting the five main areas outlined.« less

Post-inscription tuning of multicore fiber Bragg gratings

NASA Astrophysics Data System (ADS)

Lindley, Emma Y.; Min, Seong-sik; Leon-Saval, Sergio G.; Bland-Hawthorn, Joss

2016-07-01

Fiber Bragg gratings are used in astronomy for their ability to suppress narrow atmospheric emission lines of temporally varying brightness before the light is dispersed. These gratings can only operate in a single-mode fiber as the suppressed wavelength depends on mode velocity in the core. Recent experiments with fibers containing multiple single-moded cores have demonstrated the potential for inscribing identical gratings across all cores in a single pass. We have already improved the uniformity of gratings in 7-core fibers via modifications to the writing process; further progress can be achieved by tuning the gratings of the outer and inner cores relative to one another. Our eventual goal is to make the entire fiber suppress one wavelength to a depth of 30 dB or greater. By coating the fiber in a heat-conductive material with a high expansion coefficient, we can examine the effects of temperature and strain on the spectral response of each core. In this paper we present methods and results from experiments concerning the post-write tuning of gratings in multicore fibers.
Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

DTIC Science & Technology

2014-05-01

processor developed by IBM and other companies , incorpo- rates the verb—POWER5— processor as the Power Processor Element (PPE), one of the early general...deliver an power efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA ...algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel.27 The vast growth of digital video content has been a
MIMO signal progressing with RLSCMA algorithm for multi-mode multi-core optical transmission system

NASA Astrophysics Data System (ADS)

Bi, Yuan; Liu, Bo; Zhang, Li-jia; Xin, Xiang-jun; Zhang, Qi; Wang, Yong-jun; Tian, Qing-hua; Tian, Feng; Mao, Ya-ya

2018-01-01

In the process of transmitting signals of multi-mode multi-core fiber, there will be mode coupling between modes. The mode dispersion will also occur because each mode has different transmission speed in the link. Mode coupling and mode dispersion will cause damage to the useful signal in the transmission link, so the receiver needs to deal received signal with digital signal processing, and compensate the damage in the link. We first analyzes the influence of mode coupling and mode dispersion in the process of transmitting signals of multi-mode multi-core fiber, then presents the relationship between the coupling coefficient and dispersion coefficient. Then we carry out adaptive signal processing with MIMO equalizers based on recursive least squares constant modulus algorithm (RLSCMA). The MIMO equalization algorithm offers adaptive equalization taps according to the degree of crosstalk in cores or modes, which eliminates the interference among different modes and cores in space division multiplexing(SDM) transmission system. The simulation results show that the distorted signals are restored efficiently with fast convergence speed.
Multicore Architecture-aware Scientific Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Srinivasa, Avinash

Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations or at the system level, like having strategies in place to override some of the default operating system polices, the main objective being to improve computational performance of the application. The current work proposes two such capabilites with respect to multi-threaded scientific applications, in particular a largemore » scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to the changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilties when included were found to have a significant impact on the application performance, resulting in average speedups of as much as two to four times.« less
DPM — efficient storage in diverse environments

NASA Astrophysics Data System (ADS)

Hellmich, Martin; Furano, Fabrizio; Smith, David; Brito da Rocha, Ricardo; Álvarez Ayllón, Alejandro; Manzi, Andrea; Keeble, Oliver; Calvet, Ivan; Regala, Miguel Antonio

2014-06-01

Recent developments, including low power devices, cluster file systems and cloud storage, represent an explosion in the possibilities for deploying and managing grid storage. In this paper we present how different technologies can be leveraged to build a storage service with differing cost, power, performance, scalability and reliability profiles, using the popular storage solution Disk Pool Manager (DPM/dmlite) as the enabling technology. The storage manager DPM is designed for these new environments, allowing users to scale up and down as they need it, and optimizing their computing centers energy efficiency and costs. DPM runs on high-performance machines, profiting from multi-core and multi-CPU setups. It supports separating the database from the metadata server, the head node, largely reducing its hard disk requirements. Since version 1.8.6, DPM is released in EPEL and Fedora, simplifying distribution and maintenance, but also supporting the ARM architecture beside i386 and x86_64, allowing it to run the smallest low-power machines such as the Raspberry Pi or the CuBox. This usage is facilitated by the possibility to scale horizontally using a main database and a distributed memcached-powered namespace cache. Additionally, DPM supports a variety of storage pools in the backend, most importantly HDFS, S3-enabled storage, and cluster file systems, allowing users to fit their DPM installation exactly to their needs. In this paper, we investigate the power-efficiency and total cost of ownership of various DPM configurations. We develop metrics to evaluate the expected performance of a setup both in terms of namespace and disk access considering the overall cost including equipment, power consumptions, or data/storage fees. The setups tested range from the lowest scale using Raspberry Pis with only 700MHz single cores and a 100Mbps network connections, over conventional multi-core servers to typical virtual machine instances in cloud settings. We evaluate the combinations of different name server setups, for example load-balanced clusters, with different storage setups, from using a classic local configuration to private and public clouds.
Fast CPU-based Monte Carlo simulation for radiotherapy dose calculation.

PubMed

Ziegenhein, Peter; Pirner, Sven; Ph Kamerling, Cornelis; Oelfke, Uwe

2015-08-07

Monte-Carlo (MC) simulations are considered to be the most accurate method for calculating dose distributions in radiotherapy. Its clinical application, however, still is limited by the long runtimes conventional implementations of MC algorithms require to deliver sufficiently accurate results on high resolution imaging data. In order to overcome this obstacle we developed the software-package PhiMC, which is capable of computing precise dose distributions in a sub-minute time-frame by leveraging the potential of modern many- and multi-core CPU-based computers. PhiMC is based on the well verified dose planning method (DPM). We could demonstrate that PhiMC delivers dose distributions which are in excellent agreement to DPM. The multi-core implementation of PhiMC scales well between different computer architectures and achieves a speed-up of up to 37[Formula: see text] compared to the original DPM code executed on a modern system. Furthermore, we could show that our CPU-based implementation on a modern workstation is between 1.25[Formula: see text] and 1.95[Formula: see text] faster than a well-known GPU implementation of the same simulation method on a NVIDIA Tesla C2050. Since CPUs work on several hundreds of GB RAM the typical GPU memory limitation does not apply for our implementation and high resolution clinical plans can be calculated.
Virtual machine-based simulation platform for mobile ad-hoc network-based cyber infrastructure

DOE PAGES

Yoginath, Srikanth B.; Perumalla, Kayla S.; Henz, Brian J.

2015-09-29

In modeling and simulating complex systems such as mobile ad-hoc networks (MANETs) in de-fense communications, it is a major challenge to reconcile multiple important considerations: the rapidity of unavoidable changes to the software (network layers and applications), the difficulty of modeling the critical, implementation-dependent behavioral effects, the need to sustain larger scale scenarios, and the desire for faster simulations. Here we present our approach in success-fully reconciling them using a virtual time-synchronized virtual machine(VM)-based parallel ex-ecution framework that accurately lifts both the devices as well as the network communications to a virtual time plane while retaining full fidelity. At themore » core of our framework is a scheduling engine that operates at the level of a hypervisor scheduler, offering a unique ability to execute multi-core guest nodes over multi-core host nodes in an accurate, virtual time-synchronized manner. In contrast to other related approaches that suffer from either speed or accuracy issues, our framework provides MANET node-wise scalability, high fidelity of software behaviors, and time-ordering accuracy. The design and development of this framework is presented, and an ac-tual implementation based on the widely used Xen hypervisor system is described. Benchmarks with synthetic and actual applications are used to identify the benefits of our approach. The time inaccuracy of traditional emulation methods is demonstrated, in comparison with the accurate execution of our framework verified by theoretically correct results expected from analytical models of the same scenarios. In the largest high fidelity tests, we are able to perform virtual time-synchronized simulation of 64-node VM-based full-stack, actual software behaviors of MANETs containing a mix of static and mobile (unmanned airborne vehicle) nodes, hosted on a 32-core host, with full fidelity of unmodified ad-hoc routing protocols, unmodified application executables, and user-controllable physical layer effects including inter-device wireless signal strength, reachability, and connectivity.« less
Virtual machine-based simulation platform for mobile ad-hoc network-based cyber infrastructure

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoginath, Srikanth B.; Perumalla, Kayla S.; Henz, Brian J.

In modeling and simulating complex systems such as mobile ad-hoc networks (MANETs) in de-fense communications, it is a major challenge to reconcile multiple important considerations: the rapidity of unavoidable changes to the software (network layers and applications), the difficulty of modeling the critical, implementation-dependent behavioral effects, the need to sustain larger scale scenarios, and the desire for faster simulations. Here we present our approach in success-fully reconciling them using a virtual time-synchronized virtual machine(VM)-based parallel ex-ecution framework that accurately lifts both the devices as well as the network communications to a virtual time plane while retaining full fidelity. At themore » core of our framework is a scheduling engine that operates at the level of a hypervisor scheduler, offering a unique ability to execute multi-core guest nodes over multi-core host nodes in an accurate, virtual time-synchronized manner. In contrast to other related approaches that suffer from either speed or accuracy issues, our framework provides MANET node-wise scalability, high fidelity of software behaviors, and time-ordering accuracy. The design and development of this framework is presented, and an ac-tual implementation based on the widely used Xen hypervisor system is described. Benchmarks with synthetic and actual applications are used to identify the benefits of our approach. The time inaccuracy of traditional emulation methods is demonstrated, in comparison with the accurate execution of our framework verified by theoretically correct results expected from analytical models of the same scenarios. In the largest high fidelity tests, we are able to perform virtual time-synchronized simulation of 64-node VM-based full-stack, actual software behaviors of MANETs containing a mix of static and mobile (unmanned airborne vehicle) nodes, hosted on a 32-core host, with full fidelity of unmodified ad-hoc routing protocols, unmodified application executables, and user-controllable physical layer effects including inter-device wireless signal strength, reachability, and connectivity.« less
Multicore Hardware Experiments in Software Producibility

DTIC Science & Technology

2009-06-01

processors. 15. SUBJECT TERMS Multi-core, Real - time Systems , Testing, Software Modernization 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF... real ‐ time systems . The inputs to the dgclocalnav component are the path plan (received from highlevelplanner, discussed next), the drivable grid... time systems , robotics, and software. As frequently observed in cyber‐physical systems, the system designers may need experience in multiple
MC3: Multi-core Markov-chain Monte Carlo code

NASA Astrophysics Data System (ADS)

Cubillos, Patricio; Harrington, Joseph; Lust, Nate; Foster, AJ; Stemm, Madison; Loredo, Tom; Stevenson, Kevin; Campo, Chris; Hardin, Matt; Hardy, Ryan

2016-10-01

MC3 (Multi-core Markov-chain Monte Carlo) is a Bayesian statistics tool that can be executed from the shell prompt or interactively through the Python interpreter with single- or multiple-CPU parallel computing. It offers Markov-chain Monte Carlo (MCMC) posterior-distribution sampling for several algorithms, Levenberg-Marquardt least-squares optimization, and uniform non-informative, Jeffreys non-informative, or Gaussian-informative priors. MC3 can share the same value among multiple parameters and fix the value of parameters to constant values, and offers Gelman-Rubin convergence testing and correlated-noise estimation with time-averaging or wavelet-based likelihood estimation methods.
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)

NASA Astrophysics Data System (ADS)

Vincenti, Henri

2016-03-01

The advent of exascale computers will enable 3D simulations of a new laser-plasma interaction regimes that were previously out of reach of current Petasale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potentialities of these new computing architectures. Indeed, achieving Exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments directly impacting our way of implementing PIC codes. As data movement (from die to network) is by far the most energy consuming part of an algorithm future computers will tend to increase memory locality at the hardware level and reduce energy consumption related to data movement by using more and more cores on each compute nodes (''fat nodes'') that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, CPU machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPU's also have a reduced clock speed per core and can process Multiple Instructions on Multiple Datas (MIMD). At the software level Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for Multicore/Manycore CPU) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high performance skeleton PIC code PICSAR to both achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the Pseudo-sepctral quasi-cylindrical code FBPIC on GPUs using the Numba python compiler.
Software Graphics Processing Unit (sGPU) for Deep Space Applications

NASA Technical Reports Server (NTRS)

McCabe, Mary; Salazar, George; Steele, Glen

2015-01-01

A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggest they cannot operate in the deep space radiation environment. Investigation into an Software Graphics Processing Unit (sGPU)comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
High-Performance Computational Analysis of Glioblastoma Pathology Images with Database Support Identifies Molecular and Survival Correlates.

PubMed

Kong, Jun; Wang, Fusheng; Teodoro, George; Cooper, Lee; Moreno, Carlos S; Kurc, Tahsin; Pan, Tony; Saltz, Joel; Brat, Daniel

2013-12-01

In this paper, we present a novel framework for microscopic image analysis of nuclei, data management, and high performance computation to support translational research involving nuclear morphometry features, molecular data, and clinical outcomes. Our image analysis pipeline consists of nuclei segmentation and feature computation facilitated by high performance computing with coordinated execution in multi-core CPUs and Graphical Processor Units (GPUs). All data derived from image analysis are managed in a spatial relational database supporting highly efficient scientific queries. We applied our image analysis workflow to 159 glioblastomas (GBM) from The Cancer Genome Atlas dataset. With integrative studies, we found statistics of four specific nuclear features were significantly associated with patient survival. Additionally, we correlated nuclear features with molecular data and found interesting results that support pathologic domain knowledge. We found that Proneural subtype GBMs had the smallest mean of nuclear Eccentricity and the largest mean of nuclear Extent, and MinorAxisLength. We also found gene expressions of stem cell marker MYC and cell proliferation maker MKI67 were correlated with nuclear features. To complement and inform pathologists of relevant diagnostic features, we queried the most representative nuclear instances from each patient population based on genetic and transcriptional classes. Our results demonstrate that specific nuclear features carry prognostic significance and associations with transcriptional and genetic classes, highlighting the potential of high throughput pathology image analysis as a complementary approach to human-based review and translational research.
A fiber optic temperature sensor based on multi-core microstructured fiber with coupled cores for a high temperature environment

NASA Astrophysics Data System (ADS)

Makowska, A.; Markiewicz, K.; Szostkiewicz, L.; Kolakowska, A.; Fidelus, J.; Stanczyk, T.; Wysokinski, K.; Budnicki, D.; Ostrowski, L.; Szymanski, M.; Makara, M.; Poturaj, K.; Tenderenda, T.; Mergo, P.; Nasilowski, T.

2018-02-01

Sensors based on fiber optics are irreplaceable wherever immunity to strong electro-magnetic fields or safe operation in explosive atmospheres is needed. Furthermore, it is often essential to be able to monitor high temperatures of over 500°C in such environments (e.g. in cooling systems or equipment monitoring in power plants). In order to meet this demand, we have designed and manufactured a fiber optic sensor with which temperatures up to 900°C can be measured. The sensor utilizes multi-core fibers which are recognized as the dedicated medium for telecommunication or shape sensing, but as we show may be also deployed advantageously in new types of fiber optic temperature sensors. The sensor presented in this paper is based on a dual-core microstructured fiber Michelson interferometer. The fiber is characterized by strongly coupled cores, hence it acts as an all-fiber coupler, but with an outer diameter significantly wider than a standard fused biconical taper coupler, which significantly increases the coupling region's mechanical reliability. Owing to the proposed interferometer imbalance, effective operation and high-sensitivity can be achieved. The presented sensor is designed to be used at high temperatures as a result of the developed low temperature chemical process of metal (copper or gold) coating. The hermetic metal coating can be applied directly to the silica cladding of the fiber or the fiber component. This operation significantly reduces the degradation of sensors due to hydrolysis in uncontrolled atmospheres and high temperatures.
Center for Technology for Advanced Scientific Componet Software (TASCS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Govindaraju, Madhusudhan

Advanced Scientific Computing Research Computer Science FY 2010Report Center for Technology for Advanced Scientific Component Software: Distributed CCA State University of New York, Binghamton, NY, 13902 Summary The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are working on re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB.more » We also worked on the design and implementation modules that will be required for the emerging MapReduce model to be effective for scientific applications. Finally, we focused on optimizing the processing of scientific datasets on multi-core processors. Research Details We worked on the following research projects that we are working on applying to CCA-based scientific applications. 1. Non-Hydrostatic Hydrodynamics: Non-static hydrodynamics are significantly more accurate at modeling internal waves that may be important in lake ecosystems. Non-hydrostatic codes, however, are significantly more computationally expensive, often prohibitively so. We have worked with Chin Wu at the University of Wisconsin to parallelize non-hydrostatic code. We have obtained a speed up of about 26 times maximum. Although this is significant progress, we hope to improve the performance further, such that it becomes a practical alternative to hydrostatic codes. 2. Model-coupling for water-based ecosystems: To answer pressing questions about water resources requires that physical models (hydrodynamics) be coupled with biological and chemical models. Most hydrodynamics codes are written in Fortran, however, while most ecologists work in MATLAB. This disconnect creates a great barrier. To address this, we are working on a model coupling interface that will allow biogeochemical computations written in MATLAB to couple with Fortran codes. This will greatly improve the productivity of ecosystem scientists. 2. Low overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications: Since its inception, MapReduce has frequently been associated with Hadoop and large-scale datasets. Its deployment at Amazon in the cloud, and its applications at Yahoo! for large-scale distributed document indexing and database building, among other tasks, have thrust MapReduce to the forefront of the data processing application domain. The applicability of the paradigm however extends far beyond its use with data intensive applications and diskbased systems, and can also be brought to bear in processing small but CPU intensive distributed applications. MapReduce however carries its own burdens. Through experiments using Hadoop in the context of diverse applications, we uncovered latencies and delay conditions potentially inhibiting the expected performance of a parallel execution in CPU-intensive applications. Furthermore, as it currently stands, MapReduce is favored for data-centric applications, and as such tends to be solely applied to disk-based applications. The paradigm, falls short in bringing its novelty to diskless systems dedicated to in-memory applications, and compute intensive programs processing much smaller data, but requiring intensive computations. In this project, we focused both on the performance of processing large-scale hierarchical data in distributed scientific applications, as well as the processing of smaller but demanding input sizes primarily used in diskless, and memory resident I/O systems. We designed LEMO-MR [1], a Low overhead, elastic, configurable for in- memory applications, and on-demand fault tolerance, an optimized implementation of MapReduce, for both on disk and in memory applications. We conducted experiments to identify not only the necessary components of this model, but also trade offs and factors to be considered. We have initial results to show the efficacy of our implementation in terms of potential speedup that can be achieved for representative data sets used by cloud applications. We have quantified the performance gains exhibited by our MapReduce implementation over Apache Hadoop in a compute intensive environment. 3. Cache Performance Optimization for Processing XML and HDF-based Application Data on Multi-core Processors: It is important to design and develop scientific middleware libraries to harness the opportunities presented by emerging multi-core processors. Implementations of scientific middleware and applications that do not adapt to the programming paradigm when executing on emerging processors can severely impact the overall performance. In this project, we focused on the utilization of the L2 cache, which is a critical shared resource on chip multiprocessors (CMP). The access pattern of the shared L2 cache, which is dependent on how the application schedules and assigns processing work to each thread, can either enhance or hurt the ability to hide memory latency on a multi-core processor. Therefore, while processing scientific datasets such as HDF5, it is essential to conduct fine-grained analysis of cache utilization, to inform scheduling decisions in multi-threaded programming. In this project, using the TAU toolkit for performance feedback from dual- and quad-core machines, we conducted performance analysis and recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of scientific applications that requires processing of HDF5 data. In particular, we quantified the gains associated with the use of the adaptations we have made to the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time [2]. References: 1. Zacharia Fadika, Madhusudhan Govindaraju, ``MapReduce Implementation for Memory-Based and Processing Intensive Applications'', accepted in 2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, USA, Nov 30 - Dec 3, 2010. 2. Rajdeep Bhowmik, Madhusudhan Govindaraju, ``Cache Performance Optimization for Processing XML-based Application Data on Multi-core Processors'', in proceedings of The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 17-20, 2010, Melbourne, Victoria, Australia. Contact Information: Madhusudhan Govindaraju Binghamton University State University of New York (SUNY) mgovinda@cs.binghamton.edu Phone: 607-777-4904« less
Supermodes in Coupled Multi-Core Waveguide Structures

DTIC Science & Technology

2016-04-01

and therefore can be treated as linear polarization (LP) modes. In essence, the LP modes are scalar approximations of the vector mode fields and contain...field, including the discovery of optical discrete solitons , Bragg and vector solitons in fibers, nonlinear surface waves, and the discovery of self...increased for an isolated core, it can guide high-order modes. For optical fibers with low re- fractive index contrast, the vector modes are weakly guided
DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu, Chase

A number of Department of Energy (DOE) science applications, involving exascale computing systems and large experimental facilities, are expected to generate large volumes of data, in the range of petabytes to exabytes, which will be transported over wide-area networks for the purpose of storage, visualization, and analysis. The objectives of this proposal are to (1) develop and test the component technologies and their synthesis methods to achieve source-to-sink high-performance flows, and (2) develop tools that provide these capabilities through simple interfaces to users and applications. In terms of the former, we propose to develop (1) optimization methods that align andmore » transition multiple storage flows to multiple network flows on multicore, multibus hosts; and (2) edge and long-haul network path realization and maintenance using advanced provisioning methods including OSCARS and OpenFlow. We also propose synthesis methods that combine these individual technologies to compose high-performance flows using a collection of constituent storage-network flows, and realize them across the storage and local network connections as well as long-haul connections. We propose to develop automated user tools that profile the hosts, storage systems, and network connections; compose the source-to-sink complex flows; and set up and maintain the needed network connections.« less
Software for Brain Network Simulations: A Comparative Study

PubMed Central

Tikidji-Hamburyan, Ruben A.; Narayana, Vikram; Bozkus, Zeki; El-Ghazawi, Tarek A.

2017-01-01

Numerical simulations of brain networks are a critical part of our efforts in understanding brain functions under pathological and normal conditions. For several decades, the community has developed many software packages and simulators to accelerate research in computational neuroscience. In this article, we select the three most popular simulators, as determined by the number of models in the ModelDB database, such as NEURON, GENESIS, and BRIAN, and perform an independent evaluation of these simulators. In addition, we study NEST, one of the lead simulators of the Human Brain Project. First, we study them based on one of the most important characteristics, the range of supported models. Our investigation reveals that brain network simulators may be biased toward supporting a specific set of models. However, all simulators tend to expand the supported range of models by providing a universal environment for the computational study of individual neurons and brain networks. Next, our investigations on the characteristics of computational architecture and efficiency indicate that all simulators compile the most computationally intensive procedures into binary code, with the aim of maximizing their computational performance. However, not all simulators provide the simplest method for module development and/or guarantee efficient binary code. Third, a study of their amenability for high-performance computing reveals that NEST can almost transparently map an existing model on a cluster or multicore computer, while NEURON requires code modification if the model developed for a single computer has to be mapped on a computational cluster. Interestingly, parallelization is the weakest characteristic of BRIAN, which provides no support for cluster computations and limited support for multicore computers. Fourth, we identify the level of user support and frequency of usage for all simulators. Finally, we carry out an evaluation using two case studies: a large network with simplified neural and synaptic models and a small network with detailed models. These two case studies allow us to avoid any bias toward a particular software package. The results indicate that BRIAN provides the most concise language for both cases considered. Furthermore, as expected, NEST mostly favors large network models, while NEURON is better suited for detailed models. Overall, the case studies reinforce our general observation that simulators have a bias in the computational performance toward specific types of the brain network models. PMID:28775687
Fault-Tolerant Software-Defined Radio on Manycore

NASA Technical Reports Server (NTRS)

Ricketts, Scott

2015-01-01

Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiationhardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload-reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.
300 Gb/s IM/DD based SDM-WDM-PON with laserless ONUs.

PubMed

Bao, Fangdi; Morioka, Toshio; Oxenløwe, Leif K; Hu, Hao

2018-04-02

A low-cost, high-speed SDM-WDM-PON architecture is proposed by using a multi-core fiber (MCF) and intensity modulation/directly detection (IM/DD). One of the MCF cores is used for sending laser sources from optical line terminal (OLT) to optical network unit (ONU), thus facilitating laserless and colorless ONUs, and providing ease of network management and maintenance. In addition, the wavelengths of the ONUs are controlled on the OLT side, which also enables flexible optical networks. Thanks to the low inter-core crosstalk of a MCF, downstream (DS) and upstream (US) signals are transmitted independently in different cores of the MCF, not only increasing the aggregated capacity but also avoiding the Rayleigh backscattering noise. Finally, a proof-of-principle experiment is performed by using a 7-core fiber, achieving 300 /120 Gb/s aggregated capacity for DS and US (3 × cores, 4 × wavelengths, 25/10 Gb/s per wavelength), respectively.

Exploring Manycore Multinode Systems for Irregular Applications with FPGA Prototyping

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ceriani, Marco; Palermo, Gianluca; Secchi, Simone

We present a prototype of a multi-core architecture implemented on FPGA, designed to enable efficient execution of irregular applications on distributed shared memory machines, while maintaining high performance on regular workloads. The architecture is composed of off-the-shelf soft-core cores, local interconnection and memory interface, integrated with custom components that optimize it for irregular applications. It relies on three key elements: a global address space, multithreading, and fine-grained synchronization. Global addresses are scrambled to reduce the formation of network hot-spots, while the latency of the transactions is covered by integrating an hardware scheduler within the custom load/store buffers to take advantagemore » from the availability of multiple executions threads, increasing the efficiency in a transparent way to the application. We evaluated a dual node system irregular kernels showing scalability in the number of cores and threads.« less
Diffractive optics for combined spatial- and mode- division demultiplexing of optical vortices: design, fabrication and optical characterization.

PubMed

Ruffato, Gianluca; Massari, Michele; Romanato, Filippo

2016-04-20

During the last decade, the orbital angular momentum (OAM) of light has attracted growing interest as a new degree of freedom for signal channel multiplexing in order to increase the information transmission capacity in today's optical networks. Here we present the design, fabrication and characterization of phase-only diffractive optical elements (DOE) performing mode-division (de)multiplexing (MDM) and spatial-division (de)multiplexing (SDM) at the same time. Samples have been fabricated with high-resolution electron-beam lithography patterning a polymethylmethacrylate (PMMA) resist layer spun over a glass substrate. Different DOE designs are presented for the sorting of optical vortices differing in either OAM content or beam size in the optical regime, with different steering geometries in far-field. These novel DOE designs appear promising for telecom applications both in free-space and in multi-core fibers propagation.
Magnetism and Mössbauer study of formation of multi-core γ -Fe2O3 nanoparticles

NASA Astrophysics Data System (ADS)

Kamali, Saeed; Bringas, Eugenio; Hah, Hien-Yoong; Bates, Brian; Johnson, Jacqueline A.; Johnson, Charles E.; Stroeve, Pieter

2018-04-01

A systematic investigation of magnetic nanoparticles and the formation of a core-shell structure, consisting of multiple maghemite (γ -Fe2O3) nanoparticles as the core and silica as the shell, has been performed using various techniques. High-resolution transmission electron microscopy clearly shows isolated maghemite nanoparticles with an average diameter of 13 nm and the formation of a core-shell structure. Low temperature Mössbauer spectroscopy reveals the presence of pure maghemite nanoparticles with all vacancies at the B-sites. Isothermal magnetization and zero-field-cooled and field-cooled measurements are used for investigating the magnetic properties of the nanoparticles. The magnetization results are in good accordance with the contents of the magnetic core and the non-magnetic shell. The multiple-core γ -Fe2O3 nanoparticles show similar behavior to isolated particles of the same size.
Stack-and-Draw Manufacture Process of a Seven-Core Optical Fiber for Fluorescence Measurements

NASA Astrophysics Data System (ADS)

Samir, Ahmed; Batagelj, Bostjan

2018-01-01

Multi-core, optical-fiber technology is expected to be used in telecommunications and sensory systems in a relatively short amount of time. However, a successful transition from research laboratories to industry applications will only be possible with an optimized design and manufacturing process. The fabrication process is an important aspect in designing and developing new multi-applicable, multi-core fibers, where the best candidate is a seven-core fiber. Here, the basics for designing and manufacturing a single-mode, seven-core fiber using the stack-and-draw process is described for the example of a fluorescence sensory system.
Design of Multi-core Fiber Patch Panel for Space Division Multiplexing Implementations

NASA Astrophysics Data System (ADS)

González, Luz E.; Morales, Alvaro; Rommel, Simon; Jørgensen, Bo F.; Porras-Montenegro, N.; Tafur Monroy, Idelfonso

2018-03-01

A multi-core fiber (MCF) patch panel was designed, allowing easy coupling of individual signals to and from a 7-core MCF. The device was characterized, measuring insertion loss and cross talk, finding highest insertion loss and lowest crosstalk at 1300 nm with values of 9.7 dB and -36.5 dB respectively, while at 1600 nm insertion loss drops to 4.8 dB and crosstalk increases to -24.1 dB. Two MCF splices between the fan-in module, the MCF, and the fan-out module are included in the characterization, and splicing parameters are discussed.
Experiments and Analyses of Data Transfers Over Wide-Area Dedicated Connections

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rao, Nageswara S.; Liu, Qiang; Sen, Satyabrata

Dedicated wide-area network connections are increasingly employed in high-performance computing and big data scenarios. One might expect the performance and dynamics of data transfers over such connections to be easy to analyze due to the lack of competing traffic. However, non-linear transport dynamics and end-system complexities (e.g., multi-core hosts and distributed filesystems) can in fact make analysis surprisingly challenging. We present extensive measurements of memory-to-memory and disk-to-disk file transfers over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory-to-memory transfers, profiles of both TCP and UDT throughput as a function of RTT show concavemore » and convex regions; large buffer sizes and more parallel flows lead to wider concave regions, which are highly desirable. TCP and UDT both also display complex throughput dynamics, as indicated by their Poincare maps and Lyapunov exponents. For disk-to-disk transfers, we determine that high throughput can be achieved via a combination of parallel I/O threads, parallel network threads, and direct I/O mode. Our measurements also show that Lustre filesystems can be mounted over long-haul connections using LNet routers, although challenges remain in jointly optimizing file I/O and transport method parameters to achieve peak throughput.« less
Maximal clique enumeration with data-parallel primitives

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lessley, Brenton; Perciano, Talita; Mathai, Manish

The enumeration of all maximal cliques in an undirected graph is a fundamental problem arising in several research areas. We consider maximal clique enumeration on shared-memory, multi-core architectures and introduce an approach consisting entirely of data-parallel operations, in an effort to achieve efficient and portable performance across different architectures. We study the performance of the algorithm via experiments varying over benchmark graphs and architectures. Overall, we observe that our algorithm achieves up to a 33-time speedup and 9-time speedup over state-of-the-art distributed and serial algorithms, respectively, for graphs with higher ratios of maximal cliques to total cliques. Further, we attainmore » additional speedups on a GPU architecture, demonstrating the portable performance of our data-parallel design.« less
Geospatial Applications on Different Parallel and Distributed Systems in enviroGRIDS Project

NASA Astrophysics Data System (ADS)

Rodila, D.; Bacu, V.; Gorgan, D.

2012-04-01

The execution of Earth Science applications and services on parallel and distributed systems has become a necessity especially due to the large amounts of Geospatial data these applications require and the large geographical areas they cover. The parallelization of these applications comes to solve important performance issues and can spread from task parallelism to data parallelism as well. Parallel and distributed architectures such as Grid, Cloud, Multicore, etc. seem to offer the necessary functionalities to solve important problems in the Earth Science domain: storing, distribution, management, processing and security of Geospatial data, execution of complex processing through task and data parallelism, etc. A main goal of the FP7-funded project enviroGRIDS (Black Sea Catchment Observation and Assessment System supporting Sustainable Development) [1] is the development of a Spatial Data Infrastructure targeting this catchment region but also the development of standardized and specialized tools for storing, analyzing, processing and visualizing the Geospatial data concerning this area. For achieving these objectives, the enviroGRIDS deals with the execution of different Earth Science applications, such as hydrological models, Geospatial Web services standardized by the Open Geospatial Consortium (OGC) and others, on parallel and distributed architecture to maximize the obtained performance. This presentation analysis the integration and execution of Geospatial applications on different parallel and distributed architectures and the possibility of choosing among these architectures based on application characteristics and user requirements through a specialized component. Versions of the proposed platform have been used in enviroGRIDS project on different use cases such as: the execution of Geospatial Web services both on Web and Grid infrastructures [2] and the execution of SWAT hydrological models both on Grid and Multicore architectures [3]. The current focus is to integrate in the proposed platform the Cloud infrastructure, which is still a paradigm with critical problems to be solved despite the great efforts and investments. Cloud computing comes as a new way of delivering resources while using a large set of old as well as new technologies and tools for providing the necessary functionalities. The main challenges in the Cloud computing, most of them identified also in the Open Cloud Manifesto 2009, address resource management and monitoring, data and application interoperability and portability, security, scalability, software licensing, etc. We propose a platform able to execute different Geospatial applications on different parallel and distributed architectures such as Grid, Cloud, Multicore, etc. with the possibility of choosing among these architectures based on application characteristics and complexity, user requirements, necessary performances, cost support, etc. The execution redirection on a selected architecture is realized through a specialized component and has the purpose of offering a flexible way in achieving the best performances considering the existing restrictions.
Using the cloud to speed-up calibration of watershed-scale hydrologic models (Invited)

NASA Astrophysics Data System (ADS)

Goodall, J. L.; Ercan, M. B.; Castronova, A. M.; Humphrey, M.; Beekwilder, N.; Steele, J.; Kim, I.

2013-12-01

This research focuses on using the cloud to address computational challenges associated with hydrologic modeling. One example is calibration of a watershed-scale hydrologic model, which can take days of execution time on typical computers. While parallel algorithms for model calibration exist and some researchers have used multi-core computers or clusters to run these algorithms, these solutions do not fully address the challenge because (i) calibration can still be too time consuming even on multicore personal computers and (ii) few in the community have the time and expertise needed to manage a compute cluster. Given this, another option for addressing this challenge that we are exploring through this work is the use of the cloud for speeding-up calibration of watershed-scale hydrologic models. The cloud used in this capacity provides a means for renting a specific number and type of machines for only the time needed to perform a calibration model run. The cloud allows one to precisely balance the duration of the calibration with the financial costs so that, if the budget allows, the calibration can be performed more quickly by renting more machines. Focusing specifically on the SWAT hydrologic model and a parallel version of the DDS calibration algorithm, we show significant speed-up time across a range of watershed sizes using up to 256 cores to perform a model calibration. The tool provides a simple web-based user interface and the ability to monitor the calibration job submission process during the calibration process. Finally this talk concludes with initial work to leverage the cloud for other tasks associated with hydrologic modeling including tasks related to preparing inputs for constructing place-based hydrologic models.
MUTILS - a set of efficient modeling tools for multi-core CPUs implemented in MEX

NASA Astrophysics Data System (ADS)

Krotkiewski, Marcin; Dabrowski, Marcin

2013-04-01

The need for computational performance is common in scientific applications, and in particular in numerical simulations, where high resolution models require efficient processing of large amounts of data. Especially in the context of geological problems the need to increase the model resolution to resolve physical and geometrical complexities seems to have no limits. Alas, the performance of new generations of CPUs does not improve any longer by simply increasing clock speeds. Current industrial trends are to increase the number of computational cores. As a result, parallel implementations are required in order to fully utilize the potential of new processors, and to study more complex models. We target simulations on small to medium scale shared memory computers: laptops and desktop PCs with ~8 CPU cores and up to tens of GB of memory to high-end servers with ~50 CPU cores and hundereds of GB of memory. In this setting MATLAB is often the environment of choice for scientists that want to implement their own models with little effort. It is a useful general purpose mathematical software package, but due to its versatility some of its functionality is not as efficient as it could be. In particular, the challanges of modern multi-core architectures are not fully addressed. We have developed MILAMIN 2 - an efficient FEM modeling environment written in native MATLAB. Amongst others, MILAMIN provides functions to define model geometry, generate and convert structured and unstructured meshes (also through interfaces to external mesh generators), compute element and system matrices, apply boundary conditions, solve the system of linear equations, address non-linear and transient problems, and perform post-processing. MILAMIN strives to combine the ease of code development and the computational efficiency. Where possible, the code is optimized and/or parallelized within the MATLAB framework. Native MATLAB is augmented with the MUTILS library - a set of MEX functions that implement the computationally intensive, performance critical parts of the code, which we have identified to be bottlenecks. Here, we discuss the functionality and performance of the MUTILS library. Currently, it includes: 1. time and memory efficient assembly of sparse matrices for FEM simulations 2. parallel sparse matrix - vector product with optimizations speficic to symmetric matrices and multiple degrees of freedom per node 3. parallel point in triangle location and point in tetrahedron location for unstructured, adaptive 2D and 3D meshes (useful for 'marker in cell' type of methods) 4. parallel FEM interpolation for 2D and 3D meshes of elements of different types and orders, and for different number of degrees of freedom per node 5. a stand-alone, MEX implementation of the Conjugate Gradients iterative solver 6. interface to METIS graph partitioning and a fast implementation of RCM reordering
We introduce an algorithm for the simultaneous reconstruction of faults and slip fields. We prove that the minimum of a related regularized functional converges to the unique solution of the fault inverse problem. We consider a Bayesian approach. We use a parallel multi-core platform and we discuss techniques to save on computational time.

NASA Astrophysics Data System (ADS)

Volkov, D.

2017-12-01

We introduce an algorithm for the simultaneous reconstruction of faults and slip fields on those faults. We define a regularized functional to be minimized for the reconstruction. We prove that the minimum of that functional converges to the unique solution of the related fault inverse problem. Due to inherent uncertainties in measurements, rather than seeking a deterministic solution to the fault inverse problem, we consider a Bayesian approach. The advantage of such an approach is that we obtain a way of quantifying uncertainties as part of our final answer. On the downside, this Bayesian approach leads to a very large computation. To contend with the size of this computation we developed an algorithm for the numerical solution to the stochastic minimization problem which can be easily implemented on a parallel multi-core platform and we discuss techniques to save on computational time. After showing how this algorithm performs on simulated data and assessing the effect of noise, we apply it to measured data. The data was recorded during a slow slip event in Guerrero, Mexico.
A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs).

PubMed

Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo

2018-02-01

Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.
Bristol Ridge: A 28-nm $$\\times$$ 86 Performance-Enhanced Microprocessor Through System Power Management

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sundaram, Sriram; Grenat, Aaron; Naffziger, Samuel

Power management techniques can be effective at extracting more performance and energy efficiency out of mature systems on chip (SoCs). For instance, the peak performance of microprocessors is often limited by worst case technology (Vmax), infrastructure (thermal/electrical), and microprocessor usage assumptions. Performance/watt of microprocessors also typically suffers from guard bands associated with the test and binning processes as well as worst case aging/lifetime degradation. Similarly, on multicore processors, shared voltage rails tend to limit the peak performance achievable in low thread count workloads. In this paper, we describe five power management techniques that maximize the per-part performance under the before-mentionedmore » constraints. Using these techniques, we demonstrate a net performance increase of up to 15% depending on the application and TDP of the SoC, implemented on 'Bristol Ridge,' a 28-nm CMOS, dual-core x 86 accelerated processing unit.« less
Parallel processing architecture for H.264 deblocking filter on multi-core platforms

NASA Astrophysics Data System (ADS)

Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

2012-03-01

Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi core platforms such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub blocks, and pixel row level are examined in this work. The deblocking architecture consists of a basic cell called deblocking filter unit (DFU) and dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs the DFM serves the data required for the different number of DFUs, and also manages all the neighboring data required for future data processing of DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.
An accuracy aware low power wireless EEG unit with information content based adaptive data compression.

PubMed

Tolbert, Jeremy R; Kabali, Pratik; Brar, Simeranjit; Mukhopadhyay, Saibal

2009-01-01

We present a digital system for adaptive data compression for low power wireless transmission of Electroencephalography (EEG) data. The proposed system acts as a base-band processor between the EEG analog-to-digital front-end and RF transceiver. It performs a real-time accuracy energy trade-off for multi-channel EEG signal transmission by controlling the volume of transmitted data. We propose a multi-core digital signal processor for on-chip processing of EEG signals, to detect signal information of each channel and perform real-time adaptive compression. Our analysis shows that the proposed approach can provide significant savings in transmitter power with minimal impact on the overall signal accuracy.
Vascular system modeling in parallel environment - distributed and shared memory approaches

PubMed Central

Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

2011-01-01

The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Parallel processing implementation for the coupled transport of photons and electrons using OpenMP

NASA Astrophysics Data System (ADS)

Doerner, Edgardo

2016-05-01

In this work the use of OpenMP to implement the parallel processing of the Monte Carlo (MC) simulation of the coupled transport for photons and electrons is presented. This implementation was carried out using a modified EGSnrc platform which enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the developing tools available in the Intel Parallel Studio XE 2015 (XE2015). The performance study of this new implementation was carried out in a desktop PC with a multi-core CPU, taking as a reference the performance of the original platform. The results were satisfactory, both in terms of scalability as parallelization efficiency.
SpaceCubeX: A Framework for Evaluating Hybrid Multi-Core CPU FPGA DSP Architectures

NASA Technical Reports Server (NTRS)

Schmidt, Andrew G.; Weisz, Gabriel; French, Matthew; Flatley, Thomas; Villalpando, Carlos Y.

2017-01-01

The SpaceCubeX project is motivated by the need for high performance, modular, and scalable on-board processing to help scientists answer critical 21st century questions about global climate change, air quality, ocean health, and ecosystem dynamics, while adding new capabilities such as low-latency data products for extreme event warnings. These goals translate into on-board processing throughput requirements that are on the order of 100-1,000 more than those of previous Earth Science missions for standard processing, compression, storage, and downlink operations. To study possible future architectures to achieve these performance requirements, the SpaceCubeX project provides an evolvable testbed and framework that enables a focused design space exploration of candidate hybrid CPU/FPGA/DSP processing architectures. The framework includes ArchGen, an architecture generator tool populated with candidate architecture components, performance models, and IP cores, that allows an end user to specify the type, number, and connectivity of a hybrid architecture. The framework requires minimal extensions to integrate new processors, such as the anticipated High Performance Spaceflight Computer (HPSC), reducing time to initiate benchmarking by months. To evaluate the framework, we leverage a wide suite of high performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization. We report on our projects Year 1 efforts and demonstrate the capabilities across four simulation testbed models, a baseline SpaceCube 2.0 system, a dual ARM A9 processor system, a hybrid quad ARM A53 and FPGA system, and a hybrid quad ARM A53 and DSP system.
An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

DOE PAGES

Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois -Henry; ...

2016-10-27

Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factoriz ation leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7 fold for problems in our test suite.more » The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.« less
Parallel peak pruning for scalable SMP contour tree computation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.

As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this formmore » of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, withfor-mal guarantees of O(lgnlgt) parallel steps and O(n lgn) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.« less

End-To-End performance test of the LINC-NIRVANA Wavefront-Sensor system.

NASA Astrophysics Data System (ADS)

Berwein, Juergen; Bertram, Thomas; Conrad, Al; Briegel, Florian; Kittmann, Frank; Zhang, Xiangyu; Mohr, Lars

2011-09-01

LINC-NIRVANA is an imaging Fizeau interferometer, for use in near infrared wavelengths, being built for the Large Binocular Telescope. Multi-conjugate adaptive optics (MCAO) increases the sky coverage and the field of view over which diffraction limited images can be obtained. For its MCAO implementation, Linc-Nirvana utilizes four total wavefront sensors; each of the two beams is corrected by both a ground-layer wavefront sensor (GWS) and a high-layer wavefront sensor (HWS). The GWS controls the adaptive secondary deformable mirror (DM), which is based on an DSP slope computing unit. Whereas the HWS controls an internal DM via computations provided by an off-the-shelf multi-core Linux system. Using wavefront sensor data collected from a prior lab experiment, we have shown via simulation that the Linux based system is sufficient to operate at 1kHz, with jitter well below the needs of the final system. Based on that setup we tested the end-to-end performance and latency through all parts of the system which includes the camera, the wavefront controller, and the deformable mirror. We will present our loop control structure and the results of those performance tests.
Experimental observation of spontaneous depolarized guided acoustic-wave Brillouin scattering in side cores of a multicore fiber

NASA Astrophysics Data System (ADS)

Hayashi, Neisei; Mizuno, Yosuke; Nakamura, Kentaro; Set, Sze Yun; Yamashita, Shinji

2018-06-01

Spontaneous depolarized guided acoustic-wave Brillouin scattering (GAWBS) was experimentally observed in one of the side cores of an uncoated multicore fiber (MCF). The frequency bandwidth in the side core was up to ∼400 MHz, which is 0.5 times that in the central core. The GAWBS spectrum of the side core of the MCF included intrinsic peaks, which had different acoustic resonance frequencies from those of the central core. In addition, the spontaneous depolarized GAWBS in the central/side core was unaffected by that in the other core. These results will lead to the development of polarization/phase modulators using an MCF.
Single-shot polarimetry imaging of multicore fiber.

PubMed

Sivankutty, Siddharth; Andresen, Esben Ravn; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

2016-05-01

We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element, together with an analyzer, and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining fiber bundles. We demonstrate also that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay, and diattenuation.
Shape sensing using multi-core fiber optic cable and parametric curve solutions.

PubMed

Moore, Jason P; Rogge, Matthew D

2012-01-30

The shape of a multi-core optical fiber is calculated by numerically solving a set of Frenet-Serret equations describing the path of the fiber in three dimensions. Included in the Frenet-Serret equations are curvature and bending direction functions derived from distributed fiber Bragg grating strain measurements in each core. The method offers advantages over prior art in that it determines complex three-dimensional fiber shape as a continuous parametric solution rather than an integrated series of discrete planar bends. Results and error analysis of the method using a tri-core optical fiber is presented. Maximum error expressed as a percentage of fiber length was found to be 7.2%.
Tunable arbitrary unitary transformer based on multiple sections of multicore fibers with phase control.

PubMed

Zhou, Junhe; Wu, Jianjie; Hu, Qinsong

2018-02-05

In this paper, we propose a novel tunable unitary transformer, which can achieve arbitrary discrete unitary transforms. The unitary transformer is composed of multiple sections of multi-core fibers with closely aligned coupled cores. Phase shifters are inserted before and after the sections to control the phases of the waves in the cores. A simple algorithm is proposed to find the optimal phase setup for the phase shifters to realize the desired unitary transforms. The proposed device is fiber based and is particularly suitable for the mode division multiplexing systems. A tunable mode MUX/DEMUX for a three-mode fiber is designed based on the proposed structure.
Classification of Magnetic Nanoparticle Systems—Synthesis, Standardization and Analysis Methods in the NanoMag Project

PubMed Central

Bogren, Sara; Fornara, Andrea; Ludwig, Frank; del Puerto Morales, Maria; Steinhoff, Uwe; Fougt Hansen, Mikkel; Kazakova, Olga; Johansson, Christer

2015-01-01

This study presents classification of different magnetic single- and multi-core particle systems using their measured dynamic magnetic properties together with their nanocrystal and particle sizes. The dynamic magnetic properties are measured with AC (dynamical) susceptometry and magnetorelaxometry and the size parameters are determined from electron microscopy and dynamic light scattering. Using these methods, we also show that the nanocrystal size and particle morphology determines the dynamic magnetic properties for both single- and multi-core particles. The presented results are obtained from the four year EU NMP FP7 project, NanoMag, which is focused on standardization of analysis methods for magnetic nanoparticles. PMID:26343639
Fluorosilicate and fluorophosphate superfluorescent multicore optical fibers co-doped with Nd3+/Yb3+

NASA Astrophysics Data System (ADS)

Kochanowicz, M.; Zmojda, J.; Dorosz, D.

2014-06-01

In the paper spectroscopic properties of two fluorosilicate and fluorophosphate glass systems co-doped with Nd3+/Yb3+ ions are investigated. As a result of optical excitation at the wavelength of 808 nm strong and wide emission in the 1 μm region corresponding to the superposition of optical transitions 4F3/2 → 4I11/2 (Nd3+) and 2F5/2 → 2F7/2 (Yb3+) can be observed. The optimization of Nd3+ → Yb3+ energy transfer in both glasses allows to manufacture multicore optical fibers with narrowing and red-shifting of amplified spontaneous emission (ASE) at 1.1 μm.
Seven-core multicore fiber transmissions for passive optical network.

PubMed

Zhu, B; Taunay, T F; Yan, M F; Fini, J M; Fishteyn, M; Monberg, E M; Dimarcello, F V

2010-05-24

We design and fabricate a novel multicore fiber (MCF), with seven cores arranged in a hexagonal array. The fiber properties of MCF including low crosstalk, attenuation and splice loss are described. A new tapered MCF connector (TMC), showing ultra-low crosstalk and losses, is also designed and fabricated for coupling the individual signals in-and-out of the MCF. We further propose a novel network configuration using parallel transmissions with the MCF and TMC for passive optical network (PON). To the best of our knowledge, we demonstrate the first bi-directional parallel transmissions of 1310 nm and 1490 nm signals over 11.3-km of seven-core MCF with 64-way splitter for PON.
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.
SAXS analysis of single- and multi-core iron oxide magnetic nanoparticles

PubMed Central

Szczerba, Wojciech; Costo, Rocio; Morales, Maria del Puerto; Thünemann, Andreas F.

2017-01-01

This article reports on the characterization of four superparamagnetic iron oxide nanoparticles stabilized with dimercaptosuccinic acid, which are suitable candidates for reference materials for magnetic properties. Particles p1 and p2 are single-core particles, while p3 and p4 are multi-core particles. Small-angle X-ray scattering analysis reveals a lognormal type of size distribution for the iron oxide cores of the particles. Their mean radii are 6.9 nm (p1), 10.6 nm (p2), 5.5 nm (p3) and 4.1 nm (p4), with narrow relative distribution widths of 0.08, 0.13, 0.08 and 0.12. The cores are arranged as a clustered network in the form of dense mass fractals with a fractal dimension of 2.9 in the multi-core particles p3 and p4, but the cores are well separated from each other by a protecting organic shell. The radii of gyration of the mass fractals are 48 and 44 nm, and each network contains 117 and 186 primary particles, respectively. The radius distributions of the primary particle were confirmed with transmission electron microscopy. All particles contain purely maghemite, as shown by X-ray absorption fine structure spectroscopy. PMID:28381973
Structural and magnetic properties of multi-core nanoparticles analysed using a generalised numerical inversion method

PubMed Central

Bender, P.; Bogart, L. K.; Posth, O.; Szczerba, W.; Rogers, S. E.; Castro, A.; Nilsson, L.; Zeng, L. J.; Sugunan, A.; Sommertune, J.; Fornara, A.; González-Alonso, D.; Barquín, L. Fernández; Johansson, C.

2017-01-01

The structural and magnetic properties of magnetic multi-core particles were determined by numerical inversion of small angle scattering and isothermal magnetisation data. The investigated particles consist of iron oxide nanoparticle cores (9 nm) embedded in poly(styrene) spheres (160 nm). A thorough physical characterisation of the particles included transmission electron microscopy, X-ray diffraction and asymmetrical flow field-flow fractionation. Their structure was ultimately disclosed by an indirect Fourier transform of static light scattering, small angle X-ray scattering and small angle neutron scattering data of the colloidal dispersion. The extracted pair distance distribution functions clearly indicated that the cores were mostly accumulated in the outer surface layers of the poly(styrene) spheres. To investigate the magnetic properties, the isothermal magnetisation curves of the multi-core particles (immobilised and dispersed in water) were analysed. The study stands out by applying the same numerical approach to extract the apparent moment distributions of the particles as for the indirect Fourier transform. It could be shown that the main peak of the apparent moment distributions correlated to the expected intrinsic moment distribution of the cores. Additional peaks were observed which signaled deviations of the isothermal magnetisation behavior from the non-interacting case, indicating weak dipolar interactions. PMID:28397851
LHCb detector and trigger performance in Run II

NASA Astrophysics Data System (ADS)

Francesca, Dordei

2017-12-01

The LHCb detector is a forward spectrometer at the LHC, designed to perform high precision studies of b- and c- hadrons. In Run II of the LHC, a new scheme for the software trigger at LHCb allows splitting the triggering of events into two stages, giving room to perform the alignment and calibration in real time. In the novel detector alignment and calibration strategy for Run II, data collected at the start of the fill are processed in a few minutes and used to update the alignment, while the calibration constants are evaluated for each run. This allows identical constants to be used in the online and offline reconstruction, thus improving the correlation between triggered and offline selected events. The required computing time constraints are met thanks to a new dedicated framework using the multi-core farm infrastructure for the trigger. The larger timing budget, available in the trigger, allows to perform the same track reconstruction online and offline. This enables LHCb to achieve the best reconstruction performance already in the trigger, and allows physics analyses to be performed directly on the data produced by the trigger reconstruction. The novel real-time processing strategy at LHCb is discussed from both the technical and operational point of view. The overall performance of the LHCb detector on the data of Run II is presented as well.
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

2017-01-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particlemore » tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.« less
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2017-08-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Kalman Filter Tracking on Parallel Architectures

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2016-11-01

Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
Fast prediction of RNA-RNA interaction using heuristic algorithm.

PubMed

Montaseri, Soheila

2015-01-01

Interaction between two RNA molecules plays a crucial role in many medical and biological processes such as gene expression regulation. In this process, an RNA molecule prohibits the translation of another RNA molecule by establishing stable interactions with it. Some algorithms have been formed to predict the structure of the RNA-RNA interaction. High computational time is a common challenge in most of the presented algorithms. In this context, a heuristic method is introduced to accurately predict the interaction between two RNAs based on minimum free energy (MFE). This algorithm uses a few dot matrices for finding the secondary structure of each RNA and binding sites between two RNAs. Furthermore, a parallel version of this method is presented. We describe the algorithm's concurrency and parallelism for a multicore chip. The proposed algorithm has been performed on some datasets including CopA-CopT, R1inv-R2inv, Tar-Tar*, DIS-DIS, and IncRNA54-RepZ in Escherichia coli bacteria. The method has high validity and efficiency, and it is run in low computational time in comparison to other approaches.
Resilient workflows for computational mechanics platforms

NASA Astrophysics Data System (ADS)

Nguyên, Toàn; Trifan, Laurentiu; Désidéri, Jean-Antoine

2010-06-01

Workflow management systems have recently been the focus of much interest and many research and deployment for scientific applications worldwide [26, 27]. Their ability to abstract the applications by wrapping application codes have also stressed the usefulness of such systems for multidiscipline applications [23, 24]. When complex applications need to provide seamless interfaces hiding the technicalities of the computing infrastructures, their high-level modeling, monitoring and execution functionalities help giving production teams seamless and effective facilities [25, 31, 33]. Software integration infrastructures based on programming paradigms such as Python, Mathlab and Scilab have also provided evidence of the usefulness of such approaches for the tight coupling of multidisciplne application codes [22, 24]. Also high-performance computing based on multi-core multi-cluster infrastructures open new opportunities for more accurate, more extensive and effective robust multi-discipline simulations for the decades to come [28]. This supports the goal of full flight dynamics simulation for 3D aircraft models within the next decade, opening the way to virtual flight-tests and certification of aircraft in the future [23, 24, 29].
Experimental Analysis of File Transfer Rates over Wide-Area Dedicated Connections

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rao, Nageswara S.; Liu, Qiang; Sen, Satyabrata

2016-12-01

File transfers over dedicated connections, supported by large parallel file systems, have become increasingly important in high-performance computing and big data workflows. It remains a challenge to achieve peak rates for such transfers due to the complexities of file I/O, host, and network transport subsystems, and equally importantly, their interactions. We present extensive measurements of disk-to-disk file transfers using Lustre and XFS file systems mounted on multi-core servers over a suite of 10 Gbps emulated connections with 0-366 ms round trip times. Our results indicate that large buffer sizes and many parallel flows do not always guarantee high transfer rates.more » Furthermore, large variations in the measured rates necessitate repeated measurements to ensure confidence in inferences based on them. We propose a new method to efficiently identify the optimal joint file I/O and network transport parameters using a small number of measurements. We show that for XFS and Lustre with direct I/O, this method identifies configurations achieving 97% of the peak transfer rate while probing only 12% of the parameter space.« less
Efficient Geometric Sound Propagation Using Visibility Culling

NASA Astrophysics Data System (ADS)

Chandak, Anish

2011-07-01

Simulating propagation of sound can improve the sense of realism in interactive applications such as video games and can lead to better designs in engineering applications such as architectural acoustics. In this thesis, we present geometric sound propagation techniques which are faster than prior methods and map well to upcoming parallel multi-core CPUs. We model specular reflections by using the image-source method and model finite-edge diffraction by using the well-known Biot-Tolstoy-Medwin (BTM) model. We accelerate the computation of specular reflections by applying novel visibility algorithms, FastV and AD-Frustum, which compute visibility from a point. We accelerate finite-edge diffraction modeling by applying a novel visibility algorithm which computes visibility from a region. Our visibility algorithms are based on frustum tracing and exploit recent advances in fast ray-hierarchy intersections, data-parallel computations, and scalable, multi-core algorithms. The AD-Frustum algorithm adapts its computation to the scene complexity and allows small errors in computing specular reflection paths for higher computational efficiency. FastV and our visibility algorithm from a region are general, object-space, conservative visibility algorithms that together significantly reduce the number of image sources compared to other techniques while preserving the same accuracy. Our geometric propagation algorithms are an order of magnitude faster than prior approaches for modeling specular reflections and two to ten times faster for modeling finite-edge diffraction. Our algorithms are interactive, scale almost linearly on multi-core CPUs, and can handle large, complex, and dynamic scenes. We also compare the accuracy of our sound propagation algorithms with other methods. Once sound propagation is performed, it is desirable to listen to the propagated sound in interactive and engineering applications. We can generate smooth, artifact-free output audio signals by applying efficient audio-processing algorithms. We also present the first efficient audio-processing algorithm for scenarios with simultaneously moving source and moving receiver (MS-MR) which incurs less than 25% overhead compared to static source and moving receiver (SS-MR) or moving source and static receiver (MS-SR) scenario.
On Data Transfers Over Wide-Area Dedicated Connections

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rao, Nageswara S.; Liu, Qiang

Dedicated wide-area network connections are employed in big data and high-performance computing scenarios, since the absence of cross-traffic promises to make it easier to analyze and optimize data transfers over them. However, nonlinear transport dynamics and end-system complexity due to multi-core hosts and distributed file systems make these tasks surprisingly challenging. We present an overview of methods to analyze memory and disk file transfers using extensive measurements over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory transfers, we derive performance profiles of TCP and UDT throughput as a function of RTT, which showmore » concave regions in contrast to entirely convex regions predicted by previous models. These highly desirable concave regions can be expanded by utilizing large buffers and more parallel flows. We also present Poincar´e maps and Lyapunov exponents of TCP and UDT throughputtraces that indicate complex throughput dynamics. For disk file transfers, we show that throughput can be optimized using a combination of parallel I/O and network threads under direct I/O mode. Our initial throughput measurements of Lustre filesystems mounted over long-haul connections using LNet routers show convex profiles indicative of I/O limits.« less

Impact of Recent Hardware and Software Trends on High Performance Transaction Processing and Analytics

NASA Astrophysics Data System (ADS)

Mohan, C.

In this paper, I survey briefly some of the recent and emerging trends in hardware and software features which impact high performance transaction processing and data analytics applications. These features include multicore processor chips, ultra large main memories, flash storage, storage class memories, database appliances, field programmable gate arrays, transactional memory, key-value stores, and cloud computing. While some applications, e.g., Web 2.0 ones, were initially built without traditional transaction processing functionality in mind, slowly system architects and designers are beginning to address such previously ignored issues. The availability, analytics and response time requirements of these applications were initially given more importance than ACID transaction semantics and resource consumption characteristics. A project at IBM Almaden is studying the implications of phase change memory on transaction processing, in the context of a key-value store. Bitemporal data management has also become an important requirement, especially for financial applications. Power consumption and heat dissipation properties are also major considerations in the emergence of modern software and hardware architectural features. Considerations relating to ease of configuration, installation, maintenance and monitoring, and improvement of total cost of ownership have resulted in database appliances becoming very popular. The MapReduce paradigm is now quite popular for large scale data analysis, in spite of the major inefficiencies associated with it.
Time-efficient simulations of tight-binding electronic structures with Intel Xeon PhiTM many-core processors

NASA Astrophysics Data System (ADS)

Ryu, Hoon; Jeong, Yosang; Kang, Ji-Hoon; Cho, Kyu Nam

2016-12-01

Modelling of multi-million atomic semiconductor structures is important as it not only predicts properties of physically realizable novel materials, but can accelerate advanced device designs. This work elaborates a new Technology-Computer-Aided-Design (TCAD) tool for nanoelectronics modelling, which uses a sp3d5s∗ tight-binding approach to describe multi-million atomic structures, and simulate electronic structures with high performance computing (HPC), including atomic effects such as alloy and dopant disorders. Being named as Quantum simulation tool for Advanced Nanoscale Devices (Q-AND), the tool shows nice scalability on traditional multi-core HPC clusters implying the strong capability of large-scale electronic structure simulations, particularly with remarkable performance enhancement on latest clusters of Intel Xeon PhiTM coprocessors. A review of the recent modelling study conducted to understand an experimental work of highly phosphorus-doped silicon nanowires, is presented to demonstrate the utility of Q-AND. Having been developed via Intel Parallel Computing Center project, Q-AND will be open to public to establish a sound framework of nanoelectronics modelling with advanced HPC clusters of a many-core base. With details of the development methodology and exemplary study of dopant electronics, this work will present a practical guideline for TCAD development to researchers in the field of computational nanoelectronics.
A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Ediger, David; Jiang, Karl

2009-02-15

We present a new lock-free parallel algorithm for computing betweenness centralityof massive small-world networks. With minor changes to the data structures, ouralgorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 millionmore » vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less
A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Ediger, David; Jiang, Karl

2009-05-29

We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor.more » For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less
Addressing the challenges of standalone multi-core simulations in molecular dynamics

NASA Astrophysics Data System (ADS)

Ocaya, R. O.; Terblans, J. J.

2017-07-01

Computational modelling in material science involves mathematical abstractions of force fields between particles with the aim to postulate, develop and understand materials by simulation. The aggregated pairwise interactions of the material's particles lead to a deduction of its macroscopic behaviours. For practically meaningful macroscopic scales, a large amount of data are generated, leading to vast execution times. Simulation times of hours, days or weeks for moderately sized problems are not uncommon. The reduction of simulation times, improved result accuracy and the associated software and hardware engineering challenges are the main motivations for many of the ongoing researches in the computational sciences. This contribution is concerned mainly with simulations that can be done on a "standalone" computer based on Message Passing Interfaces (MPI), parallel code running on hardware platforms with wide specifications, such as single/multi- processor, multi-core machines with minimal reconfiguration for upward scaling of computational power. The widely available, documented and standardized MPI library provides this functionality through the MPI_Comm_size (), MPI_Comm_rank () and MPI_Reduce () functions. A survey of the literature shows that relatively little is written with respect to the efficient extraction of the inherent computational power in a cluster. In this work, we discuss the main avenues available to tap into this extra power without compromising computational accuracy. We also present methods to overcome the high inertia encountered in single-node-based computational molecular dynamics. We begin by surveying the current state of the art and discuss what it takes to achieve parallelism, efficiency and enhanced computational accuracy through program threads and message passing interfaces. Several code illustrations are given. The pros and cons of writing raw code as opposed to using heuristic, third-party code are also discussed. The growing trend towards graphical processor units and virtual computing clouds for high-performance computing is also discussed. Finally, we present the comparative results of vacancy formation energy calculations using our own parallelized standalone code called Verlet-Stormer velocity (VSV) operating on 30,000 copper atoms. The code is based on the Sutton-Chen implementation of the Finnis-Sinclair pairwise embedded atom potential. A link to the code is also given.
Large Scale Document Inversion using a Multi-threaded Computing System

PubMed Central

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2018-01-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701
Large Scale Document Inversion using a Multi-threaded Computing System.

PubMed

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2017-06-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.
Static and Dynamic Frequency Scaling on Multicore CPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer

2016-12-28

Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not account- ing for processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload,more » that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori of the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs show that our approach systematically outperforms the powersave Linux governor, while improving overall performance.« less
Advanced and flexible multi-carrier receiver architecture for high-count multi-core fiber based space division multiplexed applications

PubMed Central

Asif, Rameez

2016-01-01

Space division multiplexing (SDM), incorporating multi-core fibers (MCFs), has been demonstrated for effectively maximizing the data capacity in an impending capacity crunch. To achieve high spectral-density through multi-carrier encoding while simultaneously maintaining transmission reach, benefits from inter-core crosstalk (XT) and non-linear compensation must be utilized. In this report, we propose a proof-of-concept unified receiver architecture that jointly compensates optical Kerr effects, intra- and inter-core XT in MCFs. The architecture is analysed in multi-channel 512 Gbit/s dual-carrier DP-16QAM system over 800 km 19-core MCF to validate the digital compensation of inter-core XT. Through this architecture: (a) we efficiently compensates the inter-core XT improving Q-factor by 4.82 dB and (b) achieve a momentous gain in transmission reach, increasing the maximum achievable distance from 480 km to 1208 km, via analytical analysis. Simulation results confirm that inter-core XT distortions are more relentless for cores fabricated around the central axis of cladding. Predominantly, XT induced Q-penalty can be suppressed to be less than 1 dB up-to −11.56 dB of inter-core XT over 800 km MCF, offering flexibility to fabricate dense core structures with same cladding diameter. Moreover, this report outlines the relationship between core pitch and forward-error correction (FEC). PMID:27270381
Parallel Evolutionary Optimization for Neuromorphic Network Training

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schuman, Catherine D; Disney, Adam; Singh, Susheela

One of the key impediments to the success of current neuromorphic computing architectures is the issue of how best to program them. Evolutionary optimization (EO) is one promising programming technique; in particular, its wide applicability makes it especially attractive for neuromorphic architectures, which can have many different characteristics. In this paper, we explore different facets of EO on a spiking neuromorphic computing model called DANNA. We focus on the performance of EO in the design of our DANNA simulator, and on how to structure EO on both multicore and massively parallel computing systems. We evaluate how our parallel methods impactmore » the performance of EO on Titan, the U.S.'s largest open science supercomputer, and BOB, a Beowulf-style cluster of Raspberry Pi's. We also focus on how to improve the EO by evaluating commonality in higher performing neural networks, and present the result of a study that evaluates the EO performed by Titan.« less
Multicore job scheduling in the Worldwide LHC Computing Grid

NASA Astrophysics Data System (ADS)

Forti, A.; Pérez-Calero Yzquierdo, A.; Hartmann, T.; Alef, M.; Lahiff, A.; Templon, J.; Dal Pra, S.; Gila, M.; Skipsey, S.; Acosta-Silva, C.; Filipcic, A.; Walker, R.; Walker, C. J.; Traynor, D.; Gadrat, S.

2015-12-01

After the successful first run of the LHC, data taking is scheduled to restart in Summer 2015 with experimental conditions leading to increased data volumes and event complexity. In order to process the data generated in such scenario and exploit the multicore architectures of current CPUs, the LHC experiments have developed parallelized software for data reconstruction and simulation. However, a good fraction of their computing effort is still expected to be executed as single-core tasks. Therefore, jobs with diverse resources requirements will be distributed across the Worldwide LHC Computing Grid (WLCG), making workload scheduling a complex problem in itself. In response to this challenge, the WLCG Multicore Deployment Task Force has been created in order to coordinate the joint effort from experiments and WLCG sites. The main objective is to ensure the convergence of approaches from the different LHC Virtual Organizations (VOs) to make the best use of the shared resources in order to satisfy their new computing needs, minimizing any inefficiency originated from the scheduling mechanisms, and without imposing unnecessary complexities in the way sites manage their resources. This paper describes the activities and progress of the Task Force related to the aforementioned topics, including experiences from key sites on how to best use different batch system technologies, the evolution of workload submission tools by the experiments and the knowledge gained from scale tests of the different proposed job submission strategies.
Performance of adaptive DD-OFDM multicore fiber links and its relation with intercore crosstalk.

PubMed

Alves, Tiago M F; Luís, Ruben S; Puttnam, Benjamin J; Cartaxo, Adolfo V T; Awaji, Yoshinari; Wada, Naoya

2017-07-10

Adaptive direct-detection (DD) orthogonal frequency-division multiplexing (OFDM) is proposed to guarantee signal quality over time in weakly-coupled homogenous multicore fiber (MCFs) links impaired by stochastic intercore crosstalk (ICXT). For the first time, the received electrical power of the ICXT and the performance of the adaptive DD-OFDM MCF link are experimentally monitored quasi-simultaneously over a 210 hour period. Experimental results show that the time evolution of the error vector magnitude due to the ICXT can be suitably estimated from the normalized power of the detected crosstalk. The detected crosstalk results from the beating between the carrier in the test core and ICXT originating from the carrier and modulated signal from interfering core. The results show that the operation of DD-OFDM systems employing fixed modulation can be severely impaired by the presence of ICXT that may unpredictable vary in both power and frequency. The system may suffer from deleterious impact of moderate ICXT levels over a time duration of several hours or from peak ICXT levels occurring over a number of minutes. Such power fluctuations can lead to large variations in bit error ratio (BER) for static modulation schemes. Here, we show that BER fluctuations may be minimized by the use of adaptive modulation techniques and that in particular, the adaptive OFDM is a viable solution to guarantee link quality in MCF-based systems. An experimental model of an adaptive DD-OFDM MCF link shows an average throughput of 12 Gb/s that represents a reduction of only 9% compared to the maximum throughput measured without ICXT and an improvement of 23% relative to throughput obtained with static modulation.
The new landscape of parallel computer architecture

NASA Astrophysics Data System (ADS)

Shalf, John

2007-07-01

The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Reducing power usage on demand

NASA Astrophysics Data System (ADS)

Corbett, G.; Dewhurst, A.

2016-10-01

The Science and Technology Facilities Council (STFC) datacentre provides large- scale High Performance Computing facilities for the scientific community. It currently consumes approximately 1.5MW and this has risen by 25% in the past two years. STFC has been investigating leveraging preemption in the Tier 1 batch farm to save power. HEP experiments are increasing using jobs that can be killed to take advantage of opportunistic CPU resources or novel cost models such as Amazon's spot pricing. Additionally, schemes from energy providers are available that offer financial incentives to reduce power consumption at peak times. Under normal operating conditions, 3% of the batch farm capacity is wasted due to draining machines. By using preempt-able jobs, nodes can be rapidly made available to run multicore jobs without this wasted resource. The use of preempt-able jobs has been extended so that at peak times machines can be hibernated quickly to save energy. This paper describes the implementation of the above and demonstrates that STFC could in future take advantage of such energy saving schemes.
Advanced complex trait analysis.

PubMed

Gray, A; Stewart, I; Tenesa, A

2012-12-01

The Genome-wide Complex Trait Analysis (GCTA) software package can quantify the contribution of genetic variation to phenotypic variation for complex traits. However, as those datasets of interest continue to increase in size, GCTA becomes increasingly computationally prohibitive. We present an adapted version, Advanced Complex Trait Analysis (ACTA), demonstrating dramatically improved performance. We restructure the genetic relationship matrix (GRM) estimation phase of the code and introduce the highly optimized parallel Basic Linear Algebra Subprograms (BLAS) library combined with manual parallelization and optimization. We introduce the Linear Algebra PACKage (LAPACK) library into the restricted maximum likelihood (REML) analysis stage. For a test case with 8999 individuals and 279,435 single nucleotide polymorphisms (SNPs), we reduce the total runtime, using a compute node with two multi-core Intel Nehalem CPUs, from ∼17 h to ∼11 min. The source code is fully available under the GNU Public License, along with Linux binaries. For more information see http://www.epcc.ed.ac.uk/software-products/acta. a.gray@ed.ac.uk Supplementary data are available at Bioinformatics online.
Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gorentla Venkata, Manjunath; Shamis, Pavel; Graham, Richard L

2013-01-01

Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI Allreduce and MPI Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for various communication mechanisms in the system 2) providing the ability to configure the depth ofmore » hierarchy to match the system architecture, and 3) providing the ability to independently progress each of this hierarchy. Using this design, we implement MPI Allreduce and MPI Reduce operations (and its nonblocking variants MPI Iallreduce and MPI Ireduce) for all message sizes, and evaluate on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, which is a framework for implementing hierarchical collective operations to implement these reductions. The experimental results show that the Cheetah reduction operations outperform the production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating its efficiency, flexibility and portability. On Infini- Band systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieves a speedup of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations also show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x, compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms the Cray MPI by 145%. The evaluation with an application kernel, Conjugate Gradient solver, shows that the Cheetah reductions speeds up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.« less
Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

NASA Astrophysics Data System (ADS)

Bernaschi, M.; Bisson, M.; Salvadore, F.

2014-10-01

We present and compare the performances of two many-core architectures: the Nvidia Kepler and the Intel MIC both in a single system and in cluster configuration for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the Over-relaxation algorithm. We present data also for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performances of a Intel MIC change dramatically depending on (apparently) minor details. Another issue is that to obtain a reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration it is necessary to use the so-called offload mode which reduces the performances of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.
Six-port optical switch for cluster-mesh photonic network-on-chip

NASA Astrophysics Data System (ADS)

Jia, Hao; Zhou, Ting; Zhao, Yunchou; Xia, Yuhao; Dai, Jincheng; Zhang, Lei; Ding, Jianfeng; Fu, Xin; Yang, Lin

2018-05-01

Photonic network-on-chip for high-performance multi-core processors has attracted substantial interest in recent years as it offers a systematic method to meet the demand of large bandwidth, low latency and low power dissipation. In this paper we demonstrate a non-blocking six-port optical switch for cluster-mesh photonic network-on-chip. The architecture is constructed by substituting three optical switching units of typical Spanke-Benes network to optical waveguide crossings. Compared with Spanke-Benes network, the number of optical switching units is reduced by 20%, while the connectivity of routing path is maintained. By this way the footprint and power consumption can be reduced at the expense of sacrificing the network latency performance in some cases. The device is realized by 12 thermally tuned silicon Mach-Zehnder optical switching units. Its theoretical spectral responses are evaluated by establishing a numerical model. The experimental spectral responses are also characterized, which indicates that the optical signal-to-noise ratios of the optical switch are larger than 13.5 dB in the wavelength range from 1525 nm to 1565 nm. Data transmission experiment with the data rate of 32 Gbps is implemented for each optical link.
Software defined multi-spectral imaging for Arctic sensor networks

NASA Astrophysics Data System (ADS)

Siewert, Sam; Angoth, Vivek; Krishnamurthy, Ramnarayan; Mani, Karthikeyan; Mock, Kenrick; Singh, Surjith B.; Srivistava, Saurav; Wagner, Chris; Claus, Ryan; Vis, Matthew Demi

2016-05-01

Availability of off-the-shelf infrared sensors combined with high definition visible cameras has made possible the construction of a Software Defined Multi-Spectral Imager (SDMSI) combining long-wave, near-infrared and visible imaging. The SDMSI requires a real-time embedded processor to fuse images and to create real-time depth maps for opportunistic uplink in sensor networks. Researchers at Embry Riddle Aeronautical University working with University of Alaska Anchorage at the Arctic Domain Awareness Center and the University of Colorado Boulder have built several versions of a low-cost drop-in-place SDMSI to test alternatives for power efficient image fusion. The SDMSI is intended for use in field applications including marine security, search and rescue operations and environmental surveys in the Arctic region. Based on Arctic marine sensor network mission goals, the team has designed the SDMSI to include features to rank images based on saliency and to provide on camera fusion and depth mapping. A major challenge has been the design of the camera computing system to operate within a 10 to 20 Watt power budget. This paper presents a power analysis of three options: 1) multi-core, 2) field programmable gate array with multi-core, and 3) graphics processing units with multi-core. For each test, power consumed for common fusion workloads has been measured at a range of frame rates and resolutions. Detailed analyses from our power efficiency comparison for workloads specific to stereo depth mapping and sensor fusion are summarized. Preliminary mission feasibility results from testing with off-the-shelf long-wave infrared and visible cameras in Alaska and Arizona are also summarized to demonstrate the value of the SDMSI for applications such as ice tracking, ocean color, soil moisture, animal and marine vessel detection and tracking. The goal is to select the most power efficient solution for the SDMSI for use on UAVs (Unoccupied Aerial Vehicles) and other drop-in-place installations in the Arctic. The prototype selected will be field tested in Alaska in the summer of 2016.
Fault Tolerance Middleware for a Multi-Core System

NASA Technical Reports Server (NTRS)

Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

2012-01-01

Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the introspection module that it can use in generating its next response. For example, if the responder triggers an application rollback and errors are still occurring, the introspection module may decide to recommend an application restart.

Improving the energy efficiency of sparse linear system solvers on multicore and manycore systems.

PubMed

Anzt, H; Quintana-Ortí, E S

2014-06-28

While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draft and energy spent for this purpose is increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draft as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyse the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS

NASA Astrophysics Data System (ADS)

Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.

2018-03-01

Simulation of breaking waves by using Navier-Stokes equation via moving particle semi-implicit method (MPS) over close domain is given. The results show the parallel computing on multicore architecture using OpenMP platform can reduce the computational time almost half of the serial time. Here, the comparison using two computer architectures (AMD and Intel) are performed. The results using Intel architecture is shown better than AMD architecture in CPU time. However, in efficiency, the computer with AMD architecture gives slightly higher than the Intel. For the simulation by 1512 number of particles, the CPU time using Intel and AMD are 12662.47 and 28282.30 respectively. Moreover, the efficiency using similar number of particles, AMD obtains 50.09 % and Intel up to 49.42 %.
AnRAD: A Neuromorphic Anomaly Detection Framework for Massive Concurrent Data Streams.

PubMed

Chen, Qiuwen; Luley, Ryan; Wu, Qing; Bishop, Morgan; Linderman, Richard W; Qiu, Qinru

2018-05-01

The evolution of high performance computing technologies has enabled the large-scale implementation of neuromorphic models and pushed the research in computational intelligence into a new era. Among the machine learning applications, unsupervised detection of anomalous streams is especially challenging due to the requirements of detection accuracy and real-time performance. Designing a computing framework that harnesses the growing computing power of the multicore systems while maintaining high sensitivity and specificity to the anomalies is an urgent research topic. In this paper, we propose anomaly recognition and detection (AnRAD), a bioinspired detection framework that performs probabilistic inferences. We analyze the feature dependency and develop a self-structuring method that learns an efficient confabulation network using unlabeled data. This network is capable of fast incremental learning, which continuously refines the knowledge base using streaming data. Compared with several existing anomaly detection approaches, our method provides competitive detection quality. Furthermore, we exploit the massive parallel structure of the AnRAD framework. Our implementations of the detection algorithm on the graphic processing unit and the Xeon Phi coprocessor both obtain substantial speedups over the sequential implementation on general-purpose microprocessor. The framework provides real-time service to concurrent data streams within diversified knowledge contexts, and can be applied to large problems with multiple local patterns. Experimental results demonstrate high computing performance and memory efficiency. For vehicle behavior detection, the framework is able to monitor up to 16000 vehicles (data streams) and their interactions in real time with a single commodity coprocessor, and uses less than 0.2 ms for one testing subject. Finally, the detection network is ported to our spiking neural network simulator to show the potential of adapting to the emerging neuromorphic architectures.
BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics.

PubMed

Ayres, Daniel L; Darling, Aaron; Zwickl, Derrick J; Beerli, Peter; Holder, Mark T; Lewis, Paul O; Huelsenbeck, John P; Ronquist, Fredrik; Swofford, David L; Cummings, Michael P; Rambaut, Andrew; Suchard, Marc A

2012-01-01

Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.
BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics

PubMed Central

Ayres, Daniel L.; Darling, Aaron; Zwickl, Derrick J.; Beerli, Peter; Holder, Mark T.; Lewis, Paul O.; Huelsenbeck, John P.; Ronquist, Fredrik; Swofford, David L.; Cummings, Michael P.; Rambaut, Andrew; Suchard, Marc A.

2012-01-01

Abstract Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software. PMID:21963610
Kalman Filter Tracking on Parallel Architectures

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2015-12-01

Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques including Cellular Automata or returning to Hough Transform. The most common track finding techniques in use today are however those based on the Kalman Filter [2]. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust and are exactly those being used today for the design of the tracking system for HL-LHC. Our previous investigations showed that, using optimized data structures, track fitting with Kalman Filter can achieve large speedup both with Intel Xeon and Xeon Phi. We report here our further progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a realistic simulation setup.
Development of small scale cluster computer for numerical analysis

NASA Astrophysics Data System (ADS)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processor involved are multicore processor which has four cores in it, thus made this cluster to have eight processors. Here, the cluster incorporate Ubuntu 14.04 LINUX environment with MPI implementation (MPICH2). Two main tests were conducted in order to test the cluster, which is communication test and performance test. The communication test was done to make sure that the computers are able to pass the required information without any problem and were done by using simple MPI Hello Program where the program written in C language. Additional, performance test was also done to prove that this cluster calculation performance is much better than single CPU computer. In this performance test, four tests were done by running the same code by using single node, 2 processors, 4 processors, and 8 processors. The result shows that with additional processors, the time required to solve the problem decrease. Time required for the calculation shorten to half when we double the processors. To conclude, we successfully develop a small scale cluster computer using common hardware which capable of higher computing power when compare to single CPU processor, and this can be beneficial for research that require high computing power especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
Evaluation of Emerging Energy-Efficient Heterogeneous Computing Platforms for Biomolecular and Cellular Simulation Workloads.

PubMed

Stone, John E; Hallock, Michael J; Phillips, James C; Peterson, Joseph R; Luthey-Schulten, Zaida; Schulten, Klaus

2016-05-01

Many of the continuing scientific advances achieved through computational biology are predicated on the availability of ongoing increases in computational power required for detailed simulation and analysis of cellular processes on biologically-relevant timescales. A critical challenge facing the development of future exascale supercomputer systems is the development of new computing hardware and associated scientific applications that dramatically improve upon the energy efficiency of existing solutions, while providing increased simulation, analysis, and visualization performance. Mobile computing platforms have recently become powerful enough to support interactive molecular visualization tasks that were previously only possible on laptops and workstations, creating future opportunities for their convenient use for meetings, remote collaboration, and as head mounted displays for immersive stereoscopic viewing. We describe early experiences adapting several biomolecular simulation and analysis applications for emerging heterogeneous computing platforms that combine power-efficient system-on-chip multi-core CPUs with high-performance massively parallel GPUs. We present low-cost power monitoring instrumentation that provides sufficient temporal resolution to evaluate the power consumption of individual CPU algorithms and GPU kernels. We compare the performance and energy efficiency of scientific applications running on emerging platforms with results obtained on traditional platforms, identify hardware and algorithmic performance bottlenecks that affect the usability of these platforms, and describe avenues for improving both the hardware and applications in pursuit of the needs of molecular modeling tasks on mobile devices and future exascale computers.
An auxiliary graph based dynamic traffic grooming algorithm in spatial division multiplexing enabled elastic optical networks with multi-core fibers

NASA Astrophysics Data System (ADS)

Zhao, Yongli; Tian, Rui; Yu, Xiaosong; Zhang, Jiawei; Zhang, Jie

2017-03-01

A proper traffic grooming strategy in dynamic optical networks can improve the utilization of bandwidth resources. An auxiliary graph (AG) is designed to solve the traffic grooming problem under a dynamic traffic scenario in spatial division multiplexing enabled elastic optical networks (SDM-EON) with multi-core fibers. Five traffic grooming policies achieved by adjusting the edge weights of an AG are proposed and evaluated through simulation: maximal electrical grooming (MEG), maximal optical grooming (MOG), maximal SDM grooming (MSG), minimize virtual hops (MVH), and minimize physical hops (MPH). Numeric results show that each traffic grooming policy has its own features. Among different traffic grooming policies, an MPH policy can achieve the lowest bandwidth blocking ratio, MEG can save the most transponders, and MSG can obtain the fewest cores for each request.
muBLASTP: database-indexed protein sequence search on multicore CPUs.

PubMed

Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

2016-11-04

The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.
Fault-Tolerant, Real-Time, Multi-Core Computer System

NASA Technical Reports Server (NTRS)

Gostelow, Kim P.

2012-01-01

A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
Achieving High Performance With TCP Over 40 GbE on NUMA Architectures for CMS Data Acquisition

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bawej, Tomasz; et al.

2014-01-01

TCP and the socket abstraction have barely changed over the last two decades, but at the network layer there has been a giant leap from a few megabits to 100 gigabits in bandwidth. At the same time, CPU architectures have evolved into the multicore era and applications are expected to make full use of all available resources. Applications in the data acquisition domain based on the standard socket library running in a Non-Uniform Memory Access (NUMA) architecture are unable to reach full efficiency and scalability without the software being adequately aware about the IRQ (Interrupt Request), CPU and memory affinities.more » During the first long shutdown of LHC, the CMS DAQ system is going to be upgraded for operation from 2015 onwards and a new software component has been designed and developed in the CMS online framework for transferring data with sockets. This software attempts to wrap the low-level socket library to ease higher-level programming with an API based on an asynchronous event driven model similar to the DAT uDAPL API. It is an event-based application with NUMA optimizations, that allows for a high throughput of data across a large distributed system. This paper describes the architecture, the technologies involved and the performance measurements of the software in the context of the CMS distributed event building.« less
Science Notes.

ERIC Educational Resources Information Center

School Science Review, 1990

1990-01-01

Presented are 25 science activities on colorations of prey, evolution, blood, physiology, nutrition, enzyme kinetics, leaf pigments, analytical chemistry, milk, proteins, fermentation, surface effects of liquids, magnetism, drug synthesis, solvents, wintergreen synthesis, chemical reactions, multicore cables, diffraction, air resistance,…
Medical Implications of Space Radiation Exposure Due to Low-Altitude Polar Orbits.

PubMed

Chancellor, Jeffery C; Auñon-Chancellor, Serena M; Charles, John

2018-01-01

Space radiation research has progressed rapidly in recent years, but there remain large uncertainties in predicting and extrapolating biological responses to humans. Exposure to cosmic radiation and solar particle events (SPEs) may pose a critical health risk to future spaceflight crews and can have a serious impact on all biomedical aspects of space exploration. The relatively minimal shielding of the cancelled 1960s Manned Orbiting Laboratory (MOL) program's space vehicle and the high inclination polar orbits would have left the crew susceptible to high exposures of cosmic radiation and high dose-rate SPEs that are mostly unpredictable in frequency and intensity. In this study, we have modeled the nominal and off-nominal radiation environment that a MOL-like spacecraft vehicle would be exposed to during a 30-d mission using high performance, multicore computers. Projected doses from a historically large SPE (e.g., the August 1972 solar event) have been analyzed in the context of the MOL orbit profile, providing an opportunity to study its impact to crew health and subsequent contingencies. It is reasonable to presume that future commercial, government, and military spaceflight missions in low-Earth orbit (LEO) will have vehicles with similar shielding and orbital profiles. Studying the impact of cosmic radiation to the mission's operational integrity and the health of MOL crewmembers provides an excellent surrogate and case-study for future commercial and military spaceflight missions.Chancellor JC, Auñon-Chancellor SM, Charles J. Medical implications of space radiation exposure due to low-altitude polar orbits. Aerosp Med Hum Perform. 2018; 89(1):3-8.
Parallelization of combinatorial search when solving knapsack optimization problem on computing systems based on multicore processors

NASA Astrophysics Data System (ADS)

Rahman, P. A.

2018-05-01

This scientific paper deals with the model of the knapsack optimization problem and method of its solving based on directed combinatorial search in the boolean space. The offered by the author specialized mathematical model of decomposition of the search-zone to the separate search-spheres and the algorithm of distribution of the search-spheres to the different cores of the multi-core processor are also discussed. The paper also provides an example of decomposition of the search-zone to the several search-spheres and distribution of the search-spheres to the different cores of the quad-core processor. Finally, an offered by the author formula for estimation of the theoretical maximum of the computational acceleration, which can be achieved due to the parallelization of the search-zone to the search-spheres on the unlimited number of the processor cores, is also given.
Passing in Command Line Arguments and Parallel Cluster/Multicore Batching in R with batch.

PubMed

Hoffmann, Thomas J

2011-03-01

It is often useful to rerun a command line R script with some slight change in the parameters used to run it - a new set of parameters for a simulation, a different dataset to process, etc. The R package batch provides a means to pass in multiple command line options, including vectors of values in the usual R format, easily into R. The same script can be setup to run things in parallel via different command line arguments. The R package batch also provides a means to simplify this parallel batching by allowing one to use R and an R-like syntax for arguments to spread a script across a cluster or local multicore/multiprocessor computer, with automated syntax for several popular cluster types. Finally it provides a means to aggregate the results together of multiple processes run on a cluster.
Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

PubMed

Wu, Xin; Koslowski, Axel; Thiel, Walter

2012-07-10

In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
Virtual optical network mapping and core allocation in elastic optical networks using multi-core fibers

NASA Astrophysics Data System (ADS)

Xuan, Hejun; Wang, Yuping; Xu, Zhanqi; Hao, Shanshan; Wang, Xiaoli

2017-11-01

Virtualization technology can greatly improve the efficiency of the networks by allowing the virtual optical networks to share the resources of the physical networks. However, it will face some challenges, such as finding the efficient strategies for virtual nodes mapping, virtual links mapping and spectrum assignment. It is even more complex and challenging when the physical elastic optical networks using multi-core fibers. To tackle these challenges, we establish a constrained optimization model to determine the optimal schemes of optical network mapping, core allocation and spectrum assignment. To solve the model efficiently, tailor-made encoding scheme, crossover and mutation operators are designed. Based on these, an efficient genetic algorithm is proposed to obtain the optimal schemes of the virtual nodes mapping, virtual links mapping, core allocation. The simulation experiments are conducted on three widely used networks, and the experimental results show the effectiveness of the proposed model and algorithm.
Adaptive multiphoton endomicroscopy through a dynamically deformed multicore optical fiber using proximal detection.

PubMed

Warren, Sean C; Kim, Youngchan; Stone, James M; Mitchell, Claire; Knight, Jonathan C; Neil, Mark A A; Paterson, Carl; French, Paul M W; Dunsby, Chris

2016-09-19

This paper demonstrates multiphoton excited fluorescence imaging through a polarisation maintaining multicore fiber (PM-MCF) while the fiber is dynamically deformed using all-proximal detection. Single-shot proximal measurement of the relative optical path lengths of all the cores of the PM-MCF in double pass is achieved using a Mach-Zehnder interferometer read out by a scientific CMOS camera operating at 416 Hz. A non-linear least squares fitting procedure is then employed to determine the deformation-induced lateral shift of the excitation spot at the distal tip of the PM-MCF. An experimental validation of this approach is presented that compares the proximally measured deformation-induced lateral shift in focal spot position to an independent distally measured ground truth. The proximal measurement of deformation-induced shift in focal spot position is applied to correct for deformation-induced shifts in focal spot position during raster-scanning multiphoton excited fluorescence imaging.
On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences

PubMed Central

Thiyagalingam, Jeyarajan; Goodman, Daniel; Schnabel, Julia A.; Trefethen, Anne; Grau, Vicente

2011-01-01

Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in resolution, dimensionality of the images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. Towards this, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate the mapping of an enhanced motion estimation algorithm to novel GPU-specific architectures, the resulting challenges and benefits therein. Using a database of three-dimensional image sequences, we show that the mapping leads to substantial performance gains, up to a factor of 60, and can provide near-real-time experience. We also show how architectural peculiarities of these devices can be best exploited in the benefit of algorithms, most specifically for addressing the challenges related to their access patterns and different memory configurations. Finally, we evaluate the performance of the algorithm on three different GPU architectures and perform a comprehensive analysis of the results. PMID:21869880

A hybrid algorithm for parallel molecular dynamics simulations

NASA Astrophysics Data System (ADS)

Mangiardi, Chris M.; Meyer, R.

2017-10-01

This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
A fast ultrasonic simulation tool based on massively parallel implementations

NASA Astrophysics Data System (ADS)

Lambert, Jason; Rougeron, Gilles; Lacassagne, Lionel; Chatillon, Sylvain

2014-02-01

This paper presents a CIVA optimized ultrasonic inspection simulation tool, which takes benefit of the power of massively parallel architectures: graphical processing units (GPU) and multi-core general purpose processors (GPP). This tool is based on the classical approach used in CIVA: the interaction model is based on Kirchoff, and the ultrasonic field around the defect is computed by the pencil method. The model has been adapted and parallelized for both architectures. At this stage, the configurations addressed by the tool are : multi and mono-element probes, planar specimens made of simple isotropic materials, planar rectangular defects or side drilled holes of small diameter. Validations on the model accuracy and performances measurements are presented.
Damaris: Addressing performance variability in data management for post-petascale simulations

DOE PAGES

Dorier, Matthieu; Antoniu, Gabriel; Cappello, Franck; ...

2016-10-01

With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. Here, this variability significantly impacts the overall application performance at scale and its predictability over time. In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamic simulation on four platforms, including NICS’s Kraken and NCSA’s Blue Waters.more » Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, thus making simulation performance predictable; (2) it increases the sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches that fail to scale; and (4) it enables a seamless connection to the VisIt visualization software to perform in situ analysis and visualization in a way that impacts neither the performance of the simulation nor its variability. In addition, we extended our implementation of Damaris to also support the use of dedicated nodes and conducted a thorough comparison of the two approaches—dedicated cores and dedicated nodes—for I/O tasks with the aforementioned applications.« less
Damaris: Addressing performance variability in data management for post-petascale simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dorier, Matthieu; Antoniu, Gabriel; Cappello, Franck

With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. Here, this variability significantly impacts the overall application performance at scale and its predictability over time. In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamic simulation on four platforms, including NICS’s Kraken and NCSA’s Blue Waters.more » Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, thus making simulation performance predictable; (2) it increases the sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches that fail to scale; and (4) it enables a seamless connection to the VisIt visualization software to perform in situ analysis and visualization in a way that impacts neither the performance of the simulation nor its variability. In addition, we extended our implementation of Damaris to also support the use of dedicated nodes and conducted a thorough comparison of the two approaches—dedicated cores and dedicated nodes—for I/O tasks with the aforementioned applications.« less
FPGA Acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods.

PubMed

Zierke, Stephanie; Bakos, Jason D

2010-04-12

Likelihood (ML)-based phylogenetic inference has become a popular method for estimating the evolutionary relationships among species based on genomic sequence data. This method is used in applications such as RAxML, GARLI, MrBayes, PAML, and PAUP. The Phylogenetic Likelihood Function (PLF) is an important kernel computation for this method. The PLF consists of a loop with no conditional behavior or dependencies between iterations. As such it contains a high potential for exploiting parallelism using micro-architectural techniques. In this paper, we describe a technique for mapping the PLF and supporting logic onto a Field Programmable Gate Array (FPGA)-based co-processor. By leveraging the FPGA's on-chip DSP modules and the high-bandwidth local memory attached to the FPGA, the resultant co-processor can accelerate ML-based methods and outperform state-of-the-art multi-core processors. We use the MrBayes 3 tool as a framework for designing our co-processor. For large datasets, we estimate that our accelerated MrBayes, if run on a current-generation FPGA, achieves a 10x speedup relative to software running on a state-of-the-art server-class microprocessor. The FPGA-based implementation achieves its performance by deeply pipelining the likelihood computations, performing multiple floating-point operations in parallel, and through a natural log approximation that is chosen specifically to leverage a deeply pipelined custom architecture. Heterogeneous computing, which combines general-purpose processors with special-purpose co-processors such as FPGAs and GPUs, is a promising approach for high-performance phylogeny inference as shown by the growing body of literature in this field. FPGAs in particular are well-suited for this task because of their low power consumption as compared to many-core processors and Graphics Processor Units (GPUs).
Monitoring Evolution at CERN

NASA Astrophysics Data System (ADS)

Andrade, P.; Fiorini, B.; Murphy, S.; Pigueiras, L.; Santos, M.

2015-12-01

Over the past two years, the operation of the CERN Data Centres went through significant changes with the introduction of new mechanisms for hardware procurement, new services for cloud provisioning and configuration management, among other improvements. These changes resulted in an increase of resources being operated in a more dynamic environment. Today, the CERN Data Centres provide over 11000 multi-core processor servers, 130 PB disk servers, 100 PB tape robots, and 150 high performance tape drives. To cope with these developments, an evolution of the data centre monitoring tools was also required. This modernisation was based on a number of guiding rules: sustain the increase of resources, adapt to the new dynamic nature of the data centres, make monitoring data easier to share, give more flexibility to Service Managers on how they publish and consume monitoring metrics and logs, establish a common repository of monitoring data, optimise the handling of monitoring notifications, and replace the previous toolset by new open source technologies with large adoption and community support. This contribution describes how these improvements were delivered, present the architecture and technologies of the new monitoring tools, and review the experience of its production deployment.
Aeroacoustic Codes For Rotor Harmonic and BVI Noise--CAMRAD.Mod1/HIRES

NASA Technical Reports Server (NTRS)

Brooks, Thomas F.; Boyd, D. Douglas, Jr.; Burley, Casey L.; Jolly, J. Ralph, Jr.

1996-01-01

This paper presents a status of non-CFD aeroacoustic codes at NASA Langley Research Center for the prediction of helicopter harmonic and Blade-Vortex Interaction (BVI) noise. The prediction approach incorporates three primary components: CAMRAD.Mod1 - a substantially modified version of the performance/trim/wake code CAMRAD; HIRES - a high resolution blade loads post-processor; and WOPWOP - an acoustic code. The functional capabilities and physical modeling in CAMRAD.Mod1/HIRES will be summarized and illustrated. A new multi-core roll-up wake modeling approach is introduced and validated. Predictions of rotor wake and radiated noise are compared with to the results of the HART program, a model BO-105 windtunnel test at the DNW in Europe. Additional comparisons are made to results from a DNW test of a contemporary design four-bladed rotor, as well as from a Langley test of a single proprotor (tiltrotor) three-bladed model configuration. Because the method is shown to help eliminate the necessity of guesswork in setting code parameters between different rotor configurations, it should prove useful as a rotor noise design tool.
C3: A Command-line Catalogue Cross-matching tool for modern astrophysical survey data

NASA Astrophysics Data System (ADS)

Riccio, Giuseppe; Brescia, Massimo; Cavuoti, Stefano; Mercurio, Amata; di Giorgio, Anna Maria; Molinari, Sergio

2017-06-01

In the current data-driven science era, it is needed that data analysis techniques has to quickly evolve to face with data whose dimensions has increased up to the Petabyte scale. In particular, being modern astrophysics based on multi-wavelength data organized into large catalogues, it is crucial that the astronomical catalog cross-matching methods, strongly dependant from the catalogues size, must ensure efficiency, reliability and scalability. Furthermore, multi-band data are archived and reduced in different ways, so that the resulting catalogues may differ each other in formats, resolution, data structure, etc, thus requiring the highest generality of cross-matching features. We present C 3 (Command-line Catalogue Cross-match), a multi-platform application designed to efficiently cross-match massive catalogues from modern surveys. Conceived as a stand-alone command-line process or a module within generic data reduction/analysis pipeline, it provides the maximum flexibility, in terms of portability, configuration, coordinates and cross-matching types, ensuring high performance capabilities by using a multi-core parallel processing paradigm and a sky partitioning algorithm.
Experimental demonstration of high spectral efficient 4 × 4 MIMO SCMA-OFDM/OQAM radio over multi-core fiber system.

PubMed

Liu, Chang; Deng, Lei; He, Jiale; Li, Di; Fu, Songnian; Tang, Ming; Cheng, Mengfan; Liu, Deming

2017-07-24

In this paper, 4 × 4 multiple-input multiple-output (MIMO) radio over 7-core fiber system based on sparse code multiple access (SCMA) and OFDM/OQAM techniques is proposed. No cyclic prefix (CP) is required by properly designing the prototype filters in OFDM/OQAM modulator, and non-orthogonally overlaid codewords by using SCMA is help to serve more users simultaneously under the condition of using equal number of time and frequency resources compared with OFDMA, resulting in the increase of spectral efficiency (SE) and system capacity. In our experiment, 11.04 Gb/s 4 × 4 MIMO SCMA-OFDM/OQAM signal is successfully transmitted over 20 km 7-core fiber and 0.4 m air distance in both uplink and downlink. As a comparison, 6.681 Gb/s traditional MIMO-OFDM signal with the same occupied bandwidth has been evaluated for both uplink and downlink transmission. The experimental results show that SE could be increased by 65.2% with no bit error rate (BER) performance degradation compared with the traditional MIMO-OFDM technique.
Real-Time Three-Dimensional Cell Segmentation in Large-Scale Microscopy Data of Developing Embryos.

PubMed

Stegmaier, Johannes; Amat, Fernando; Lemon, William C; McDole, Katie; Wan, Yinan; Teodoro, George; Mikut, Ralf; Keller, Philipp J

2016-01-25

We present the Real-time Accurate Cell-shape Extractor (RACE), a high-throughput image analysis framework for automated three-dimensional cell segmentation in large-scale images. RACE is 55-330 times faster and 2-5 times more accurate than state-of-the-art methods. We demonstrate the generality of RACE by extracting cell-shape information from entire Drosophila, zebrafish, and mouse embryos imaged with confocal and light-sheet microscopes. Using RACE, we automatically reconstructed cellular-resolution tissue anisotropy maps across developing Drosophila embryos and quantified differences in cell-shape dynamics in wild-type and mutant embryos. We furthermore integrated RACE with our framework for automated cell lineaging and performed joint segmentation and cell tracking in entire Drosophila embryos. RACE processed these terabyte-sized datasets on a single computer within 1.4 days. RACE is easy to use, as it requires adjustment of only three parameters, takes full advantage of state-of-the-art multi-core processors and graphics cards, and is available as open-source software for Windows, Linux, and Mac OS. Copyright © 2016 Elsevier Inc. All rights reserved.
Programming for 1.6 Millon cores: Early experiences with IBM's BG/Q SMP architecture

NASA Astrophysics Data System (ADS)

Glosli, James

2013-03-01

With the stall in clock cycle improvements a decade ago, the drive for computational performance has continues along a path of increasing core counts on a processor. The multi-core evolution has been expressed in both a symmetric multi processor (SMP) architecture and cpu/GPU architecture. Debates rage in the high performance computing (HPC) community which architecture best serves HPC. In this talk I will not attempt to resolve that debate but perhaps fuel it. I will discuss the experience of exploiting Sequoia, a 98304 node IBM Blue Gene/Q SMP at Lawrence Livermore National Laboratory. The advantages and challenges of leveraging the computational power BG/Q will be detailed through the discussion of two applications. The first application is a Molecular Dynamics code called ddcMD. This is a code developed over the last decade at LLNL and ported to BG/Q. The second application is a cardiac modeling code called Cardioid. This is a code that was recently designed and developed at LLNL to exploit the fine scale parallelism of BG/Q's SMP architecture. Through the lenses of these efforts I'll illustrate the need to rethink how we express and implement our computational approaches. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Performance Comparison of EPICS IOC and MARTe in a Hard Real-Time Control Application

NASA Astrophysics Data System (ADS)

Barbalace, Antonio; Manduchi, Gabriele; Neto, A.; De Tommasi, G.; Sartori, F.; Valcarcel, D. F.

2011-12-01

EPICS is used worldwide mostly for controlling accelerators and large experimental physics facilities. Although EPICS is well fit for the design and development of automation systems, which are typically VME or PLC-based systems, and for soft real-time systems, it may present several drawbacks when used to develop hard real-time systems/applications especially when general purpose operating systems as plain Linux are chosen. This is in particular true in fusion research devices typically employing several hard real-time systems, such as the magnetic control systems, that may require strict determinism, and high performance in terms of jitter and latency. Serious deterioration of important plasma parameters may happen otherwise, possibly leading to an abrupt termination of the plasma discharge. The MARTe framework has been recently developed to fulfill the demanding requirements for such real-time systems that are alike to run on general purpose operating systems, possibly integrated with the low-latency real-time preemption patches. MARTe has been adopted to develop a number of real-time systems in different Tokamaks. In this paper, we first summarize differences and similarities between EPICS IOC and MARTe. Then we report on a set of performance measurements executed on an x86 64 bit multicore machine running Linux with an IO control algorithm implemented in an EPICS IOC and in MARTe.
High-Performance Agent-Based Modeling Applied to Vocal Fold Inflammation and Repair.

PubMed

Seekhao, Nuttiiya; Shung, Caroline; JaJa, Joseph; Mongeau, Luc; Li-Jessen, Nicole Y K

2018-01-01

Fast and accurate computational biology models offer the prospect of accelerating the development of personalized medicine. A tool capable of estimating treatment success can help prevent unnecessary and costly treatments and potential harmful side effects. A novel high-performance Agent-Based Model (ABM) was adopted to simulate and visualize multi-scale complex biological processes arising in vocal fold inflammation and repair. The computational scheme was designed to organize the 3D ABM sub-tasks to fully utilize the resources available on current heterogeneous platforms consisting of multi-core CPUs and many-core GPUs. Subtasks are further parallelized and convolution-based diffusion is used to enhance the performance of the ABM simulation. The scheme was implemented using a client-server protocol allowing the results of each iteration to be analyzed and visualized on the server (i.e., in-situ ) while the simulation is running on the same server. The resulting simulation and visualization software enables users to interact with and steer the course of the simulation in real-time as needed. This high-resolution 3D ABM framework was used for a case study of surgical vocal fold injury and repair. The new framework is capable of completing the simulation, visualization and remote result delivery in under 7 s per iteration, where each iteration of the simulation represents 30 min in the real world. The case study model was simulated at the physiological scale of a human vocal fold. This simulation tracks 17 million biological cells as well as a total of 1.7 billion signaling chemical and structural protein data points. The visualization component processes and renders all simulated biological cells and 154 million signaling chemical data points. The proposed high-performance 3D ABM was verified through comparisons with empirical vocal fold data. Representative trends of biomarker predictions in surgically injured vocal folds were observed.
Research on Optical Transmitter and Receiver Module Used for High-Speed Interconnection between CPU and Memory

NASA Astrophysics Data System (ADS)

He, Huimin; Liu, Fengman; Li, Baoxia; Xue, Haiyun; Wang, Haidong; Qiu, Delong; Zhou, Yunyan; Cao, Liqiang

2016-11-01

With the development of the multicore processor, the bandwidth and capacity of the memory, rather than the memory area, are the key factors in server performance. At present, however, the new architectures, such as fully buffered DIMM (FBDIMM), hybrid memory cube (HMC), and high bandwidth memory (HBM), cannot be commercially applied in the server. Therefore, a new architecture for the server is proposed. CPU and memory are separated onto different boards, and optical interconnection is used for the communication between them. Each optical module corresponds to each dual inline memory module (DIMM) with 64 channels. Compared to the previous technology, not only can the architecture realize high-capacity and wide-bandwidth memory, it also can reduce power consumption and cost, and be compatible with the existing dynamic random access memory (DRAM). In this article, the proposed module with system-in-package (SiP) integration is demonstrated. In the optical module, the silicon photonic chip is included, which is a promising technology to be applied in the next-generation data exchanging centers. And due to the bandwidth-distance performance of the optical interconnection, SerDes chips are introduced to convert the 64-bit data at 800 Mbps from/to 4-channel data at 12.8 Gbps after/before they are transmitted though optical fiber. All the devices are packaged on cheap organic substrates. To ensure the performance of the whole system, several optimization efforts have been performed on the two modules. High-speed interconnection traces have been designed and simulated with electromagnetic simulation software. Steady-state thermal characteristics of the transceiver module have been evaluated by ANSYS APLD based on finite-element methodology (FEM). Heat sinks are placed at the hotspot area to ensure the reliability of all working chips. Finally, this transceiver system based on silicon photonics is measured, and the eye diagrams of data and clock signals are verified.
High-Performance Agent-Based Modeling Applied to Vocal Fold Inflammation and Repair

PubMed Central

Seekhao, Nuttiiya; Shung, Caroline; JaJa, Joseph; Mongeau, Luc; Li-Jessen, Nicole Y. K.

2018-01-01

Fast and accurate computational biology models offer the prospect of accelerating the development of personalized medicine. A tool capable of estimating treatment success can help prevent unnecessary and costly treatments and potential harmful side effects. A novel high-performance Agent-Based Model (ABM) was adopted to simulate and visualize multi-scale complex biological processes arising in vocal fold inflammation and repair. The computational scheme was designed to organize the 3D ABM sub-tasks to fully utilize the resources available on current heterogeneous platforms consisting of multi-core CPUs and many-core GPUs. Subtasks are further parallelized and convolution-based diffusion is used to enhance the performance of the ABM simulation. The scheme was implemented using a client-server protocol allowing the results of each iteration to be analyzed and visualized on the server (i.e., in-situ) while the simulation is running on the same server. The resulting simulation and visualization software enables users to interact with and steer the course of the simulation in real-time as needed. This high-resolution 3D ABM framework was used for a case study of surgical vocal fold injury and repair. The new framework is capable of completing the simulation, visualization and remote result delivery in under 7 s per iteration, where each iteration of the simulation represents 30 min in the real world. The case study model was simulated at the physiological scale of a human vocal fold. This simulation tracks 17 million biological cells as well as a total of 1.7 billion signaling chemical and structural protein data points. The visualization component processes and renders all simulated biological cells and 154 million signaling chemical data points. The proposed high-performance 3D ABM was verified through comparisons with empirical vocal fold data. Representative trends of biomarker predictions in surgically injured vocal folds were observed. PMID:29706894
HACC: Simulating sky surveys on state-of-the-art supercomputing architectures

NASA Astrophysics Data System (ADS)

Habib, Salman; Pope, Adrian; Finkel, Hal; Frontiere, Nicholas; Heitmann, Katrin; Daniel, David; Fasel, Patricia; Morozov, Vitali; Zagaris, George; Peterka, Tom; Vishwanath, Venkatram; Lukić, Zarija; Sehrish, Saba; Liao, Wei-keng

2016-01-01

Current and future surveys of large-scale cosmic structure are associated with a massive and complex datastream to study, characterize, and ultimately understand the physics behind the two major components of the 'Dark Universe', dark energy and dark matter. In addition, the surveys also probe primordial perturbations and carry out fundamental measurements, such as determining the sum of neutrino masses. Large-scale simulations of structure formation in the Universe play a critical role in the interpretation of the data and extraction of the physics of interest. Just as survey instruments continue to grow in size and complexity, so do the supercomputers that enable these simulations. Here we report on HACC (Hardware/Hybrid Accelerated Cosmology Code), a recently developed and evolving cosmology N-body code framework, designed to run efficiently on diverse computing architectures and to scale to millions of cores and beyond. HACC can run on all current supercomputer architectures and supports a variety of programming models and algorithms. It has been demonstrated at scale on Cell- and GPU-accelerated systems, standard multi-core node clusters, and Blue Gene systems. HACC's design allows for ease of portability, and at the same time, high levels of sustained performance on the fastest supercomputers available. We present a description of the design philosophy of HACC, the underlying algorithms and code structure, and outline implementation details for several specific architectures. We show selected accuracy and performance results from some of the largest high resolution cosmological simulations so far performed, including benchmarks evolving more than 3.6 trillion particles.
HACC: Simulating sky surveys on state-of-the-art supercomputing architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Habib, Salman; Pope, Adrian; Finkel, Hal

2016-01-01

Current and future surveys of large-scale cosmic structure are associated with a massive and complex datastream to study, characterize, and ultimately understand the physics behind the two major components of the ‘Dark Universe’, dark energy and dark matter. In addition, the surveys also probe primordial perturbations and carry out fundamental measurements, such as determining the sum of neutrino masses. Large-scale simulations of structure formation in the Universe play a critical role in the interpretation of the data and extraction of the physics of interest. Just as survey instruments continue to grow in size and complexity, so do the supercomputers thatmore » enable these simulations. Here we report on HACC (Hardware/Hybrid Accelerated Cosmology Code), a recently developed and evolving cosmology N-body code framework, designed to run efficiently on diverse computing architectures and to scale to millions of cores and beyond. HACC can run on all current supercomputer architectures and supports a variety of programming models and algorithms. It has been demonstrated at scale on Cell- and GPU-accelerated systems, standard multi-core node clusters, and Blue Gene systems. HACC’s design allows for ease of portability, and at the same time, high levels of sustained performance on the fastest supercomputers available. We present a description of the design philosophy of HACC, the underlying algorithms and code structure, and outline implementation details for several specific architectures. We show selected accuracy and performance results from some of the largest high resolution cosmological simulations so far performed, including benchmarks evolving more than 3.6 trillion particles.« less
Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST.

PubMed

Baele, Guy; Lemey, Philippe; Rambaut, Andrew; Suchard, Marc A

2017-06-15

Advances in sequencing technology continue to deliver increasingly large molecular sequence datasets that are often heavily partitioned in order to accurately model the underlying evolutionary processes. In phylogenetic analyses, partitioning strategies involve estimating conditionally independent models of molecular evolution for different genes and different positions within those genes, requiring a large number of evolutionary parameters that have to be estimated, leading to an increased computational burden for such analyses. The past two decades have also seen the rise of multi-core processors, both in the central processing unit (CPU) and Graphics processing unit processor markets, enabling massively parallel computations that are not yet fully exploited by many software packages for multipartite analyses. We here propose a Markov chain Monte Carlo (MCMC) approach using an adaptive multivariate transition kernel to estimate in parallel a large number of parameters, split across partitioned data, by exploiting multi-core processing. Across several real-world examples, we demonstrate that our approach enables the estimation of these multipartite parameters more efficiently than standard approaches that typically use a mixture of univariate transition kernels. In one case, when estimating the relative rate parameter of the non-coding partition in a heterochronous dataset, MCMC integration efficiency improves by > 14-fold. Our implementation is part of the BEAST code base, a widely used open source software package to perform Bayesian phylogenetic inference. guy.baele@kuleuven.be. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
A parallel and sensitive software tool for methylation analysis on multicore platforms.

PubMed

Tárraga, Joaquín; Pérez, Mariano; Orduña, Juan M; Duato, José; Medina, Ignacio; Dopazo, Joaquín

2015-10-01

DNA methylation analysis suffers from very long processing time, as the advent of Next-Generation Sequencers has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that performs the analysis of these samples. The existing software for methylation analysis does not seem to scale efficiently neither with the size of the dataset nor with the length of the reads to be analyzed. As it is expected that the sequencers will provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. We present a new software tool, called HPG-Methyl, which efficiently maps bisulphite sequencing reads on DNA, analyzing DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is exclusively employed to deal with the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms in both execution time and sensitivity state-of-the-art software such as Bismark, BS-Seeker or BSMAP, particularly for long bisulphite reads. Software in the form of C libraries and functions, together with instructions to compile and execute this software. Available by sftp to anonymous@clariano.uv.es (password 'anonymous'). juan.orduna@uv.es or jdopazo@cipf.es. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

PubMed Central

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758

FastGCN: a GPU accelerated tool for fast gene co-expression networks.

PubMed

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
Cellular automata-based modelling and simulation of biofilm structure on multi-core computers.

PubMed

Skoneczny, Szymon

2015-01-01

The article presents a mathematical model of biofilm growth for aerobic biodegradation of a toxic carbonaceous substrate. Modelling of biofilm growth has fundamental significance in numerous processes of biotechnology and mathematical modelling of bioreactors. The process following double-substrate kinetics with substrate inhibition proceeding in a biofilm has not been modelled so far by means of cellular automata. Each process in the model proposed, i.e. diffusion of substrates, uptake of substrates, growth and decay of microorganisms and biofilm detachment, is simulated in a discrete manner. It was shown that for flat biofilm of constant thickness, the results of the presented model agree with those of a continuous model. The primary outcome of the study was to propose a mathematical model of biofilm growth; however a considerable amount of focus was also placed on the development of efficient algorithms for its solution. Two parallel algorithms were created, differing in the way computations are distributed. Computer programs were created using OpenMP Application Programming Interface for C++ programming language. Simulations of biofilm growth were performed on three high-performance computers. Speed-up coefficients of computer programs were compared. Both algorithms enabled a significant reduction of computation time. It is important, inter alia, in modelling and simulation of bioreactor dynamics.
Flexbar 3.0 - SIMD and multicore parallelization.

PubMed

Roehr, Johannes T; Dieterich, Christoph; Reinert, Knut

2017-09-15

High-throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next-generation sequencing data. Flexbar performs demultiplexing based on barcodes and adapter trimming for such data. The massive amounts of data generated on modern sequencing machines demand that this preprocessing is done as efficiently as possible. We present Flexbar 3.0, the successor of the popular program Flexbar. It employs now twofold parallelism: multi-threading and additionally SIMD vectorization. Both types of parallelism are used to speed-up the computation of pair-wise sequence alignments, which are used for the detection of barcodes and adapters. Furthermore, new features were included to cover a wide range of applications. We evaluated the performance of Flexbar based on a simulated sequencing dataset. Our program outcompetes other tools in terms of speed and is among the best tools in the presented quality benchmark. https://github.com/seqan/flexbar. johannes.roehr@fu-berlin.de or knut.reinert@fu-berlin.de. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Spaceborne Processor Array

NASA Technical Reports Server (NTRS)

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Scheduling Operations for Massive Heterogeneous Clusters

NASA Technical Reports Server (NTRS)

Humphrey, John; Spagnoli, Kyle

2013-01-01

High-performance computing (HPC) programming has become increasingly difficult with the advent of hybrid supercomputers consisting of multicore CPUs and accelerator boards such as the GPU. Manual tuning of software to achieve high performance on this type of machine has been performed by programmers. This is needlessly difficult and prone to being invalidated by new hardware, new software, or changes in the underlying code. A system was developed for task-based representation of programs, which when coupled with a scheduler and runtime system, allows for many benefits, including higher performance and utilization of computational resources, easier programming and porting, and adaptations of code during runtime. The system consists of a method of representing computer algorithms as a series of data-dependent tasks. The series forms a graph, which can be scheduled for execution on many nodes of a supercomputer efficiently by a computer algorithm. The schedule is executed by a dispatch component, which is tailored to understand all of the hardware types that may be available within the system. The scheduler is informed by a cluster mapping tool, which generates a topology of available resources and their strengths and communication costs. Software is decoupled from its hardware, which aids in porting to future architectures. A computer algorithm schedules all operations, which for systems of high complexity (i.e., most NASA codes), cannot be performed optimally by a human. The system aids in reducing repetitive code, such as communication code, and aids in the reduction of redundant code across projects. It adds new features to code automatically, such as recovering from a lost node or the ability to modify the code while running. In this project, the innovators at the time of this reporting intend to develop two distinct technologies that build upon each other and both of which serve as building blocks for more efficient HPC usage. First is the scheduling and dynamic execution framework, and the second is scalable linear algebra libraries that are built directly on the former.
Evaluation of Emerging Energy-Efficient Heterogeneous Computing Platforms for Biomolecular and Cellular Simulation Workloads

PubMed Central

Stone, John E.; Hallock, Michael J.; Phillips, James C.; Peterson, Joseph R.; Luthey-Schulten, Zaida; Schulten, Klaus

2016-01-01

Many of the continuing scientific advances achieved through computational biology are predicated on the availability of ongoing increases in computational power required for detailed simulation and analysis of cellular processes on biologically-relevant timescales. A critical challenge facing the development of future exascale supercomputer systems is the development of new computing hardware and associated scientific applications that dramatically improve upon the energy efficiency of existing solutions, while providing increased simulation, analysis, and visualization performance. Mobile computing platforms have recently become powerful enough to support interactive molecular visualization tasks that were previously only possible on laptops and workstations, creating future opportunities for their convenient use for meetings, remote collaboration, and as head mounted displays for immersive stereoscopic viewing. We describe early experiences adapting several biomolecular simulation and analysis applications for emerging heterogeneous computing platforms that combine power-efficient system-on-chip multi-core CPUs with high-performance massively parallel GPUs. We present low-cost power monitoring instrumentation that provides sufficient temporal resolution to evaluate the power consumption of individual CPU algorithms and GPU kernels. We compare the performance and energy efficiency of scientific applications running on emerging platforms with results obtained on traditional platforms, identify hardware and algorithmic performance bottlenecks that affect the usability of these platforms, and describe avenues for improving both the hardware and applications in pursuit of the needs of molecular modeling tasks on mobile devices and future exascale computers. PMID:27516922
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nagasaka, Y; Matsuoka, S; Azad, A

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. Wemore » examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.« less
A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows.

PubMed

Blattner, Timothy; Keyrouz, Walid; Bhattacharyya, Shuvra S; Halem, Milton; Brady, Mary

2017-12-01

Designing applications for scalability is key to improving their performance in hybrid and cluster computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) improves programmer productivity when implementing hybrid workflows for multi-core and multi-GPU systems. The Hybrid Task Graph Scheduler (HTGS) is an abstract execution model, framework, and API that increases programmer productivity when implementing hybrid workflows for such systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, keeps multiple GPUs occupied, and uses all available compute resources. Through these abstractions, data motion and memory are explicit; this makes data locality decisions more accessible. To demonstrate the HTGS application program interface (API), we present implementations of two example algorithms: (1) a matrix multiplication that shows how easily task graphs can be used; and (2) a hybrid implementation of microscopy image stitching that reduces code size by ≈ 43% compared to a manually coded hybrid workflow implementation and showcases the minimal overhead of task graphs in HTGS. Both of the HTGS-based implementations show good performance. In image stitching the HTGS implementation achieves similar performance to the hybrid workflow implementation. Matrix multiplication with HTGS achieves 1.3× and 1.8× speedup over the multi-threaded OpenBLAS library for 16k × 16k and 32k × 32k size matrices, respectively.
Multifunctional PEG-carboxylate copolymer coated superparamagnetic iron oxide nanoparticles for biomedical application

NASA Astrophysics Data System (ADS)

Illés, Erzsébet; Szekeres, Márta; Tóth, Ildikó Y.; Szabó, Ákos; Iván, Béla; Turcu, Rodica; Vékás, Ladislau; Zupkó, István; Jaics, György; Tombácz, Etelka

2018-04-01

Biocompatible magnetite nanoparticles (MNPs) were prepared by post-coating the magnetic nanocores with a synthetic polymer designed specifically to shield the particles from non-specific interaction with cells. Poly(ethylene glycol) methyl ether methacrylate (PEGMA) macromonomers and acrylic acid (AA) small molecular monomers were chemically coupled by quasi-living atom transfer radical polymerization (ATRP) to a comb-like copolymer, P(PEGMA-co-AA) designated here as P(PEGMA-AA). The polymer contains pendant carboxylate moieties near the backbone and PEG side chains. It is able to bind spontaneously to MNPs; stabilize the particles electrostatically via the carboxylate moieties and sterically via the PEG moieties; provide high protein repellency via the structured PEG layer; and anchor bioactive proteins via peptide bond formation with the free carboxylate groups. The presence of the P(PEGMA-AA) coating was verified in XPS experiments. The electrosteric (i.e., combined electrostatic and steric) stabilization is efficient down to pH 4 (at 10 mM ionic strength). Static magnetization and AC susceptibility measurements showed that the P(PEGMA-AA)@MNPs are superparamagnetic with a saturation magnetization value of 55 emu/g and that both single core nanoparticles and multicore structures are present in the samples. The multicore components make our product well suited for magnetic hyperthermia applications (SAR values up to 17.44 W/g). In vitro biocompatibility, cell internalization, and magnetic hyperthermia studies demonstrate the excellent theranostic potential of our product.
Graphics Processing Units (GPU) and the Goddard Earth Observing System atmospheric model (GEOS-5): Implementation and Potential Applications

NASA Technical Reports Server (NTRS)

Putnam, William M.

2011-01-01

Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions
Some Future Software Engineering Opportunities and Challenges

NASA Astrophysics Data System (ADS)

Boehm, Barry

This paper provides an update and extension of a 2006 paper, “Some Future Trends and Implications for Systems and Software Engineering Processes,” Systems Engineering, Spring 2006. Some of its challenges and opportunities are similar, such as the need to simultaneously achieve high levels of both agility and assurance. Others have emerged as increasingly important, such as the challenges of dealing with ultralarge volumes of data, with multicore chips, and with software as a service. The paper is organized around eight relatively surprise-free trends and two “wild cards” whose trends and implications are harder to foresee. The eight surprise-free trends are:
Automatic detection and classification of obstacles with applications in autonomous mobile robots

NASA Astrophysics Data System (ADS)

Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.

2016-04-01

Hardware implementation of an automatic detection and classification of objects that can represent an obstacle for an autonomous mobile robot using stereo vision algorithms is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. This method is divided in two parts, the first one is the object detection step based on the distance from the objects to the camera and a BLOB analysis. The second part is the classification step that is based on visuals primitives and a SVM classifier. The proposed method is performed in GPU in order to reduce the processing time values. This is performed with help of hardware based on multi-core processors and GPU platform, using a NVIDIA R GeForce R GT640 graphic card and Matlab over a PC with Windows 10.
Parallel, distributed and GPU computing technologies in single-particle electron microscopy

PubMed Central

Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

2009-01-01

Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686
Experiments with a Parallel Multi-Objective Evolutionary Algorithm for Scheduling

NASA Technical Reports Server (NTRS)

Brown, Matthew; Johnston, Mark D.

2013-01-01

Evolutionary multi-objective algorithms have great potential for scheduling in those situations where tradeoffs among competing objectives represent a key requirement. One challenge, however, is runtime performance, as a consequence of evolving not just a single schedule, but an entire population, while attempting to sample the Pareto frontier as accurately and uniformly as possible. The growing availability of multi-core processors in end user workstations, and even laptops, has raised the question of the extent to which such hardware can be used to speed up evolutionary algorithms. In this paper we report on early experiments in parallelizing a Generalized Differential Evolution (GDE) algorithm for scheduling long-range activities on NASA's Deep Space Network. Initial results show that significant speedups can be achieved, but that performance does not necessarily improve as more cores are utilized. We describe our preliminary results and some initial suggestions from parallelizing the GDE algorithm. Directions for future work are outlined.
Vectorization, threading, and cache-blocking considerations for hydrocodes on emerging architectures

DOE PAGES

Fung, J.; Aulwes, R. T.; Bement, M. T.; ...

2015-07-14

This work reports on considerations for improving computational performance in preparation for current and expected changes to computer architecture. The algorithms studied will include increasingly complex prototypes for radiation hydrodynamics codes, such as gradient routines and diffusion matrix assembly (e.g., in [1-6]). The meshes considered for the algorithms are structured or unstructured meshes. The considerations applied for performance improvements are meant to be general in terms of architecture (not specifically graphical processing unit (GPUs) or multi-core machines, for example) and include techniques for vectorization, threading, tiling, and cache blocking. Out of a survey of optimization techniques on applications such asmore » diffusion and hydrodynamics, we make general recommendations with a view toward making these techniques conceptually accessible to the applications code developer. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.« less
Parallel, distributed and GPU computing technologies in single-particle electron microscopy.

PubMed

Schmeisser, Martin; Heisen, Burkhard C; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

2009-07-01

Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined.
2nd Generation QUATARA Flight Computer Project

NASA Technical Reports Server (NTRS)

Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven

2015-01-01

Single core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators and transmitting/receiving from/to a ground station. In addition, this flight computer will be designed to be fault tolerant by creating both a robust physical hardware connection and by using a software voting scheme to determine the processor's performance. This voting scheme will leverage on the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.
Towards Batched Linear Solvers on Accelerated Hardware Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Haidar, Azzam; Dong, Tingzing Tim; Tomov, Stanimire

2015-01-01

As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers, is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations: LU, QR, and Cholesky; that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representingmore » the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of GPU's symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-sockets, Intel Sandy Bridge server. Compared to a batched LU factorization featured in the NVIDIA's CUBLAS library for GPUs, we achieves up to 2.5-fold speedup on the K40 GPU.« less
Multicore Education through Simulation

ERIC Educational Resources Information Center

Ozturk, O.

2011-01-01

A project-oriented course for advanced undergraduate and graduate students is described for simulating multiple processor cores. Simics, a free simulator for academia, was utilized to enable students to explore computer architecture, operating systems, and hardware/software cosimulation. Motivation for including this course in the curriculum is…
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-code Processors

NASA Astrophysics Data System (ADS)

Linderman, R.; Spetka, S.; Fitzgerald, D.; Emeny, S.

The Physically-Constrained Iterative Deconvolution (PCID) image deblurring code is being ported to heterogeneous networks of multi-core systems, including Intel Xeons and IBM Cell Broadband Engines. This paper reports results from experiments using the JAWS supercomputer at MHPCC (60 TFLOPS of dual-dual Xeon nodes linked with Infiniband) and the Cell Cluster at AFRL in Rome, NY. The Cell Cluster has 52 TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes Infiniband, 10 Gigabit Ethernet and 1 Gigabit Ethernet to each of the 336 PS3s. The results compare approaches to parallelizing FFT executions across the Xeons and the Cell's Synergistic Processing Elements (SPEs) for frame-level image processing. The experiments included Intel's Performance Primitives and Math Kernel Library, FFTW3.2, and Carnegie Mellon's SPIRAL. Optimization of FFTs in the PCID code led to a decrease in relative processing time for FFTs. Profiling PCID version 6.2, about one year ago, showed the 13 functions that accounted for the highest percentage of processing were all FFT processing functions. They accounted for over 88% of processing time in one run on Xeons. FFT optimizations led to improvement in the current PCID version 8.0. A recent profile showed that only two of the 19 functions with the highest processing time were FFT processing functions. Timing measurements showed that FFT processing for PCID version 8.0 has been reduced to less than 19% of overall processing time. We are working toward a goal of scaling to 200-400 cores per job (1-2 imagery frames/core). Running a pair of cores on each set of frames reduces latency by implementing parallel FFT processing. Our current results show scaling well out to 100 pairs of cores. These results support the next higher level of parallelism in PCID, where groups of several hundred frames each producing one resolved image are sent to cliques of several hundred cores in a round robin fashion. Current efforts toward further performance enhancement for PCID are shifting toward using the Playstations in conjunction with the Xeons to take advantage of outstanding price/performance as well as the Flops/Watt cost advantage. We are fine-tuning the PCID parallization strategy to balance processing over Xeons and Cell BEs to find an optimal partitioning of PCID over the heterogeneous processors. A high performance information management system that exploits native Infiniband multicast is used to improve latency among the head nodes. Using a publication/subscription oriented information management system to implement a unified communications platform makes runs on large HPCs with thousands of intercommunicating cores more flexible and more fault tolerant. It features a loose couplingof publishers to subscribers through intervening brokers. We are also working on enhancing performance for both Xeons and Cell BEs, buy moving selected operations to single precision. Techniques for adapting the code to single precision and performance results are reported.

GPU accelerated dynamic functional connectivity analysis for functional MRI data.

PubMed

Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu

2015-07-01

Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
Effect of luting agents on the tensile bond strength of glass fiber posts: An in vitro study.

PubMed

Aleisa, Khalil; Al-Dwairi, Ziad N; Alghabban, Rawda; Goodacre, Charles J

2013-09-01

Fiber posts can fail because of loss of retention; and it is unknown which luting agent provides the highest bond strength. The purpose of this study was to investigate the tensile bond strength of glass fiber posts luted to premolar teeth with 6 resin composite luting agents. Ninety-six single-rooted extracted human mandibular premolars were sectioned 2 mm coronal to the most incisal point of the cementoenamel junction. Root canals were instrumented and obturated with laterally condensed gutta percha and root canal sealer (AH26). Gutta percha was removed from the canals to a depth of 8 mm and diameter post spaces with a 1.5 mm were prepared. The specimens were divided into the following 6 groups according to the luting agent used (n=16): Group V, Variolink II; Group A, RelyX ARC; Group N, Multilink N; Group U, RelyX Unicem; Group P, ParaCore; Group F, MultiCore Flow. Each specimen was secured in a universal testing machine and a separating load was applied at a rate of 0.5 mm/min. The forces required to dislodge the posts were recorded. A 1-way analysis of variance (ANOVA) was applied to the mean retentive strengths of various cement materials (α=.05). Significant differences were recorded among the 6 cement types (P<.001). Three materials provided statistically equivalent mean bond strengths (RelyX Unicem, Paracore, and MultiCore Flow) that were significantly greater than for the other 3 materials. Fiber posts luted with RelyX Unicem, Paracore, and MultiCore Flow demonstrated significantly higher bond strengths. Copyright © 2013 The Editorial Council of the Journal of Prosthetic Dentistry. Published by Mosby, Inc. All rights reserved.
CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing.

PubMed

Angiuoli, Samuel V; Matalka, Malcolm; Gussman, Aaron; Galens, Kevin; Vangala, Mahesh; Riley, David R; Arze, Cesar; White, James R; White, Owen; Fricke, W Florian

2011-08-30

Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom built virtual machines to distribute pre-packaged with pre-configured software. We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition CloVR supports use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms. The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing.
A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System.

PubMed

Zhang, Zhen; Ma, Cheng; Zhu, Rong

2017-08-23

Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose a FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output, (MIMO) real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas.
A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System

PubMed Central

Zhang, Zhen; Zhu, Rong

2017-01-01

Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose a FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output, (MIMO) real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas. PMID:28832522
Encapsulating micro-nano Si/SiO x into conjugated nitrogen-doped carbon as binder-free monolithic anodes for advanced lithium ion batteries

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, Jing; Zhou, Meijuan; Tan, Guoqiang

2015-01-01

Silicon monoxide, a promising silicon-based anode candidate for lithium-ion batteries, has recently attracted much attention for its high theoretical capacity, good cycle stability, low cost, and environmental benignity. Currently, the most critical challenge is to improve its low initial coulombic efficiency and significant volume changes during the charge–discharge processes. Herein, we report a binder-free monolithic electrode structure based on directly encapsulating micro-nano Si/SiOx particles into conjugated nitrogen-doped carbon frameworks to form monolithic, multi-core, cross-linking composite matrices. We utilize micro-nano Si/SiOx reduced by high-energy ball-milling SiO as active materials, and conjugated nitrogen-doped carbon formed by the pyrolysis of polyacrylonitrile both asmore » binders and conductive agents. Owing to the high electrochemical activity of Si/SiOx and the good mechanical resiliency of conjugated nitrogen-doped carbon backbones, this specific composite structure enhances the utilization efficiency of SiO and accommodates its large volume expansion, as well as its good ionic and electronic conductivity. The annealed Si/SiOx/polyacrylonitrile composite electrode exhibits excellent electrochemical properties, including a high initial reversible capacity (2734 mA h g-1 with 75% coulombic efficiency), stable cycle performance (988 mA h g-1 after 100 cycles), and good rate capability (800 mA h g-1 at 1 A g-1 rate). Because the composite is naturally abundant and shows such excellent electrochemical performance, it is a promising anode candidate material for lithium-ion batteries. The binder-free monolithic architectural design also provides an effective way to prepare other monolithic electrode materials for advanced lithium-ion batteries.« less
Integrated cladding-pumped multicore few-mode erbium-doped fibre amplifier for space-division-multiplexed communications

NASA Astrophysics Data System (ADS)

Chen, H.; Jin, C.; Huang, B.; Fontaine, N. K.; Ryf, R.; Shang, K.; Grégoire, N.; Morency, S.; Essiambre, R.-J.; Li, G.; Messaddeq, Y.; Larochelle, S.

2016-08-01

Space-division multiplexing (SDM), whereby multiple spatial channels in multimode and multicore optical fibres are used to increase the total transmission capacity per fibre, is being investigated to avert a data capacity crunch and reduce the cost per transmitted bit. With the number of channels employed in SDM transmission experiments continuing to rise, there is a requirement for integrated SDM components that are scalable. Here, we demonstrate a cladding-pumped SDM erbium-doped fibre amplifier (EDFA) that consists of six uncoupled multimode erbium-doped cores. Each core supports three spatial modes, which enables the EDFA to amplify a total of 18 spatial channels (six cores × three modes) simultaneously with a single pump diode and a complexity similar to a single-mode EDFA. The amplifier delivers >20 dBm total output power per core and <7 dB noise figure over the C-band. This cladding-pumped EDFA enables combined space-division and wavelength-division multiplexed transmission over multiple multimode fibre spans.
IGA-ADS: Isogeometric analysis FEM using ADS solver

NASA Astrophysics Data System (ADS)

Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

2017-08-01

In this paper we present a fast explicit solver for solution of non-stationary problems using L2 projections with isogeometric finite element method. The solver has been implemented within GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, implementation of exemplary three PDEs, and execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
Shape Sensing Using a Multi-Core Optical Fiber Having an Arbitrary Initial Shape in the Presence of Extrinsic Forces

NASA Technical Reports Server (NTRS)

Rogge, Matthew D. (Inventor); Moore, Jason P. (Inventor)

2014-01-01

Shape of a multi-core optical fiber is determined by positioning the fiber in an arbitrary initial shape and measuring strain over the fiber's length using strain sensors. A three-coordinate p-vector is defined for each core as a function of the distance of the corresponding cores from a center point of the fiber and a bending angle of the cores. The method includes calculating, via a controller, an applied strain value of the fiber using the p-vector and the measured strain for each core, and calculating strain due to bending as a function of the measured and the applied strain values. Additionally, an apparent local curvature vector is defined for each core as a function of the calculated strain due to bending. Curvature and bend direction are calculated using the apparent local curvature vector, and fiber shape is determined via the controller using the calculated curvature and bend direction.
Fault Mitigation Schemes for Future Spaceflight Multicore Processors

NASA Technical Reports Server (NTRS)

Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.

2012-01-01

Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
Can magneto-plasmonic nanohybrids efficiently combine photothermia with magnetic hyperthermia?

NASA Astrophysics Data System (ADS)

Espinosa, Ana; Bugnet, Mathieu; Radtke, Guillaume; Neveu, Sophie; Botton, Gianluigi A.; Wilhelm, Claire; Abou-Hassan, Ali

2015-11-01

Multifunctional hybrid-design nanomaterials appear to be a promising route to meet the current therapeutics needs required for efficient cancer treatment. Herein, two efficient heat nano-generators were combined into a multifunctional single nanohybrid (a multi-core iron oxide nanoparticle optimized for magnetic hyperthermia, and a gold branched shell with tunable plasmonic properties in the NIR region, for photothermal therapy) which impressively enhanced heat generation, in suspension or in vivo in tumours, opening up exciting new therapeutic perspectives.Multifunctional hybrid-design nanomaterials appear to be a promising route to meet the current therapeutics needs required for efficient cancer treatment. Herein, two efficient heat nano-generators were combined into a multifunctional single nanohybrid (a multi-core iron oxide nanoparticle optimized for magnetic hyperthermia, and a gold branched shell with tunable plasmonic properties in the NIR region, for photothermal therapy) which impressively enhanced heat generation, in suspension or in vivo in tumours, opening up exciting new therapeutic perspectives. Electronic supplementary information (ESI) available. See DOI: 10.1039/c5nr06168g
Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

NASA Astrophysics Data System (ADS)

Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

2018-05-01

Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
Roofline model toolkit: A practical tool for architectural and program analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian

We present preliminary results of the Roofline Toolkit for multicore, many core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measuremore » sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.« less
Experimental evaluation of the impact of packet capturing tools for web services.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Choe, Yung Ryn; Mohapatra, Prasant; Chuah, Chen-Nee

Network measurement is a discipline that provides the techniques to collect data that are fundamental to many branches of computer science. While many capturing tools and comparisons have made available in the literature and elsewhere, the impact of these packet capturing tools on existing processes have not been thoroughly studied. While not a concern for collection methods in which dedicated servers are used, many usage scenarios of packet capturing now requires the packet capturing tool to run concurrently with operational processes. In this work we perform experimental evaluations of the performance impact that packet capturing process have on web-based services;more » in particular, we observe the impact on web servers. We find that packet capturing processes indeed impact the performance of web servers, but on a multi-core system the impact varies depending on whether the packet capturing and web hosting processes are co-located or not. In addition, the architecture and behavior of the web server and process scheduling is coupled with the behavior of the packet capturing process, which in turn also affect the web server's performance.« less
Hexabundles: first results

NASA Astrophysics Data System (ADS)

Bryant, Julia J.; O'Byrne, John W.; Bland-Hawthorn, Joss; Leon-Saval, Sergio G.

2010-07-01

New multi-core imaging fibre bundles - hexabundles - being developed at the University of Sydney will provide simultaneous integral field spectroscopy for hundreds of celestial sources across a wide angular field. These are a natural progression from the use of single fibres in existing galaxy surveys. Hexabundles will allow us to address fundamental questions in astronomy without the biases introduced by a fixed entrance aperture. We have begun to consider instrument concepts that exploit hundreds of hexabundles over the widest possible field of view. To this end, we have characterised the performance of a 61-core fully fused hexabundle and 5 unfused bundles with 7 cores each. All fibres in bundles have 100 micron cores. In the fused bundle, the cores are distorted from a circular shape in order to achieve a higher fill fraction. The unfused bundles have circular cores and five different cladding thicknesses which affect the fill fraction. We compare the optical performance of all 6 bundles and find that the advantage of smaller interstitial holes (higher fill fraction) is outweighed by the increase in FRD, crosstalk and the poor optical performance caused by the deformation of the fibre cores. Uniformly high throughput and low cross-talk are essential for imaging faint astronomical targets with sufficient resolution to disentangle the dynamical structure. Devices already under development will have 100-200 unfused cores, although larger formats are feasible. The light-weight packaging of hexabundles is sufficiently flexible to allow existing robotic positioners to make use of them.
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

DOE Office of Scientific and Technical Information (OSTI.GOV)

Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.

Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patternsmore » (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.« less
Preconditioned characteristic boundary conditions based on artificial compressibility method for solution of incompressible flows

NASA Astrophysics Data System (ADS)

Hejranfar, Kazem; Parseh, Kaveh

2017-09-01

The preconditioned characteristic boundary conditions based on the artificial compressibility (AC) method are implemented at artificial boundaries for the solution of two- and three-dimensional incompressible viscous flows in the generalized curvilinear coordinates. The compatibility equations and the corresponding characteristic variables (or the Riemann invariants) are mathematically derived and then applied as suitable boundary conditions in a high-order accurate incompressible flow solver. The spatial discretization of the resulting system of equations is carried out by the fourth-order compact finite-difference (FD) scheme. In the preconditioning applied here, the value of AC parameter in the flow field and also at the far-field boundary is automatically calculated based on the local flow conditions to enhance the robustness and performance of the solution algorithm. The code is fully parallelized using the Concurrency Runtime standard and Parallel Patterns Library (PPL) and its performance on a multi-core CPU is analyzed. The incompressible viscous flows around a 2-D circular cylinder, a 2-D NACA0012 airfoil and also a 3-D wavy cylinder are simulated and the accuracy and performance of the preconditioned characteristic boundary conditions applied at the far-field boundaries are evaluated in comparison to the simplified boundary conditions and the non-preconditioned characteristic boundary conditions. It is indicated that the preconditioned characteristic boundary conditions considerably improve the convergence rate of the solution of incompressible flows compared to the other boundary conditions and the computational costs are significantly decreased.
FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems.

PubMed

Abella, Monica; Serrano, Estefania; Garcia-Blas, Javier; García, Ines; de Molina, Claudia; Carretero, Jesus; Desco, Manuel

2017-01-01

The availability of digital X-ray detectors, together with advances in reconstruction algorithms, creates an opportunity for bringing 3D capabilities to conventional radiology systems. The downside is that reconstruction algorithms for non-standard acquisition protocols are generally based on iterative approaches that involve a high computational burden. The development of new flexible X-ray systems could benefit from computer simulations, which may enable performance to be checked before expensive real systems are implemented. The development of simulation/reconstruction algorithms in this context poses three main difficulties. First, the algorithms deal with large data volumes and are computationally expensive, thus leading to the need for hardware and software optimizations. Second, these optimizations are limited by the high flexibility required to explore new scanning geometries, including fully configurable positioning of source and detector elements. And third, the evolution of the various hardware setups increases the effort required for maintaining and adapting the implementations to current and future programming models. Previous works lack support for completely flexible geometries and/or compatibility with multiple programming models and platforms. In this paper, we present FUX-Sim, a novel X-ray simulation/reconstruction framework that was designed to be flexible and fast. Optimized implementation for different families of GPUs (CUDA and OpenCL) and multi-core CPUs was achieved thanks to a modularized approach based on a layered architecture and parallel implementation of the algorithms for both architectures. A detailed performance evaluation demonstrates that for different system configurations and hardware platforms, FUX-Sim maximizes performance with the CUDA programming model (5 times faster than other state-of-the-art implementations). Furthermore, the CPU and OpenCL programming models allow FUX-Sim to be executed over a wide range of hardware platforms.
Effects of rare earth doping on multi-core iron oxide nanoparticles properties

NASA Astrophysics Data System (ADS)

Petran, Anca; Radu, Teodora; Borodi, Gheorghe; Nan, Alexandrina; Suciu, Maria; Turcu, Rodica

2018-01-01

New multi-core iron oxide magnetic nanoparticles doped with rare earth metals (Gd, Eu) were obtained by a one step synthesis procedure using a solvothermal method for potential biomedical applications. The obtained clusters were characterized by X-ray diffraction (XRD), transmission electron microscopy (TEM), energy-dispersive X-ray microanalysis (EDX), X-ray photoelectron spectroscopy (XPS) and magnetization measurements. They possess high colloidal stability, a saturation magnetization of up to 52 emu/g, and nearly spherical shape. The presence of rare earth ions in the obtained samples was confirmed by EDX and XPS. XRD analysis proved the homogeneous distribution of the trivalent rare earth ions in the inverse-spinel structure of magnetite and the increase of crystal strain upon doping the samples. XPS study reveals the valence state and the cation distribution on the octahedral and tetrahedral sites of the analysed samples. The observed shift of the XPS valence band spectra maximum in the direction of higher binding energies after rare earth doping, as well as theoretical valence band calculations prove the presence of Gd and Eu ions in octahedral sites. The blood protein adsorption ability of the obtained samples surface, the most important factor of the interaction between biomaterials and body fluids, was assessed by interaction with bovine serum albumin (BSA). The rare earth doped clusters surface show higher afinity for binding BSA. In vitro cytotoxicity test results for the studied samples showed no cytotoxicity in low and medium doses, establishing a potential perspective for rare earth doped MNC to facilitate multiple therapies in a single formulation for cancer theranostics.
Experiment Pamir-3. Coplanar emission of high energy gamma-quanta at interaction of hadrons with nuclei of air atoms at energies above 10 to the 7th power GeV

NASA Technical Reports Server (NTRS)

Asatiani, T. L.; Genina, L. E.; Zatsepin, G. T.

1985-01-01

A systematic analysis of large gamma families, detected in X-ray emulsion chambers, cases of multicore halos have been observed, and among them five events in which the halo is divided into three of four separate cores with their alignment observed in the target diagram (coplanarity of axes of corresponding electron photon cascades). The halo alignment (tendency to the straight line) leads to the aximuthal asymmetry (thrust). The analysis of lateral and momentum distributions of particles in these families shows that they also have thrust that correlates with the direction of the halo core alignment.

Multithreaded Stochastic PDES for Reactions and Diffusions in Neurons.

PubMed

Lin, Zhongwei; Tropper, Carl; Mcdougal, Robert A; Patoary, Mohammand Nazrul Ishlam; Lytton, William W; Yao, Yiping; Hines, Michael L

2017-07-01

Cells exhibit stochastic behavior when the number of molecules is small. Hence a stochastic reaction-diffusion simulator capable of working at scale can provide a more accurate view of molecular dynamics within the cell. This paper describes a parallel discrete event simulator, Neuron Time Warp-Multi Thread (NTW-MT), developed for the simulation of reaction diffusion models of neurons. To the best of our knowledge, this is the first parallel discrete event simulator oriented towards stochastic simulation of chemical reactions in a neuron. The simulator was developed as part of the NEURON project. NTW-MT is optimistic and thread-based, which attempts to capitalize on multi-core architectures used in high performance machines. It makes use of a multi-level queue for the pending event set and a single roll-back message in place of individual anti-messages to disperse contention and decrease the overhead of processing rollbacks. Global Virtual Time is computed asynchronously both within and among processes to get rid of the overhead for synchronizing threads. Memory usage is managed in order to avoid locking and unlocking when allocating and de-allocating memory and to maximize cache locality. We verified our simulator on a calcium buffer model. We examined its performance on a calcium wave model, comparing it to the performance of a process based optimistic simulator and a threaded simulator which uses a single priority queue for each thread. Our multi-threaded simulator is shown to achieve superior performance to these simulators. Finally, we demonstrated the scalability of our simulator on a larger CICR model and a more detailed CICR model.
Use of general purpose graphics processing units with MODFLOW

USGS Publications Warehouse

Hughes, Joseph D.; White, Jeremy T.

2013-01-01

To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the central processing unit CPU and GPCPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
Scalable geocomputation: evolving an environmental model building platform from single-core to supercomputers

NASA Astrophysics Data System (ADS)

Schmitz, Oliver; de Jong, Kor; Karssenberg, Derek

2017-04-01

There is an increasing demand to run environmental models on a big scale: simulations over large areas at high resolution. The heterogeneity of available computing hardware such as multi-core CPUs, GPUs or supercomputer potentially provides significant computing power to fulfil this demand. However, this requires detailed knowledge of the underlying hardware, parallel algorithm design and the implementation thereof in an efficient system programming language. Domain scientists such as hydrologists or ecologists often lack this specific software engineering knowledge, their emphasis is (and should be) on exploratory building and analysis of simulation models. As a result, models constructed by domain specialists mostly do not take full advantage of the available hardware. A promising solution is to separate the model building activity from software engineering by offering domain specialists a model building framework with pre-programmed building blocks that they combine to construct a model. The model building framework, consequently, needs to have built-in capabilities to make full usage of the available hardware. Developing such a framework providing understandable code for domain scientists and being runtime efficient at the same time poses several challenges on developers of such a framework. For example, optimisations can be performed on individual operations or the whole model, or tasks need to be generated for a well-balanced execution without explicitly knowing the complexity of the domain problem provided by the modeller. Ideally, a modelling framework supports the optimal use of available hardware whichsoever combination of model building blocks scientists use. We demonstrate our ongoing work on developing parallel algorithms for spatio-temporal modelling and demonstrate 1) PCRaster, an environmental software framework (http://www.pcraster.eu) providing spatio-temporal model building blocks and 2) parallelisation of about 50 of these building blocks using the new Fern library (https://github.com/geoneric/fern/), an independent generic raster processing library. Fern is a highly generic software library and its algorithms can be configured according to the configuration of a modelling framework. With manageable programming effort (e.g. matching data types between programming and domain language) we created a binding between Fern and PCRaster. The resulting PCRaster Python multicore module can be used to execute existing PCRaster models without having to make any changes to the model code. We show initial results on synthetic and geoscientific models indicating significant runtime improvements provided by parallel local and focal operations. We further outline challenges in improving remaining algorithms such as flow operations over digital elevation maps and further potential improvements like enhancing disk I/O.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

PubMed Central

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-01-01

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Computer-intensive simulation of solid-state NMR experiments using SIMPSON.

PubMed

Tošner, Zdeněk; Andersen, Rasmus; Stevensson, Baltzar; Edén, Mattias; Nielsen, Niels Chr; Vosegaard, Thomas

2014-09-01

Conducting large-scale solid-state NMR simulations requires fast computer software potentially in combination with efficient computational resources to complete within a reasonable time frame. Such simulations may involve large spin systems, multiple-parameter fitting of experimental spectra, or multiple-pulse experiment design using parameter scan, non-linear optimization, or optimal control procedures. To efficiently accommodate such simulations, we here present an improved version of the widely distributed open-source SIMPSON NMR simulation software package adapted to contemporary high performance hardware setups. The software is optimized for fast performance on standard stand-alone computers, multi-core processors, and large clusters of identical nodes. We describe the novel features for fast computation including internal matrix manipulations, propagator setups and acquisition strategies. For efficient calculation of powder averages, we implemented interpolation method of Alderman, Solum, and Grant, as well as recently introduced fast Wigner transform interpolation technique. The potential of the optimal control toolbox is greatly enhanced by higher precision gradients in combination with the efficient optimization algorithm known as limited memory Broyden-Fletcher-Goldfarb-Shanno. In addition, advanced parallelization can be used in all types of calculations, providing significant time reductions. SIMPSON is thus reflecting current knowledge in the field of numerical simulations of solid-state NMR experiments. The efficiency and novel features are demonstrated on the representative simulations. Copyright © 2014 Elsevier Inc. All rights reserved.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

PubMed

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-04-07

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Geospace simulations on the Cell BE processor

NASA Astrophysics Data System (ADS)

Germaschewski, K.; Raeder, J.; Larson, D.

2008-12-01

OpenGGCM (Open Geospace General circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting of the MHD solver to the Cell BE architecture, a novel inhomogeneous multicore architecture capable of up to 230 GFlops per processor. Realizing this high performance on the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallel approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speed-up of a factor of 25 compared to just using the main processor, while still keeping the numerical implementation details of the code maintainable.
Active Storage with Analytics Capabilities and I/O Runtime System for Petascale Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Choudhary, Alok

Computational scientists must understand results from experimental, observational and computational simulation generated data to gain insights and perform knowledge discovery. As systems approach the petascale range, problems that were unimaginable a few years ago are within reach. With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive I/O, storage, acceleration of data manipulation, analysis, and mining tools. Scientists require techniques, tools and infrastructure to facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis, statistical analysis and knowledgemore » discovery. The goal of this work is to enable more effective analysis of scientific datasets through the integration of enhancements in the I/O stack, from active storage support at the file system layer to MPI-IO and high-level I/O library layers. We propose to provide software components to accelerate data analytics, mining, I/O, and knowledge discovery for large-scale scientific applications, thereby increasing productivity of both scientists and the systems. Our approaches include 1) design the interfaces in high-level I/O libraries, such as parallel netCDF, for applications to activate data mining operations at the lower I/O layers; 2) Enhance MPI-IO runtime systems to incorporate the functionality developed as a part of the runtime system design; 3) Develop parallel data mining programs as part of runtime library for server-side file system in PVFS file system; and 4) Prototype an active storage cluster, which will utilize multicore CPUs, GPUs, and FPGAs to carry out the data mining workload.« less
Scheduling for energy and reliability management on multiprocessor real-time systems

NASA Astrophysics Data System (ADS)

Qi, Xuan

Scheduling algorithms for multiprocessor real-time systems have been studied for years with many well-recognized algorithms proposed. However, it is still an evolving research area and many problems remain open due to their intrinsic complexities. With the emergence of multicore processors, it is necessary to re-investigate the scheduling problems and design/develop efficient algorithms for better system utilization, low scheduling overhead, high energy efficiency, and better system reliability. Focusing cluster schedulings with optimal global schedulers, we study the utilization bound and scheduling overhead for a class of cluster-optimal schedulers. Then, taking energy/power consumption into consideration, we developed energy-efficient scheduling algorithms for real-time systems, especially for the proliferating embedded systems with limited energy budget. As the commonly deployed energy-saving technique (e.g. dynamic voltage frequency scaling (DVFS)) will significantly affect system reliability, we study schedulers that have intelligent mechanisms to recuperate system reliability to satisfy the quality assurance requirements. Extensive simulation is conducted to evaluate the performance of the proposed algorithms on reduction of scheduling overhead, energy saving, and reliability improvement. The simulation results show that the proposed reliability-aware power management schemes could preserve the system reliability while still achieving substantial energy saving.
MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

PubMed

González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil

2016-12-15

MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. Source code in C ++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net CONTACT: jgonzalezd@udc.esSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
DiSCaMB: a software library for aspherical atom model X-ray scattering factor calculations with CPUs and GPUs.

PubMed

Chodkiewicz, Michał L; Migacz, Szymon; Rudnicki, Witold; Makal, Anna; Kalinowski, Jarosław A; Moriarty, Nigel W; Grosse-Kunstleve, Ralf W; Afonine, Pavel V; Adams, Paul D; Dominiak, Paulina Maria

2018-02-01

It has been recently established that the accuracy of structural parameters from X-ray refinement of crystal structures can be improved by using a bank of aspherical pseudoatoms instead of the classical spherical model of atomic form factors. This comes, however, at the cost of increased complexity of the underlying calculations. In order to facilitate the adoption of this more advanced electron density model by the broader community of crystallographers, a new software implementation called DiSCaMB , 'densities in structural chemistry and molecular biology', has been developed. It addresses the challenge of providing for high performance on modern computing architectures. With parallelization options for both multi-core processors and graphics processing units (using CUDA), the library features calculation of X-ray scattering factors and their derivatives with respect to structural parameters, gives access to intermediate steps of the scattering factor calculations (thus allowing for experimentation with modifications of the underlying electron density model), and provides tools for basic structural crystallographic operations. Permissively (MIT) licensed, DiSCaMB is an open-source C++ library that can be embedded in both academic and commercial tools for X-ray structure refinement.
STAMPS: Software Tool for Automated MRI Post-processing on a supercomputer.

PubMed

Bigler, Don C; Aksu, Yaman; Miller, David J; Yang, Qing X

2009-08-01

This paper describes a Software Tool for Automated MRI Post-processing (STAMP) of multiple types of brain MRIs on a workstation and for parallel processing on a supercomputer (STAMPS). This software tool enables the automation of nonlinear registration for a large image set and for multiple MR image types. The tool uses standard brain MRI post-processing tools (such as SPM, FSL, and HAMMER) for multiple MR image types in a pipeline fashion. It also contains novel MRI post-processing features. The STAMP image outputs can be used to perform brain analysis using Statistical Parametric Mapping (SPM) or single-/multi-image modality brain analysis using Support Vector Machines (SVMs). Since STAMPS is PBS-based, the supercomputer may be a multi-node computer cluster or one of the latest multi-core computers.
GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations.

PubMed

Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji

2015-07-01

GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
Manycore Performance-Portability: Kokkos Multidimensional Array Library

DOE PAGES

Edwards, H. Carter; Sunderland, Daniel; Porter, Vicki; ...

2012-01-01

Large, complex scientific and engineering application code have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel kernels and (3) multidimensional arrays. Kernel executionmore » performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. Optimal data access pattern can be different for different manycore devices – potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introduce device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].« less
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

NASA Astrophysics Data System (ADS)

Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin

2016-08-01

This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more notorious at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16-cores. Multi-GPU performance is highly convenient under a weak scaling setting, reaching up to 99 % efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32 , 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
Multi-core fiber amplifier arrays for intra-satellite links

NASA Astrophysics Data System (ADS)

Kechagias, Marios; Crabb, Jonathan; Stampoulidis, Leontios; Farzana, Jihan; Kehayas, Efstratios; Filipowicz, Marta; Napierala, Marek; Murawski, Michal; Nasilowski, Tomasz; Barbero, Juan

2017-09-01

In this paper we present erbium doped fibre (EDF) aimed at signal amplification within satellite photonic payload systems operating in C telecommunication band. In such volume-hungry applications, the use of advanced optical transmission techniques such as space division multiplexing (SDM) can be advantageous to reduce the component and cable count.
Exploration and Evaluation of Nanometer Low-power Multi-core VLSI Computer Architectures

DTIC Science & Technology

2015-03-01

ICC, the Milkway database was created using the command: milkyway –galaxy –nogui –tcl –log memory.log one.tcl As stated previously, it is...EDA tools. Typically, Synopsys® tools use Milkway databases, whereas, Cadence Design System® use Layout Exchange Format (LEF) formats. To help
Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

DTIC Science & Technology

2013-02-01

with the bias node ( gray ) denoted as ww and the weights associated with the remaining first layer nodes (black) denoted as W. In forming the overall...Implementation of RBF network on GPU Platform 3.5.1 The Cholesky decomposition algorithm We need to invert the matrix multiplication GTG to
A fast CT reconstruction scheme for a general multi-core PC.

PubMed

Zeng, Kai; Bai, Erwei; Wang, Ge

2007-01-01

Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.
Biodegradation mechanisms of iron oxide monocrystalline nanoflowers and tunable shield effect of gold coating.

PubMed

Javed, Yasir; Lartigue, Lénaic; Hugounenq, Pierre; Vuong, Quoc Lam; Gossuin, Yves; Bazzi, Rana; Wilhelm, Claire; Ricolleau, Christian; Gazeau, Florence; Alloyeau, Damien

2014-08-27

Understanding the relation between the structure and the reactivity of nanomaterials in the organism is a crucial step towards efficient and safe biomedical applications. The multi-scale approach reported here, allows following the magnetic and structural transformations of multicore maghemite nanoflowers in a medium mimicking intracellular lysosomal environment. By confronting atomic-scale and macroscopic information on the biodegradation of these complex nanostuctures, we can unravel the mechanisms involved in the critical alterations of their hyperthermic power and their Magnetic Resonance imaging T1 and T2 contrast effect. This transformation of multicore nanoparticles with outstanding magnetic properties into poorly magnetic single core clusters highlights the harmful influence of cellular medium on the therapeutic and diagnosis effectiveness of iron oxide-based nanomaterials. As biodegradation occurs through surface reactivity mechanism, we demonstrate that the inert activity of gold nanoshells can be exploited to protect iron oxide nanostructures. Such inorganic nanoshields could be a relevant strategy to modulate the degradability and ultimately the long term fate of nanomaterials in the organism. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.