mitigating multicore performance: Topics by Science.gov

Sample records for mitigating multicore performance

Fault Mitigation Schemes for Future Spaceflight Multicore Processors

NASA Technical Reports Server (NTRS)

Some, Rafi; Gostelow, Kim P.; Lai, John; Reder, Leonard; Alexander, James; Clement, Brad

2012-01-01

The goal of this work is to achieve fail-operational and graceful-degradation behavior of multicore processors in realistic flight mission scenarios such as Mars Entry-Descent-Landing (EDL) and Primitive Body proximity operations.
Case for a field-programmable gate array multicore hybrid machine for an image-processing application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

2011-01-01

General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs), have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
Multi-Core Processor Memory Contention Benchmark Analysis Case Study

NASA Technical Reports Server (NTRS)

Simon, Tyler; McGalliard, James

2009-01-01

Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
Enabling Future Robotic Missions with Multicore Processors

NASA Technical Reports Server (NTRS)

Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

2011-01-01

Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key architectural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.
Performance comparison of a fiber optic communication system based on optical OFDM and an optical OFDM-MIMO with Alamouti code by using numerical simulations

NASA Astrophysics Data System (ADS)

Serpa-Imbett, C. M.; Marín-Alfonso, J.; Gómez-Santamaría, C.; Betancur-Agudelo, L.; Amaya-Fernández, F.

2013-12-01

Space division multiplexing in multicore fibers is one of the most promising technologies for supporting the transmission needs of next-generation peta-to-exaflop-scale supercomputers and mega data centers, owing to the cost and space savings of the new optical fibers with multiple cores. Additionally, multicore fibers allow photonic signal processing in optical communication systems, taking advantage of mode coupling phenomena. In this work, we numerically simulate an optical MIMO-OFDM (multiple-input multiple-output orthogonal frequency division multiplexing) signal using the Alamouti code, transmitted through a twin-core fiber with low coupling. Furthermore, an optical OFDM signal is transmitted through the core of a single-mode fiber, using pilot-aided channel estimation. We compare the transmission performance in the twin-core fiber and in the single-mode fiber using numerical results for the bit-error rate, considering linear propagation and Gaussian noise through an optical fiber link. We simulate the optical fiber transmission of OFDM frames using 8 PSK and 16 QAM, with bit rates of 130 Gb/s and 170 Gb/s, respectively. We obtain a penalty of around 4 dB for the 8 PSK transmissions after 100 km of linear fiber optic propagation, for both the single-mode and the twin-core fiber. We obtain a penalty of around 6 dB for the 16 QAM transmissions, with linear propagation after 100 km of optical fiber. Transmission in the two-core fiber using Alamouti-coded OFDM-MIMO exhibits better performance, offering a good alternative for mitigating fiber impairments and allowing Alamouti coding to be extended to spatially multiplexed multichannel systems in multicore fibers.
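As a rough illustration of the Alamouti code named in this abstract, the sketch below shows the textbook two-transmitter, one-receiver form over a flat, noiseless channel. It is not the authors' optical simulation: the function names, the channel gains `h1` and `h2`, and the absence of noise and OFDM framing are all assumptions made for the example.

```python
def alamouti_encode(s1, s2):
    """Map two symbols onto two time slots for two transmitters.

    Slot 1 sends (s1, s2); slot 2 sends (-conj(s2), conj(s1)).
    """
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]


def alamouti_receive(slots, h1, h2):
    """Noiseless flat channel (an assumption): one received sample per slot."""
    return [h1 * a + h2 * b for a, b in slots]


def alamouti_combine(r1, r2, h1, h2):
    """Linear combining that separates s1 and s2 when h1, h2 are known."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat


# Hypothetical symbols and channel gains, chosen only for the example.
s1, s2 = 1 + 1j, -1 + 1j
h1, h2 = 0.8 + 0.3j, 0.2 - 0.5j
r1, r2 = alamouti_receive(alamouti_encode(s1, s2), h1, h2)
s1_hat, s2_hat = alamouti_combine(r1, r2, h1, h2)
```

In this idealized setting the combiner recovers both symbols exactly; with noise, each estimate would carry a noise term scaled by the combined channel gain.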
Fault Mitigation Schemes for Future Spaceflight Multicore Processors

NASA Technical Reports Server (NTRS)

Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.

2012-01-01

Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
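Of the software methods listed in this abstract, Triple Modular Redundancy is the simplest to sketch. The example below is a minimal illustration, assuming replica outputs are hashable values and that replicas run sequentially; the names `tmr_vote`, `run_with_tmr` and `fault_injector` are hypothetical and do not come from the paper or from any Tilera API.

```python
from collections import Counter


def tmr_vote(replica_outputs):
    """Return the value at least two of three replicas agree on."""
    if len(replica_outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def run_with_tmr(task, argument, fault_injector=None):
    """Run the task three times and vote on the results.

    fault_injector(replica_id, result) is a hypothetical hook that may
    corrupt one replica's result to simulate a fault.
    """
    outputs = []
    for replica_id in range(3):
        result = task(argument)
        if fault_injector is not None:
            result = fault_injector(replica_id, result)
        outputs.append(result)
    return tmr_vote(outputs)


# Replica 1 is corrupted; the vote still returns the correct square.
voted = run_with_tmr(lambda x: x * x, 7,
                     lambda rid, res: res + 1 if rid == 1 else res)
```

On a real multicore target each replica would run on a separate core; the voting logic stays the same.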
Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Su, Chun-Yi

By 2004, microprocessor design focused on multicore scaling (increasing the number of cores per die in each generation) as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems.
I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrent throttling and a novel thread mapping algorithm to manage thread resources and improve energy-efficient execution in multi-core, NUMA systems.
On the Performance of an Algebraic Multigrid Solver on Multicore Clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Baker, A H; Schulz, M; Yang, U M

2010-04-29

Algebraic multigrid (AMG) solvers have proven to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore cluster architectures, we face new challenges that can significantly harm AMG's performance. We discuss our experiences on such an architecture and present a set of techniques that help users to overcome the associated problems, including thread and process pinning and correct memory associations. We have implemented most of the techniques in a MultiCore SUPport library (MCSup), which helps to map OpenMP applications to multicore machines. We present results using both an MPI-only and a hybrid MPI/OpenMP model.
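Process pinning, one of the techniques this abstract mentions, can be illustrated with the CPU-affinity calls in Python's standard library. This is only a sketch of the general idea on Linux and does not reflect how MCSup itself works; the helper name `pin_current_process` is hypothetical.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids.

    Returns the resulting affinity set, or None on platforms without
    os.sched_setaffinity (it is available on Linux only).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs we may currently run on
    target = set(cpus) & allowed        # ignore ids we cannot use
    if not target:
        return allowed                  # nothing usable: leave unchanged
    os.sched_setaffinity(0, target)     # pid 0 means the calling process
    return os.sched_getaffinity(0)


# Requesting a superset of all CPUs leaves the affinity unchanged.
affinity = pin_current_process(range(4096))
```

For OpenMP or MPI codes, pinning is more commonly set through the runtime or job launcher than through direct system calls.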
Multi-core processing and scheduling performance in CMS

NASA Astrophysics Data System (ADS)

Hernández, J. M.; Evans, D.; Foulkes, S.

2012-12-01

Commodity hardware is going many-core. We might soon not be able to satisfy the job memory needs per core in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware not sharing resources might significantly affect processing performance. It will be essential to effectively utilize the multi-core architecture. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such as the code libraries, detector geometry and conditions data, resulting in a much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model in computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-core aware jobs require the scheduling of multiple cores simultaneously. CMS is exploring the approach of using whole nodes as the unit in the workload management system, where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of the data/workflow management (e.g. I/O caching, local merging) but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been set up at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present an evaluation of the performance of scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.
Using a Multicore Processor for Rover Autonomous Science

NASA Technical Reports Server (NTRS)

Bornstein, Benjamin; Estlin, Tara; Clement, Bradley; Springer, Paul

2011-01-01

Multicore processing promises to be a critical component of future spacecraft. It provides immense increases in onboard processing power and provides an environment for directly supporting fault-tolerant computing. This paper discusses using a state-of-the-art multicore processor to efficiently perform image analysis onboard a Mars rover in support of autonomous science activities.
Efficient provisioning for multi-core applications with LSF

NASA Astrophysics Data System (ADS)

Dal Pra, Stefano

2015-12-01

Tier-1 sites providing computing power for HEP experiments are usually tightly designed for high-throughput performance. This is pursued by reducing the variety of supported use cases and tuning those for performance, the most important of which has been singlecore jobs. Moreover, the usual workload is saturation: each available core in the farm is in use and there are queued jobs waiting for their turn to run. Enabling multi-core jobs thus requires dedicating a number of hosts on which to run them, and waiting for those hosts to free the needed number of cores. This drain time introduces a loss of computing power driven by the number of unusable empty cores. As an increasing demand for multi-core capable resources has emerged, a Task Force has been constituted in WLCG, with the goal of defining a simple and efficient multi-core resource provisioning model. This paper details the work done at the INFN Tier-1 to enable multi-core support for the LSF batch system, with the intent of minimizing the average number of unused cores. The adopted strategy has been to dedicate to multi-core a dynamic set of nodes, whose size is mainly driven by the number of pending multi-core requests and the fair-share priority of the submitting user. The node status transition, from single-core to multi-core and vice versa, is driven by a finite state machine implemented in a custom multi-core director script running in the cluster. After describing and motivating both the implementation and the details specific to the LSF batch system, results about performance are reported. Factors having positive and negative impact on the overall efficiency are discussed, and solutions to minimize the negative ones are proposed.
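The finite state machine described in this abstract might look roughly like the sketch below. The abstract does not give the actual states or transition rules of the INFN director script, so the three states, the `next_state` function and its parameters are all hypothetical, chosen only to show the drain-then-switch idea.

```python
from enum import Enum


class NodeState(Enum):
    SINGLECORE = "singlecore"   # node accepts single-core jobs
    DRAINING = "draining"       # no new single-core jobs; waiting for free cores
    MULTICORE = "multicore"     # node accepts multi-core jobs


def next_state(state, pending_multicore, free_cores, cores_needed,
               running_multicore):
    """Return the next state of one node (hypothetical transition rules)."""
    if state is NodeState.SINGLECORE:
        # Start draining only when multi-core work is actually waiting.
        return NodeState.DRAINING if pending_multicore > 0 else state
    if state is NodeState.DRAINING:
        if pending_multicore == 0:
            return NodeState.SINGLECORE   # demand vanished: stop draining
        if free_cores >= cores_needed:
            return NodeState.MULTICORE    # enough cores have been freed
        return state
    # MULTICORE: hand the node back once it is idle and nothing is pending.
    if pending_multicore == 0 and running_multicore == 0:
        return NodeState.SINGLECORE
    return state


# A node with 8 pending multi-core requests starts draining.
state = next_state(NodeState.SINGLECORE, pending_multicore=8,
                   free_cores=0, cores_needed=8, running_multicore=0)
```

The draining state is where the loss of computing power mentioned in the abstract arises, since cores sit empty until enough of them are free.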
Multi-core processing and scheduling performance in CMS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hernandez, J. M.; Evans, D.; Foulkes, S.

2012-01-01

Commodity hardware is going many-core. We might soon not be able to satisfy the job memory needs per core in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware not sharing resources might significantly affect processing performance. It will be essential to effectively utilize the multi-core architecture. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such as the code libraries, detector geometry and conditions data, resulting in a much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model in computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-core aware jobs require the scheduling of multiple cores simultaneously. CMS is exploring the approach of using whole nodes as the unit in the workload management system, where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of the data/workflow management (e.g. I/O caching, local merging) but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been set up at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present an evaluation of the performance of scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.
Scheduling multicore workload on shared multipurpose clusters

NASA Astrophysics Data System (ADS)

Templon, J. A.; Acosta-Silva, C.; Flix Molina, J.; Forti, A. C.; Pérez-Calero Yzquierdo, A.; Starink, R.

2015-12-01

With the advent of workloads containing explicit requests for multiple cores in a single grid job, grid sites faced a new set of challenges in workload scheduling. The most common batch schedulers deployed at HEP computing sites do a poor job at multicore scheduling when using only the native capabilities of those schedulers. This paper describes how efficient multicore scheduling was achieved at the sites the authors represent, by implementing dynamically-sized multicore partitions via a minimalistic addition to the Torque/Maui batch system already in use at those sites. The paper further includes example results from use of the system in production, as well as measurements on the dependence of performance (especially the ramp-up in throughput for multicore jobs) on node size and job size.
Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

PubMed Central

Kim, Deokho; Park, Karam; Ro, Won W.

2011-01-01

While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores are expected to be used as processing nodes of wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for heterogeneous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors, the Cell Broadband Engine. PMID: 22164053
Evolution of CMS workload management towards multicore job support

NASA Astrophysics Data System (ADS)

Pérez-Calero Yzquierdo, A.; Hernández, J. M.; Khan, F. A.; Letts, J.; Majewski, K.; Rodrigues, A. M.; McCrea, A.; Vaandering, E.

2015-12-01

The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting singlecore processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 for the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.
Evolution of CMS Workload Management Towards Multicore Job Support

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perez-Calero Yzquierdo, A.; Hernández, J. M.; Khan, F. A.

The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting singlecore processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 for the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.
Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

NASA Astrophysics Data System (ADS)

Hayashi, Akihiro; Wada, Yasutaka; Watanabe, Takeshi; Sekiguchi, Takeshi; Mase, Masayoshi; Shirako, Jun; Kimura, Keiji; Kasahara, Hironori

Heterogeneous multicores have been attracting much attention as a way to attain high performance while keeping power consumption low in a wide range of areas. However, heterogeneous multicores impose very difficult programming on programmers, and the long application development period lowers product competitiveness. To overcome this situation, this paper proposes a compilation framework that bridges the gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on the OSCAR compiler. It realizes coarse-grain task parallel processing, data transfer using a DMA controller, and power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates the processing performance and the power reduction achieved by the proposed framework on a newly developed 15-core heterogeneous multicore chip named RP-X, which integrates 8 general purpose processor cores and 3 types of accelerator cores and was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups of up to 32x for an optical flow program with eight general purpose processor cores and four DRP (Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core, and an 80% power reduction for real-time AAC encoding.
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Oliker, Leonid; Vuduc, Richard

2007-01-01

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
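For readers unfamiliar with the SpMV kernel examined in this record, a plain serial baseline is sketched below. It assumes the compressed sparse row (CSR) layout, a common choice that the abstract itself does not name, and it includes none of the multicore optimizations the paper studies. The function name `spmv_csr` and the example matrix are illustrative only.

```python
def spmv_csr(values, col_indices, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values      -- nonzero entries, row by row
    col_indices -- column index of each nonzero
    row_ptr     -- row_ptr[i]:row_ptr[i + 1] spans the nonzeros of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access to x is what makes SpMV memory-bound.
            acc += values[k] * x[col_indices[k]]
        y[row] = acc
    return y


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_indices = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_indices, row_ptr, [1.0, 2.0, 3.0])  # [7.0, 4.0, 18.0]
```

Because rows are independent, the outer loop is the natural place to split work across cores.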
Compiler-Driven Performance Optimization and Tuning for Multicore Architectures

DTIC Science & Technology

2015-04-10

develop a powerful system for auto-tuning of library routines and compute-intensive kernels, driven by the Pluto system for multicores that we are...kernels, driven by the Pluto system for multicores that we are developing. The work here is motivated by recent advances in two major areas of...automatic C-to-CUDA code generator using a polyhedral compiler transformation framework. We have used and adapted PLUTO (our state-of-the-art tool
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Oliker, Leonid; Vuduc, Richard

2008-10-16

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

PERI - Auto-tuning Memory Intensive Kernels for Multicore

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bailey, David H; Williams, Samuel; Datta, Kaushik

2008-06-24

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4X improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.
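The Roofline model mentioned in this abstract bounds attainable performance by the lesser of peak compute throughput and memory bandwidth times arithmetic intensity. A minimal sketch of that bound follows. The function name and the machine numbers are hypothetical placeholders, not measurements from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Upper bound on attainable GFLOP/s under the basic Roofline model.

    arithmetic_intensity is in floating-point operations per byte moved
    to or from main memory.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine: 75 GFLOP/s peak, 20 GB/s memory bandwidth.
memory_bound = roofline_gflops(75.0, 20.0, 0.25)    # low-intensity kernel
compute_bound = roofline_gflops(75.0, 20.0, 10.0)   # high-intensity kernel
```

A kernel whose bound comes from the bandwidth term gains from reducing memory traffic, while one limited by the peak term gains from better use of the arithmetic units.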
Real-time implementations of image segmentation algorithms on shared memory multicore architecture: a survey (Conference Presentation)

NASA Astrophysics Data System (ADS)

Akil, Mohamed

2017-05-01

Real-time processing is becoming more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks in image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool, and it is a very data-intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper surveys approaches for the parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processors with shared memory. To achieve an efficient parallel implementation, it is necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we compare various parallelizations of sequential watershed algorithms on shared memory multicore architectures. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare OpenMP (an application programming interface for multiprocessing) with Pthreads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Results of SEI Independent Research and Development Projects

DTIC Science & Technology

2009-12-01

Achieving Predictable Performance in Multicore Embedded Real-Time Systems Dionisio de Niz, Jeffrey Hansen, Gabriel Moreno, Daniel Plakosh, Jorgen Hanson...Description Languages.” Fourth Congress on Embedded Real-Time Systems (ERTS), January 2008. [Hansson 2008b] J. Hansson, P. H. Feiler, & J. Morley...Predictable Performance in Multicore Embedded Real-Time Systems Dionisio de Niz, Jeffrey Hansen, Gabriel Moreno, Daniel Plakosh, Jorgen Hanson, Mark
Multicore Challenges and Benefits for High Performance Scientific Computing

DOE PAGES

Nielsen, Ida M. B.; Janssen, Curtis L.

2008-01-01

Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

DTIC Science & Technology

2009-09-01

TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes... 3. INFORMATION MANAGEMENT FOR PARALLELIZATION AND...STREAMING... 4. RESULTS
Using Multi-Core Systems for Rover Autonomy

NASA Technical Reports Server (NTRS)

Clement, Brad; Estlin, Tara; Bornstein, Benjamin; Springer, Paul; Anderson, Robert C.

2010-01-01

Task objectives are: (1) develop and demonstrate key capabilities for rover long-range science operations using multi-core computing: (a) adapt three rover technologies to execute on an SOA multi-core processor, (b) illustrate the performance improvements achieved, (c) demonstrate the adapted capabilities with rover hardware; (2) target three high-level autonomy technologies: (a) two for onboard data analysis, (b) one for onboard command sequencing/planning; (3) these technologies are identified as enabling for future missions; (4) benefits will be measured along several metrics: (a) execution time / power requirements, (b) number of data products processed per unit time, (c) solution quality.
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
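The basic Roofline model mentioned in this abstract bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The following is a minimal sketch of that bound only; the function name and the machine numbers are hypothetical and are not taken from the paper.

```python
def roofline_gflops(peak_gflops: float,
                    bandwidth_gbs: float,
                    intensity_flops_per_byte: float) -> float:
    """Attainable performance (GFLOP/s) under the basic Roofline model.

    A kernel is limited either by the peak compute rate or by how fast
    memory can feed it: bandwidth (GB/s) times arithmetic intensity
    (FLOPs per byte moved).
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
memory_bound = roofline_gflops(100.0, 50.0, 0.5)   # limited by bandwidth
compute_bound = roofline_gflops(100.0, 50.0, 4.0)  # limited by peak compute
print(memory_bound, compute_bound)
```

Comparing a measured kernel rate against this bound indicates how much headroom remains on a given architecture, which is the sense in which results are correlated with the model.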
Evaluation of SuperLU on multicore architectures

NASA Astrophysics Data System (ADS)

Li, X. S.

2008-07-01

The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate performance of different high-performance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We included both Pthreads and MPI implementations in this study and found that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior.
Neural simulations on multi-core architectures.

PubMed

Eichner, Hubert; Klug, Tobias; Borst, Alexander

2009-01-01

Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing.
Neural Simulations on Multi-Core Architectures

PubMed Central

Eichner, Hubert; Klug, Tobias; Borst, Alexander

2009-01-01

Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing. PMID:19636393
Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Carter, Jonathan; Oliker, Leonid

2008-02-01

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.
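Search-based auto-tuning, as described in this abstract, amounts to generating several variants of a kernel, timing each on the target machine, and keeping the fastest. The sketch below illustrates only that search loop under simplifying assumptions: the toy kernel, the block-size parameter, and all function names are hypothetical and are unrelated to the authors' LBMHD code generator.

```python
import time


def blocked_sum_of_squares(data, block):
    """Toy kernel: process `data` in chunks of `block` elements."""
    total = 0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        total += sum(x * x for x in chunk)
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time each candidate parameter; return (fastest parameter, timings)."""
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, param)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - t0)
        timings[param] = best
    best_param = min(timings, key=timings.get)
    return best_param, timings


data = list(range(20_000))
best_block, timings = autotune(blocked_sum_of_squares, data, [16, 256, 4096])
print("selected block size:", best_block)
```

The selected value depends on the machine the search runs on, which is the point of the approach: the same search is repeated per platform instead of hand-tuning each one.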
Lattice Boltzmann simulation optimization on leading multicore platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, S.; Carter, J.; Oliker, L.

2008-01-01

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.
VLBI-resolution radio-map algorithms: Performance analysis of different levels of data-sharing on multi-socket, multi-core architectures

NASA Astrophysics Data System (ADS)

Tabik, S.; Romero, L. F.; Mimica, P.; Plata, O.; Zapata, E. L.

2012-09-01

A broad area in astronomy focuses on simulating extragalactic objects based on Very Long Baseline Interferometry (VLBI) radio-maps. Several algorithms in this scope simulate what would be the observed radio-maps if emitted from a predefined extragalactic object. This work analyzes the performance and scaling of this kind of algorithm on multi-socket, multi-core architectures. In particular, we evaluate a sharing approach, a privatizing approach and a hybrid approach on systems with a complex memory hierarchy that includes a shared Last Level Cache (LLC). In addition, we investigate which manual processes can be systematized and then automated in future work. The experiments show that the data-privatizing model scales efficiently on medium scale multi-socket, multi-core systems (up to 48 cores) while, regardless of algorithmic and scheduling optimizations, the sharing approach is unable to reach acceptable scalability on more than one socket. However, the hybrid model with a specific level of data-sharing provides the best scalability over all used multi-socket, multi-core systems.
Interactive high-resolution isosurface ray casting on multicore processors.

PubMed

Wang, Qin; JaJa, Joseph

2008-01-01

We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024² screen for all the datasets tested up to the maximum size of the main memory of our platform.
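The contrast drawn above between static screen partitioning and dynamic allocation of ray casting tasks can be sketched with a shared work queue: instead of assigning each thread a fixed region in advance, idle threads pull the next group of pixels as they finish. The code below is a generic illustration of that pattern only. The tile layout, the stand-in `render` callback, and all names are hypothetical; the paper's locality-preserving grouping is not modeled; and in CPython the threads show the scheduling pattern rather than a real speedup.

```python
import queue
import threading


def render_tiles_dynamic(tiles, render, num_threads=4):
    """Render tiles with dynamic task allocation.

    Worker threads repeatedly take the next tile from a shared queue,
    so a thread that gets cheap tiles simply processes more of them.
    """
    tasks = queue.Queue()
    for tile in tiles:
        tasks.put(tile)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                tile = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            value = render(tile)
            with lock:  # protect the shared result map
                results[tile] = value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical 4 x 4 grid of tiles; the "renderer" just returns a tile index.
tiles = [(x, y) for y in range(4) for x in range(4)]
image = render_tiles_dynamic(tiles, render=lambda tile: tile[0] + 4 * tile[1])
print(len(image), "tiles rendered")
```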
A high performance load balance strategy for real-time multicore systems.

PubMed

Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

2014-01-01

Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can reduce energy consumption by up to 54.2% and reduce the number of missed deadlines, compared with the other scheduling algorithms outlined in this paper.
A High Performance Load Balance Strategy for Real-Time Multicore Systems

PubMed Central

Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

2014-01-01

Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can reduce energy consumption by up to 54.2% and reduce the number of missed deadlines, compared with the other scheduling algorithms outlined in this paper. PMID:24955382
Efficiency of static core turn-off in a system-on-a-chip with variation

DOEpatents

Cher, Chen-Yong; Coteus, Paul W; Gara, Alan; Kursun, Eren; Paulsen, David P; Schuelke, Brian A; Sheets, II, John E; Tian, Shurong

2013-10-29

A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting via a simulation a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's design stage includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's testing stage includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine if the first output is referring to the same core to turn off as the second output; outputting a third output corresponding to the first multi-core processor core if the first output and the second output are both referring to the same core to turn off.
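The comparison step in this claim can be read as a small decision function: a design-stage analysis and a testing-stage analysis each nominate a core to turn off, and the nominated core is output when the two agree. The sketch below is an illustrative reading only. The function name and the integer core identifiers are hypothetical, and returning `None` on disagreement is an assumption, since the excerpt does not say what happens in that case.

```python
from typing import Optional


def confirm_core_to_turn_off(design_stage_core: int,
                             testing_stage_core: int) -> Optional[int]:
    """Return the core to turn off if both analyses nominate the same one.

    design_stage_core:  core chosen by the simulation-based design-stage analysis.
    testing_stage_core: core chosen by the testing-stage analysis.
    Returns None when they disagree (assumption: the excerpt leaves this open).
    """
    if design_stage_core == testing_stage_core:
        return design_stage_core
    return None


print(confirm_core_to_turn_off(3, 3))
print(confirm_core_to_turn_off(3, 5))
```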
Optimization of a Lattice Boltzmann Computation on State-of-the-Art Multicore Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Carter, Jonathan; Oliker, Leonid

2009-04-10

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15x improvement compared with the original code at a given concurrency. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.
Performance implications from sizing a VM on multi-core systems: A Data analytic application's view

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lim, Seung-Hwan; Horey, James L; Begoli, Edmon

In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications, such as Cassandra and Hadoop, are becoming increasingly popular on cloud computing platforms. This convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very first step in hosting applications in virtualized environments requires the user to configure the number of virtual processors and the size of memory. To understand performance implications of this step, we benchmarked three Yahoo Cloud Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra for YCSB workloads does not heavily depend on the processing capacity of a system, while the size of the data set is critical to performance relative to allocated memory. We also identified a strong relationship between the running time of workloads and various hardware events (last level cache loads, misses, and CPU migrations). From this analysis, we provide several suggestions to improve the performance of data analytics applications running on cloud computing environments.

CMS Readiness for Multi-Core Workload Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.

In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per-core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.
CMS readiness for multi-core workload scheduling

NASA Astrophysics Data System (ADS)

Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.; Aftab Khan, F.; Letts, J.; Mason, D.; Verguilov, V.

2017-10-01

In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per-core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.
OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

NASA Astrophysics Data System (ADS)

Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori

The OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in a METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with the OSCAR API can be compiled by ordinary OpenMP compilers, since the OSCAR API is designed on a subset of OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of the scalability evaluation, it is found that, on average, the OSCAR compiler with the OSCAR API achieves a 5.8 times speedup over sequential execution on the Power5+ workstation with eight cores and a 2.9 times speedup on RP2 with four cores. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.
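The speedups quoted above can be put on a common footing by converting them to parallel efficiency, defined as speedup divided by the number of cores. This is a standard derived metric and is not reported in the abstract itself; the function name below is hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


# Figures taken from the abstract: average speedup and core count.
power5_eff = parallel_efficiency(5.8, 8)  # Power5+ workstation, eight cores
rp2_eff = parallel_efficiency(2.9, 4)     # RP2, four cores
print(f"Power5+: {power5_eff:.3f}, RP2: {rp2_eff:.3f}")
```

Both platforms come out at roughly 0.725, that is, about 72.5% of ideal linear scaling on average.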
Electronic Structure Calculations and Adaptation Scheme in Multi-core Computing Environments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Seshagiri, Lakshminarasimhan; Sosonkina, Masha; Zhang, Zhao

2009-05-20

Multi-core processing environments have become the norm in the generic computing environment and are being considered for adding an extra dimension to the execution of any application. The T2 Niagara processor is a distinctive environment: it consists of eight cores, each capable of running eight threads simultaneously. Applications like the General Atomic and Molecular Electronic Structure System (GAMESS), used for ab-initio molecular quantum chemistry calculations, can be good indicators of the performance of such machines and would be a guideline for both hardware designers and application programmers. In this paper we try to benchmark the GAMESS performance on a T2 Niagara processor for a couple of molecules. We also show the suitability of using a middleware based adaptation algorithm on GAMESS on such a multi-core environment.
Exploiting multicore compute resources in the CMS experiment

NASA Astrophysics Data System (ADS)

Ramírez, J. E.; Pérez-Calero Yzquierdo, A.; Hernández, J. M.; CMS Collaboration

2016-10-01

CMS has developed a strategy to efficiently exploit the multicore architecture of the compute resources accessible to the experiment. A coherent use of the multiple cores available in a compute node yields substantial gains in terms of resource utilization. The implemented approach makes use of the multithreading support of the event processing framework and the multicore scheduling capabilities of the resource provisioning system. Multicore slots are acquired and provisioned by means of multicore pilot agents which internally schedule and execute single and multicore payloads. Multicore scheduling and multithreaded processing are currently used in production for online event selection and prompt data reconstruction. More workflows are being adapted to run in multicore mode. This paper presents a review of the experience gained in the deployment and operation of the multicore scheduling and processing system, the current status and future plans.
Multicore fiber beamforming network for broadband satellite communications

NASA Astrophysics Data System (ADS)

Zainullin, Airat; Vidal, Borja; Macho, Andres; Llorente, Roberto

2017-02-01

Multi-core fiber (MCF) has been one of the main innovations in fiber optics in the last decade. Reported work on MCF has been focused on increasing the transmission capacity of optical communication links by exploiting space-division multiplexing. Additionally, MCF presents a strong potential in optical beamforming networks. The use of MCF can increase the compactness of the broadband antenna array controller. This is of utmost importance in platforms where size and weight are critical parameters such as communications satellites and airplanes. Here, an optical beamforming architecture that exploits the space-division capacity of MCF to implement compact optical beamforming networks is proposed, being a new application field for MCF. The experimental demonstration of this system using a 4-core MCF that controls a four-element antenna array is reported. An analysis of the impact of MCF on the performance of antenna arrays is presented. The analysis indicates that the main limitation comes from the relatively high insertion loss in the MCF fan-in and fan-out devices, which leads to angle dependent losses which can be mitigated by using fixed optical attenuators or a photonic lantern to reduce MCF insertion loss. The crosstalk requirements are also experimentally evaluated for the proposed MCF-based architecture. The potential signal impairment in the beamforming network is analytically evaluated, being of special importance when MCF with a large number of cores is considered. Finally, the optimization of the proposed MCF-based beamforming network is addressed targeting the scalability to large arrays.
Reconfigurable SDM Switching Using Novel Silicon Photonic Integrated Circuit.

PubMed

Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld; Ye, Feihong; Asif, Rameez; Gross, Simon; Withford, Michael J; Galili, Michael; Morioka, Toshio; Oxenløwe, Leif Katsuo

2016-12-21

Space division multiplexing using multicore fibers is becoming a more and more promising technology. In a space-division multiplexing fiber network, the reconfigurable switch is one of the most critical components in network nodes. In this paper we demonstrate, for the first time, reconfigurable space-division multiplexing switching using a silicon photonic integrated circuit, which is fabricated on a novel silicon-on-insulator platform with a buried Al mirror. The silicon photonic integrated circuit is composed of a 7 × 7 switch and low loss grating coupler array based multicore fiber couplers. Thanks to the Al mirror, grating couplers with ultra-low coupling loss with optical multicore fibers are achieved. The lowest total insertion loss of the silicon integrated circuit is as low as 4.5 dB, with low crosstalk lower than -30 dB. Excellent performances in terms of low insertion loss and low crosstalk are obtained for the whole C-band. 1 Tb/s/core transmission over a 2-km 7-core fiber and space-division multiplexing switching is demonstrated successfully. Bit error rate performance below 10⁻⁹ is obtained for all spatial channels with low power penalty. The proposed design can be easily upgraded to a reconfigurable optical add/drop multiplexer capable of switching several multicore fibers.
Reconfigurable SDM Switching Using Novel Silicon Photonic Integrated Circuit

NASA Astrophysics Data System (ADS)

Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld; Ye, Feihong; Asif, Rameez; Gross, Simon; Withford, Michael J.; Galili, Michael; Morioka, Toshio; Oxenløwe, Leif Katsuo

2016-12-01

Space division multiplexing using multicore fibers is becoming a more and more promising technology. In a space-division multiplexing fiber network, the reconfigurable switch is one of the most critical components in network nodes. In this paper we demonstrate, for the first time, reconfigurable space-division multiplexing switching using a silicon photonic integrated circuit, which is fabricated on a novel silicon-on-insulator platform with a buried Al mirror. The silicon photonic integrated circuit is composed of a 7 × 7 switch and low loss grating coupler array based multicore fiber couplers. Thanks to the Al mirror, grating couplers with ultra-low coupling loss with optical multicore fibers are achieved. The lowest total insertion loss of the silicon integrated circuit is as low as 4.5 dB, with low crosstalk lower than -30 dB. Excellent performances in terms of low insertion loss and low crosstalk are obtained for the whole C-band. 1 Tb/s/core transmission over a 2-km 7-core fiber and space-division multiplexing switching is demonstrated successfully. Bit error rate performance below 10⁻⁹ is obtained for all spatial channels with low power penalty. The proposed design can be easily upgraded to a reconfigurable optical add/drop multiplexer capable of switching several multicore fibers.
Progress Towards a Rad-Hydro Code for Modern Computing Architectures LA-UR-10-02825

NASA Astrophysics Data System (ADS)

Wohlbier, J. G.; Lowrie, R. B.; Bergen, B.; Calef, M.

2010-11-01

We are entering an era of high performance computing where data movement is the overwhelming bottleneck to scalable performance, as opposed to the speed of floating-point operations per processor. All multi-core hardware paradigms, whether heterogeneous or homogeneous, be it the Cell processor, GPGPU, or multi-core x86, share this common trait. In multi-physics applications such as inertial confinement fusion or astrophysics, one may be solving multi-material hydrodynamics with tabular equation of state data lookups, radiation transport, nuclear reactions, and charged particle transport in a single time cycle. The algorithms are intensely data dependent, e.g., EOS, opacity, nuclear data, and multi-core hardware memory restrictions are forcing code developers to rethink code and algorithm design. For the past two years LANL has been funding a small effort referred to as Multi-Physics on Multi-Core to explore ideas for code design as pertaining to inertial confinement fusion and astrophysics applications. The near term goals of this project are to have a multi-material radiation hydrodynamics capability, with tabular equation of state lookups, on cartesian and curvilinear block structured meshes. In the longer term we plan to add fully implicit multi-group radiation diffusion and material heat conduction, and block structured AMR. We will report on our progress to date.
Reconfigurable SDM Switching Using Novel Silicon Photonic Integrated Circuit

PubMed Central

Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld; Ye, Feihong; Asif, Rameez; Gross, Simon; Withford, Michael J.; Galili, Michael; Morioka, Toshio; Oxenløwe, Leif Katsuo

2016-01-01

Space-division multiplexing using multicore fibers is becoming an increasingly promising technology. In a space-division multiplexing fiber network, the reconfigurable switch is one of the most critical components in network nodes. In this paper we demonstrate, for the first time, reconfigurable space-division multiplexing switching using a silicon photonic integrated circuit, which is fabricated on a novel silicon-on-insulator platform with a buried Al mirror. The silicon photonic integrated circuit is composed of a 7 × 7 switch and low-loss multicore fiber couplers based on grating coupler arrays. Thanks to the Al mirror, grating couplers with ultra-low coupling loss to optical multicore fibers are achieved. The lowest total insertion loss of the silicon integrated circuit is 4.5 dB, with crosstalk lower than −30 dB. Low insertion loss and low crosstalk are obtained across the whole C-band. 1 Tb/s/core transmission over a 2-km 7-core fiber and space-division multiplexing switching are demonstrated successfully. Bit error rate performance below 10^-9 is obtained for all spatial channels with low power penalty. The proposed design can be easily upgraded to a reconfigurable optical add/drop multiplexer capable of switching several multicore fibers. PMID:28000735
CQPSO scheduling algorithm for heterogeneous multi-core DAG task model

NASA Astrophysics Data System (ADS)

Zhai, Wenzheng; Hu, Yue-Li; Ran, Feng

2017-07-01

Efficient task scheduling is critical to achieving high performance in a heterogeneous multi-core computing environment. The paper focuses on the heterogeneous multi-core directed acyclic graph (DAG) task model and proposes a novel task scheduling method based on an improved chaotic quantum-behaved particle swarm optimization (CQPSO) algorithm. A task priority scheduling list is built, and the processor with the minimum cumulative earliest finish time (EFT) is selected as the target of the first task assignment. The task precedence relationships are satisfied and the total execution time of all tasks is minimized. The experimental results show that the proposed algorithm optimizes well, is simple and feasible, converges quickly, and can be applied to task scheduling optimization in other heterogeneous and distributed environments.
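The earliest-finish-time rule mentioned in the abstract can be illustrated with a small sketch. This is plain greedy list scheduling, not the CQPSO algorithm itself; the function name `eft_schedule`, the input layout, and the decision to ignore communication costs are all assumptions made for illustration.

```python
def eft_schedule(order, preds, cost):
    """Greedy list scheduling: put each task on the core where it finishes earliest.

    order: task names in a precedence-respecting priority order (assumed given)
    preds: task -> list of predecessor tasks
    cost:  task -> list of execution times, one entry per core (heterogeneous)
    Communication costs between cores are ignored in this sketch.
    """
    n_cores = len(next(iter(cost.values())))
    core_free = [0.0] * n_cores   # time at which each core becomes idle
    finish = {}                   # task -> finish time
    placement = {}                # task -> chosen core
    for task in order:
        # A task may start only after all of its predecessors have finished.
        ready = max((finish[p] for p in preds.get(task, [])), default=0.0)
        best_core, best_finish = None, float("inf")
        for core in range(n_cores):
            start = max(ready, core_free[core])
            end = start + cost[task][core]
            if end < best_finish:
                best_core, best_finish = core, end
        placement[task] = best_core
        finish[task] = best_finish
        core_free[best_core] = best_finish
    return placement, max(finish.values())


# Hypothetical four-task diamond DAG on two cores with different speeds.
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
cost = {"A": [2, 3], "B": [3, 1], "C": [2, 4], "D": [1, 2]}
placement, makespan = eft_schedule(["A", "B", "C", "D"], preds, cost)
```

A metaheuristic such as CQPSO would search over priority orders or assignments; the greedy rule above only evaluates one given order.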
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

PubMed

Furuta, Takuya; Sato, Tatsuhiko

2015-01-01

Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

NASA Astrophysics Data System (ADS)

Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

2017-11-01

The current trend in processor manufacturing focuses on multi-core architectures rather than on increasing clock speed to improve performance. Graphics processors have become commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, and big data have created huge demand for data processing, and such throughput-intensive applications inherently contain data-level parallelism, which is better suited to SIMD-based GPU architectures. This paper reviews the architectural aspects of multi/many-core processors and graphics processors. Different case studies are used to compare the performance of throughput computing applications using shared-memory programming in OpenMP and CUDA API-based programming.
Parameters that affect parallel processing for computational electromagnetic simulation codes on high performance computing clusters

NASA Astrophysics Data System (ADS)

Moon, Hongsik

What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power in the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared using benchmark software, and the metric was FLoating-point Operations Per Second (FLOPS), which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced.
This code was developed to address the needs of the changing computer hardware platforms in order to provide fast, accurate and efficient solutions to large, complex electromagnetic problems. The research in this dissertation proves that the performance of parallel code is intimately related to the configuration of the computer hardware and can be maximized for different hardware platforms. To benchmark and optimize the performance of parallel CEM software, a variety of large, complex projects are created and executed on a variety of computer platforms. The computer platforms used in this research are detailed in this dissertation. The projects run as benchmarks are also described in detail and results are presented. The parameters that affect parallel CEM software on High Performance Computing Clusters (HPCC) are investigated. This research demonstrates methods to maximize the performance of parallel CEM software code.
Aho-Corasick String Matching on Shared and Distributed Memory Parallel Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications string matching requires a combination of (sometimes all) the following characteristics: high and/or predictable performance, support for large data sets and flexibility of integration and customization. Many software based implementations targeting conventional cache-based microprocessors fail to achieve high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize. The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs and Cray XMT), distributed memory with homogeneous processing elements (InfiniBand cluster of x86 multicores) and heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C10 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.
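For readers unfamiliar with the algorithm being compared, the following is a minimal serial sketch of Aho-Corasick in Python. It is a textbook formulation, not any of the parallel implementations evaluated in the paper; the names `build_automaton` and `search` and the dictionary-based trie layout are illustrative choices.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and output lists (out)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognized on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal; depth-1 states keep the root as failure link.
    pending = deque(goto[0].values())
    while pending:
        state = pending.popleft()
        for ch, nxt in goto[state].items():
            pending.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every dictionary match in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


hits = search("ushers", build_automaton(["he", "she", "his", "hers"]))
```

The text is scanned once regardless of dictionary size, which is why the automaton's memory footprint and access pattern, rather than arithmetic, tend to dominate performance on the architectures the abstract lists.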
A Bandwidth-Optimized Multi-Core Architecture for Irregular Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Secchi, Simone; Tumeo, Antonino; Villa, Oreste

This paper presents an architecture template for next-generation high performance computing systems specifically targeted to irregular applications. We start our work by considering that future generation interconnection and memory bandwidth full-system numbers are expected to grow by a factor of 10. In order to keep up with such a communication capacity, while still resorting to fine-grained multithreading as the main way to tolerate unpredictable memory access latencies of irregular applications, we show how overall performance scaling can benefit from the multi-core paradigm. At the same time, we also show how such an architecture template must be coupled with specific techniques in order to optimize bandwidth utilization and achieve the maximum scalability. We propose a technique based on memory references aggregation, together with the related hardware implementation, as one of such optimization techniques. We explore the proposed architecture template by focusing on the Cray XMT architecture and, using a dedicated simulation infrastructure, validate the performance of our template with two typical irregular applications. Our experimental results prove the benefits provided by both the multi-core approach and the bandwidth optimization reference aggregation technique.
The design of multi-core DSP parallel model based on message passing and multi-level pipeline

NASA Astrophysics Data System (ADS)

Niu, Jingyu; Hu, Jian; He, Wenjing; Meng, Fanrong; Li, Chuanrong

2017-10-01

Currently, the design of embedded signal processing system is often based on a specific application, but this idea is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on multi-core DSP platform is designed, and it is mainly suitable for the complex algorithms which are composed of different modules. This model combines the ideas of multi-level pipeline parallelism and message passing, and summarizes the advantages of the mainstream model of multi-core DSP (the Master-Slave model and the Data Flow model), so that it has better performance. This paper uses three-dimensional image generation algorithm to validate the efficiency of the proposed model by comparing with the effectiveness of the Master-Slave and the Data Flow model.
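The combination of pipeline stages connected by message passing can be sketched on a general-purpose machine as follows. This is only an analogy using Python threads and queues, not the DSP model from the paper; `run_pipeline`, the `None` end-of-stream marker, and the two example stages are assumptions for illustration.

```python
import queue
import threading


def _stage(func, inbox, outbox):
    """Run one pipeline stage: receive a message, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is None:          # end-of-stream marker (so None cannot be a payload)
            outbox.put(None)
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Push items through the stages in funcs; each stage runs in its own thread."""
    links = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=_stage, args=(f, links[i], links[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in items:
        links[0].put(item)
    links[0].put(None)
    results = []
    while True:
        r = links[-1].get()
        if r is None:
            break
        results.append(r)
    for w in workers:
        w.join()
    return results


# Two hypothetical stages: add an offset, then apply a gain.
results = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
```

Because each stage is a single consumer of a FIFO queue, results leave the pipeline in the order the inputs entered it.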
Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shadid, John Nicolas; Lin, Paul Tinphone

2009-01-01

This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling and multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.
An Energy-Aware Runtime Management of Multi-Core Sensory Swarms.

PubMed

Kim, Sungchan; Yang, Hoeseok

2017-08-24

In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today's sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45% compared to the state-of-the-art multi-core energy management technique.
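The idea of choosing a clock frequency from an IPC-based time estimate and a power equation can be sketched as below. This is a simplified illustration, not the authors' model: the cubic dynamic-power term, the constant IPC across frequencies, and every name and coefficient (`pick_frequency`, `dyn_coeff`, `static_w`) are assumptions.

```python
def pick_frequency(instructions, ipc, freqs_hz, deadline_s, dyn_coeff, static_w):
    """Return (frequency, estimated_energy) minimizing energy within the deadline.

    Time is estimated as instructions / (IPC * f). Power is assumed to follow
    dyn_coeff * f**3 + static_w, a common textbook approximation rather than
    the empirically derived equation used in the paper.
    Returns None when no frequency meets the deadline.
    """
    best = None
    for f in freqs_hz:
        time_s = instructions / (ipc * f)
        if time_s > deadline_s:
            continue                      # performance constraint violated
        power_w = dyn_coeff * f ** 3 + static_w
        energy_j = power_w * time_s
        if best is None or energy_j < best[1]:
            best = (f, energy_j)
    return best


# Hypothetical workload: 2e9 instructions, measured IPC of 1.0, two frequency levels.
relaxed = pick_frequency(2e9, 1.0, [1e9, 2e9], 3.0, 1e-27, 0.5)
tight = pick_frequency(2e9, 1.0, [1e9, 2e9], 1.5, 1e-27, 0.5)
```

With the relaxed deadline the lower frequency wins on energy; with the tight deadline only the higher frequency is feasible. A fuller model would also re-estimate IPC per frequency for memory-bound workloads and consider the core-allocation knob.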

An Energy-Aware Runtime Management of Multi-Core Sensory Swarms

PubMed Central

Kim, Sungchan

2017-01-01

In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today’s sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45% compared to the state-of-the-art multi-core energy management technique. PMID:28837094
Toward performance portability of the Albany finite element analysis code using the Kokkos library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA graphics processing units (GPUs), Intel Xeon Phis, and multicore CPUs.
Toward performance portability of the Albany finite element analysis code using the Kokkos library

DOE PAGES

Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.; ...

2018-02-05

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA graphics processing units (GPUs), Intel Xeon Phis, and multicore CPUs.
Second generation OH suppression filters using multicore fibers

NASA Astrophysics Data System (ADS)

Haynes, R.; Birks, T. A.; Bland-Hawthorn, J.; Cruz, J. L.; Diez, A.; Ellis, S. C.; Haynes, D.; Krämer, R. G.; Mangan, B. J.; Min, S.; Murphy, D. F.; Nolte, S.; Olaya, J. C.; Thomas, J. U.; Trinh, C. Q.; Tünnermann, A.; Voigtländer, Christian

2012-09-01

Ground based near-infrared observations have long been plagued by poor sensitivity when compared to visible observations as a result of the bright narrow line emission from atmospheric OH molecules. The GNOSIS instrument recently commissioned at the Australian Astronomical Observatory uses Photonic Lanterns in combination with individually printed single mode fibre Bragg gratings to filter out the brightest OH-emission lines between 1.47 and 1.70μm. GNOSIS, reported in a separate paper in this conference, demonstrates excellent OH-suppression, providing very “clean” filtering of the lines. It represents a major step forward in the goal to improve the sensitivity of ground based near-infrared observation to that possible at visible wavelengths, however, the filter units are relatively bulky and costly to produce. The 2nd generation fibre OH-Suppression filters based on multicore fibres are currently under development. The development aims to produce high quality, cost effective, compact and robust OH-Suppression units in a single optical fibre with numerous isolated single mode cores that replicate the function and performance of the current generation of “conventional” photonic lantern based devices. In this paper we present the early results from the multicore fibre development and multicore fibre Bragg grating imprinting process.
All-fiber intensity bend sensor based on photonic crystal fiber with asymmetric air-hole structure

NASA Astrophysics Data System (ADS)

Budnicki, Dawid; Szostkiewicz, Lukasz; Szymanski, Michal O.; Ostrowski, Lukasz; Holdynski, Zbigniew; Lipinski, Stanislaw; Murawski, Michal; Wojcik, Grzegorz; Makara, Mariusz; Poturaj, Krzysztof; Mergo, Pawel; Napierala, Marek; Nasilowski, Tomasz

2017-10-01

Monitoring the geometry of a moving element is a crucial task, for example in robotics. Robots equipped with fiber bend sensors integrated in their arms can be a promising solution for medicine, physiotherapy and also for applications in computer games. We report an all-fiber intensity bend sensor based on a microstructured multicore optical fiber. It allows measurement of the bending radius as well as the bending orientation. The reported solution has a special air-hole structure which makes the sensor sensitive only to bending. Our solution is an intensity-based sensor, which measures the power transmitted along the fiber as influenced by the bend. The sensor is based on a multicore fiber with a special air-hole structure that allows detection of the bending orientation over a range of 360°. Each core in the multicore fiber is sensitive to bending in a specified direction. The principle behind the sensor operation is to differentiate the confinement loss of the fundamental mode propagating in each core. Thanks to the received power differences one can distinguish not only the bend direction but also its amplitude. The multicore fiber is designed to use the most common light sources, which operate at 1.55 μm, thus ensuring high stability of operation. The sensitivity of the proposed solution is 29.4 dB/cm and the accuracy of the bend direction for the fiber end point is up to 5 degrees for a 15 cm fiber length. Such sensitivity allows end-point detection with millimeter precision.
Multicore Considerations for Legacy Flight Software Migration

NASA Technical Reports Server (NTRS)

Vines, Kenneth; Day, Len

2013-01-01

In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gosink, Luke; Wu, Kesheng; Bethel, E. Wes

2009-06-02

The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).
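The general shape of a bin-based index for range queries can be sketched serially as follows. This is an illustration of binning plus per-bin data clusters only, not the DP-BIS data layout or its data encoding; the function names, the half-open bin convention, and the example values are assumptions.

```python
import bisect


def build_bins(values, edges):
    """Group (record_id, value) pairs by bin.

    edges are sorted boundaries; bin k holds values v with
    edges[k-1] <= v < edges[k], with open-ended first and last bins.
    """
    bins = [[] for _ in range(len(edges) + 1)]
    for rid, v in enumerate(values):
        bins[bisect.bisect_right(edges, v)].append((rid, v))
    return bins


def range_query(bins, edges, lo, hi):
    """Return sorted record ids whose value satisfies lo <= value < hi."""
    bounds = [float("-inf")] + list(edges) + [float("inf")]
    hits = []
    for k, cluster in enumerate(bins):
        left, right = bounds[k], bounds[k + 1]
        if right <= lo or left >= hi:
            continue                                   # bin cannot contain a match
        if lo <= left and right <= hi:
            hits.extend(rid for rid, _ in cluster)     # whole bin matches, no value check
        else:
            # Boundary bin: each record's value must be checked individually.
            hits.extend(rid for rid, v in cluster if lo <= v < hi)
    return sorted(hits)


values = [5, 12, 27, 33, 18, 41]
edges = [10, 20, 30, 40]
bins = build_bins(values, edges)
matches = range_query(bins, edges, 15, 35)
```

The per-record checks in boundary bins are independent of one another, which is the property a data-parallel implementation could exploit by assigning one thread per record.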
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented for a state-of-the-art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand for compute time, memory, storage and I/O, along with the need for their effective management. The most resource-intensive modules of the algorithm are traveltime calculations and migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and the mechanism that feeds them to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for a multicore CPU based parallel system had been developed. Recently, we have worked on improving the parallel performance of this application by re-designing the parallelization approach. The new algorithm can efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating a large number of traces within the available node memory and with minimal requirements for storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
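The reported efficiency figures are consistent with the usual definition of parallel efficiency as speedup divided by the number of workers, taking the 64 nodes as the worker count. A small check, with the function name chosen here for illustration:

```python
def parallel_efficiency(speedup, workers):
    """Parallel efficiency as the fraction of ideal linear speedup achieved."""
    return speedup / workers


prestack = parallel_efficiency(49.05, 64)    # about 0.7664, i.e. 76.64%
poststack = parallel_efficiency(32.00, 64)   # exactly 0.5, i.e. 50.00%
```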
Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment

PubMed Central

Khan, Md. Ashfaquzzaman; Herbordt, Martin C.

2011-01-01

Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations. PMID:21822327
Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment.

PubMed

Khan, Md Ashfaquzzaman; Herbordt, Martin C

2011-07-20

Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations.
Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis

PubMed Central

Teodoro, George; Kurc, Tahsin; Andrade, Guilherme; Kong, Jun; Ferreira, Renato; Saltz, Joel

2015-01-01

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core-MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the performance of operations with respect to the device used. The performances of operations with regular data access are comparable or sometimes better on a MIC than that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performance compared to classic strategies in hybrid configurations. PMID:28239253
Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
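For orientation, the unit of work timed in this study, a single Lanczos step, can be sketched as follows. This is a generic textbook formulation in plain Python, not the authors' OpenMP or CUDA implementation; the dense-matrix helper `matvec` and the function name `lanczos_step` are assumptions for illustration. In practice the Hamiltonian is sparse and the matrix-vector product dominates the run time.

```python
import math

def matvec(H, v):
    """Dense matrix-vector product (a stand-in for the sparse Hamiltonian product)."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def lanczos_step(H, v_prev, v_curr, beta_prev):
    """One Lanczos iteration: returns (alpha, beta, next basis vector)."""
    w = matvec(H, v_curr)
    w = [wi - beta_prev * pi for wi, pi in zip(w, v_prev)]
    alpha = sum(wi * ci for wi, ci in zip(w, v_curr))      # diagonal element
    w = [wi - alpha * ci for wi, ci in zip(w, v_curr)]     # orthogonalize against v_curr
    beta = math.sqrt(sum(wi * wi for wi in w))             # off-diagonal element
    v_next = [wi / beta for wi in w] if beta > 0 else None # None: invariant subspace found
    return alpha, beta, v_next
```

For the first step, `v_prev` is the zero vector and `beta_prev` is 0.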
Cross talk analysis in multicore optical fibers by supermode theory.

PubMed

Szostkiewicz, Lukasz; Napierala, Marek; Ziolowicz, Anna; Pytel, Anna; Tenderenda, Tadeusz; Nasilowski, Tomasz

2016-08-15

We discuss the theoretical aspects of core-to-core power transfer in multicore fibers relying on supermode theory. Based on a dual core fiber model, we investigate the consequences of this approach, such as the influence of initial excitation conditions on cross talk. Supermode interpretation of power coupling proves to be intuitive and thus may lead to new concepts of multicore fiber-based devices. As a conclusion, we propose a definition of a uniform cross talk parameter that describes multicore fiber design.
Multicore-based 3D-DWT video encoder

NASA Astrophysics Data System (ADS)

Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Migallón, Hector

2013-12-01

Three-dimensional wavelet transform (3D-DWT) encoders are good candidates for applications like professional video editing, video surveillance, multi-spectral satellite imaging, etc., where a frame must be reconstructed as quickly as possible. In this paper, we present a new 3D-DWT video encoder based on a fast run-length coding engine. Furthermore, we present several multicore optimizations to speed up the 3D-DWT computation. An exhaustive evaluation of the proposed encoder (3D-GOP-RL) has been performed, and we have compared the evaluation results with other video encoders in terms of rate/distortion (R/D), coding/decoding delay, and memory consumption. Results show that the proposed encoder obtains good R/D results for high-resolution video sequences with nearly in-place computation using only the memory needed to store a group of pictures. After applying the multicore optimization strategies over the 3D DWT, the proposed encoder is able to compress a full high-definition video sequence in real-time.
Concurrent and Accurate Short Read Mapping on Multicore Processors.

PubMed

Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

2015-01-01

We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity compared to state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors

PubMed Central

Lee, Sungju; Kim, Heegon; Chung, Yongwha; Park, Daihee

2012-01-01

In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining the image/video quality. We improve the compression efficiency at the algorithmic level or derive the optimal parameters for the combination of a machine and compression based on the tradeoff between the energy consumption and the image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2∼5 without compromising image/video quality. PMID:23202181
Performance evaluation of canny edge detection on a tiled multicore architecture

NASA Astrophysics Data System (ADS)

Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

2011-01-01

In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP, and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that, for the same number of threads, programmer-implemented domain decomposition exhibits higher speed-ups than the compiler-managed loop-level parallelism implemented with OpenMP.
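The domain decomposition strategy described in this abstract, splitting an image into subimages that are processed independently, can be sketched as below. This is a minimal illustration under stated assumptions, not the Tile64 implementation: the kernel `horizontal_gradient` is a simple stand-in for the Canny stages, chosen because it reads only pixels in the same row, so row strips need no halo exchange. The function names and the use of a thread pool are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def horizontal_gradient(rows):
    """Stand-in kernel: |I[r][c+1] - I[r][c-1]| per pixel, zero at the left/right borders."""
    out = []
    for row in rows:
        grad = [0] * len(row)
        for c in range(1, len(row) - 1):
            grad[c] = abs(row[c + 1] - row[c - 1])
        out.append(grad)
    return out

def split_rows(image, parts):
    """Cut the image into `parts` contiguous row strips of near-equal height."""
    base, extra = divmod(len(image), parts)
    strips, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        strips.append(image[start:stop])
        start = stop
    return strips

def process_decomposed(image, workers=4):
    """Process each strip independently, then stitch the results back in order."""
    strips = split_rows(image, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(horizontal_gradient, strips)
    return [row for strip in results for row in strip]
```

A kernel with a vertical stencil, as in real Canny edge detection, would additionally need each strip to carry copies of its neighbours' boundary rows.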
Performance evaluation of GPU parallelization, space-time adaptive algorithms, and their combination for simulating cardiac electrophysiology.

PubMed

Sachetto Oliveira, Rafael; Martins Rocha, Bernardo; Burgarelli, Denise; Meira, Wagner; Constantinides, Christakis; Weber Dos Santos, Rodrigo

2018-02-01

The use of computer models as a tool for the study and understanding of the complex phenomena of cardiac electrophysiology has attained increased importance nowadays. At the same time, the increased complexity of the biophysical processes translates into complex computational and mathematical models. To speed up cardiac simulations and to allow more precise and realistic uses, 2 different techniques have been traditionally exploited: parallel computing and sophisticated numerical methods. In this work, we combine a modern parallel computing technique based on multicore and graphics processing units (GPUs) and a sophisticated numerical method based on a new space-time adaptive algorithm. We evaluate each technique alone and in different combinations: multicore and GPU, multicore and GPU and space adaptivity, multicore and GPU and space adaptivity and time adaptivity. All the techniques and combinations were evaluated under different scenarios: 3D simulations on slabs, 3D simulations on a ventricular mouse mesh, i.e., complex geometry, sinus-rhythm, and arrhythmic conditions. Our results suggest that multicore and GPU accelerate the simulations by an approximate factor of 33×, whereas the speedups attained by the space-time adaptive algorithms were approximately 48. Nevertheless, by combining all the techniques, we obtained speedups that ranged between 165 and 498. The tested methods were able to reduce the execution time of a simulation by more than 498× for a complex cellular model in a slab geometry and by 165× in a realistic heart geometry simulating spiral waves. The proposed methods will allow faster and more realistic simulations in a feasible time with no significant loss of accuracy. Copyright © 2017 John Wiley & Sons, Ltd.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.

The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
MIC-SVM: Designing A Highly Efficient Support Vector Machine For Advanced Modern Multi-Core and Many-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Song, Shuaiwen; Fu, Haohuan

2014-08-16

Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach an increasing importance to the analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi coprocessor (MIC).

Computational multicore on two-layer 1D shallow water equations for erodible dambreak

NASA Astrophysics Data System (ADS)

Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.

2018-03-01

The simulation of an erodible dambreak using the two-layer shallow water equations (SWE) and the SCHR scheme is elaborated in this paper. The results show that the two-layer SWE model is in good agreement with the experimental data obtained at the Université Catholique de Louvain, Louvain-la-Neuve. Moreover, results for the parallel algorithm on multicore architectures are given. They show that Computer I, with an Intel(R) Core(TM) i5-2500 quad-core CPU, has the best performance in accelerating the computational time, whereas Computer III, with an AMD A6-5200 quad-core APU, is observed to have higher speedup and efficiency. The speedup and efficiency of Computer III with 3200 grids are 3.716050530 times and 92.9%, respectively.
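The two figures quoted at the end of this abstract are consistent with the standard definition of parallel efficiency as speedup divided by core count. As a worked check, assuming all four cores of the quad-core AMD A6-5200 were used ($p = 4$ is inferred from "Quad-Core", not stated explicitly in the abstract):

```latex
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = 3.716050530,
\qquad
E = \frac{S}{p} = \frac{3.716050530}{4} \approx 0.929 = 92.9\%
```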
Optimizing the Performance of Reactive Molecular Dynamics Simulations for Multi-core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aktulga, Hasan Metin; Coffman, Paul; Shan, Tzu-Ray

2015-12-01

Hybrid parallelism allows high performance computing applications to better leverage the increasing on-node parallelism of modern supercomputers. In this paper, we present a hybrid parallel implementation of the widely used LAMMPS/ReaxC package, where the construction of bonded and nonbonded lists and evaluation of complex ReaxFF interactions are implemented efficiently using OpenMP parallelism. Additionally, the performance of the QEq charge equilibration scheme is examined and a dual-solver is implemented. We present the performance of the resulting ReaxC-OMP package on a state-of-the-art multi-core architecture Mira, an IBM BlueGene/Q supercomputer. For system sizes ranging from 32 thousand to 16.6 million particles, speedups in the range of 1.5-4.5x are observed using the new ReaxC-OMP software. Sustained performance improvements have been observed for up to 262,144 cores (1,048,576 processes) of Mira with a weak scaling efficiency of 91.5% in larger simulations containing 16.6 million particles.
Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

NASA Astrophysics Data System (ADS)

Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

2017-10-01

In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.
Fault-Tolerant Software-Defined Radio on Manycore

NASA Technical Reports Server (NTRS)

Ricketts, Scott

2015-01-01

Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiation-hardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload, reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.
Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2014-07-18

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits an application’s performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.
Multi-cored vortices support function of slotted wing tips of birds in gliding and flapping flight

PubMed Central

2017-01-01

Slotted wing tips of birds are commonly considered an adaptation to improve soaring performance, despite their presence in species that neither soar nor glide. We used particle image velocimetry to measure the airflow around the slotted wing tip of a jackdaw (Corvus monedula) as well as in its wake during unrestrained flight in a wind tunnel. The separated primary feathers produce individual wakes, confirming a multi-slotted function, in both gliding and flapping flight. The resulting multi-cored wingtip vortex represents a spreading of vorticity, which has previously been suggested as indicative of increased aerodynamic efficiency. Considering benefits of the slotted wing tips that are specific to flapping flight combined with the wide phylogenetic occurrence of this configuration, we propose the hypothesis that slotted wings evolved initially to improve performance in powered flight. PMID:28539482
Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

DOE PAGES

Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.; ...

2011-03-02

The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform

PubMed Central

Yang, Lin; Gong, Leiguang; Zhang, Hong; Nosher, John L.; Foran, David J.

2013-01-01

Point matching is crucial for many computer vision applications. Establishing the correspondence between a large number of data points is a computationally intensive process. Some point matching related applications, such as medical image registration, require real time or near real time performance if applied to critical clinical applications like image assisted surgery. In this paper, we report a new multicore platform based parallel algorithm for fast point matching in the context of landmark based medical image registration. We introduced a non-regular data partition algorithm which utilizes the K-means clustering algorithm to group the landmarks based on the number of available processing cores, which optimize the memory usage and data transfer. We have tested our method using the IBM Cell Broadband Engine (Cell/B.E.) platform. The results demonstrated a significant speed up over its sequential implementation. The proposed data partition and parallelization algorithm, though tested only on one multicore platform, is generic by its design. Therefore the parallel algorithm can be extended to other computing platforms, as well as other point matching related applications. PMID:24308014
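The data partition idea in this abstract, grouping landmarks with K-means so that the number of groups equals the number of available cores, can be sketched as follows. This is a plain Lloyd's-iteration illustration, not the authors' Cell/B.E. code: the function name `kmeans_partition`, the 2D tuple representation of landmarks, the fixed iteration count, and the deterministic initialization from the first `cores` points are all assumptions. The sketch requires at least as many landmarks as cores.

```python
def kmeans_partition(points, cores, iterations=10):
    """Group 2D landmarks into `cores` spatial clusters, one work unit per core."""
    centers = [points[i] for i in range(cores)]   # assumed init: first `cores` points
    groups = [[] for _ in range(cores)]
    for _ in range(iterations):
        groups = [[] for _ in range(cores)]
        for p in points:
            # Assign each landmark to its nearest center (squared Euclidean distance).
            nearest = min(
                range(cores),
                key=lambda k: (p[0] - centers[k][0]) ** 2 + (p[1] - centers[k][1]) ** 2,
            )
            groups[nearest].append(p)
        for k, group in enumerate(groups):
            if group:                             # keep the old center if a group is empty
                centers[k] = (
                    sum(x for x, _ in group) / len(group),
                    sum(y for _, y in group) / len(group),
                )
    return groups
```

Because K-means groups spatially close landmarks, each core works on a compact region, which is the property the abstract credits for reduced memory usage and data transfer. The groups are not guaranteed to be equal in size.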
Influence of fibre design and curvature on crosstalk in multi-core fibre

NASA Astrophysics Data System (ADS)

Egorova, O. N.; Astapovich, M. S.; Melnikov, L. A.; Salganskii, M. Yu; Mishkin, V. P.; Nishchev, K. N.; Semjonov, S. L.; Dianov, E. M.

2016-03-01

We have studied the influence of cross-sectional structure and bends on optical cross-talk in a multicore fibre. A reduced refractive index layer produced between the cores of such fibre with a small centre-to-centre spacing between neighbouring cores (27 μm) reduces optical cross-talk by 20 dB. The cross-talk level achieved, 30 dB per kilometre of the length of the multicore fibre, is acceptable for a number of applications where relatively small lengths of fibre are needed. Moreover, a significant decrease in optical cross-talk has been ensured by reducing the winding diameter of multicore fibres with identical cores.
A highly efficient multi-core algorithm for clustering extremely large datasets

PubMed Central

2010-01-01

Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
Interfacial redox reaction-directed synthesis of silver@cerium oxide core-shell nanocomposites as catalysts for rechargeable lithium-air batteries

NASA Astrophysics Data System (ADS)

Liu, Ying; Wang, Man; Cao, Lu-Jie; Yang, Ming-Yang; Ho-Sum Cheng, Samson; Cao, Chen-Wei; Leung, Kwan-Lan; Chung, Chi-Yuen; Lu, Zhou-Guang

2015-07-01

A facile oxidation-reduction reaction method has been implemented to prepare pomegranate-like Ag@CeO2 multicore-shell structured nanocomposites. Under Ar atmosphere, redox reaction automatically occurs between AgNO3 and Ce(NO3)3 in an alkaline solution, where Ag+ is reduced to Ag nanoparticles and Ce3+ is simultaneously oxidized to form CeO2, followed by the self-assembly to form the pomegranate-like multicore-shell structured Ag@CeO2 nanocomposites driven by thermodynamic equilibrium. No other organic amines or surfactants are utilized in the whole reaction system and only NaOH instead of organic reducing agent is used to prevent the introduction of a secondary reducing byproduct. The as-obtained pomegranate-like Ag@CeO2 multicore-shell structured nanocomposites have been characterized as electro-catalysts for the air cathode of lithium-air batteries operated in a simulated air environment. Superior electrochemical performance with high discharge capacity of 3415 mAh g-1 at 100 mA g-1, stable cycling and small charge/discharge polarization voltage is achieved, which is much better than that of the CeO2 or simple mixture of CeO2 and Ag. The enhanced properties can be primarily attributed to the synergy effect between the Ag core and the CeO2 shell resulting from the unique pomegranate-like multicore-shell nanostructures possessing plenty of active sites to promote the facile formation and decomposition of Li2O2.
Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

DTIC Science & Technology

2013-09-30

STM32 NXP LPC series No Proprietary Microchip PIC32/DSPIC No > 500 mW; < 5 W ARM Cortex TI OMAP TI Sitara Broadcom BCM2835 Varies FPGA...1 DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Application of Advanced Multi-Core Processor Technologies...state-of-the-art information processing architectures. OBJECTIVES Next-generation processor architectures (multi-core, multi-threaded) hold the
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In a heterogeneous multi-core architecture, the CPU and GPU are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, CPU and GPU applications execute concurrently and both access the last-level cache (LLC). CPUs and GPUs have different memory access characteristics, so they differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. In contrast, GPU applications can tolerate increased memory access latency when there is sufficient thread-level parallelism. Taking this latency tolerance of GPU programs into account, this paper presents a method that lets GPU applications access memory directly, leaving much of the LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache-sensitive and the GPU application is cache-insensitive, the overall performance of the system improves significantly.
Influence of fibre design and curvature on crosstalk in multi-core fibre

DOE Office of Scientific and Technical Information (OSTI.GOV)

Egorova, O N; Astapovich, M S; Semjonov, S L

2016-03-31

We have studied the influence of cross-sectional structure and bends on optical cross-talk in a multicore fibre. A reduced refractive index layer produced between the cores of such fibre with a small centre-to-centre spacing between neighbouring cores (27 μm) reduces optical cross-talk by 20 dB. The cross-talk level achieved, 30 dB per kilometre of the length of the multicore fibre, is acceptable for a number of applications where relatively small lengths of fibre are needed. Moreover, a significant decrease in optical cross-talk has been ensured by reducing the winding diameter of multicore fibres with identical cores. (fiber optics)
Connectivity: Performance Portable Algorithms for graph connectivity v. 0.1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Slota, George; Rajamanickam, Sivasankaran; Madduri, Kamesh

Graphs occur in many real-world settings, from road networks and social networks to scientific simulations. Connectivity is graph analysis software for computing graph connectivity on modern architectures such as multicore CPUs, Xeon Phi and GPUs.
Multi-cored vortices support function of slotted wing tips of birds in gliding and flapping flight.

PubMed

KleinHeerenbrink, Marco; Johansson, L Christoffer; Hedenström, Anders

2017-05-01

Slotted wing tips of birds are commonly considered an adaptation to improve soaring performance, despite their presence in species that neither soar nor glide. We used particle image velocimetry to measure the airflow around the slotted wing tip of a jackdaw (Corvus monedula) as well as in its wake during unrestrained flight in a wind tunnel. The separated primary feathers produce individual wakes, confirming a multi-slotted function, in both gliding and flapping flight. The resulting multi-cored wingtip vortex represents a spreading of vorticity, which has previously been suggested as indicative of increased aerodynamic efficiency. Considering benefits of the slotted wing tips that are specific to flapping flight combined with the wide phylogenetic occurrence of this configuration, we propose the hypothesis that slotted wings evolved initially to improve performance in powered flight. © 2017 The Author(s).
The parallel algorithm for the 2D discrete wavelet transform

NASA Astrophysics Data System (ADS)

Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

2018-04-01

The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, for parallel processing on multi-core processors, this scheme is inappropriate due to its large number of steps. On such architectures, the number of steps corresponds to the number of points at which data must be exchanged. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluated on multi-core CPUs, our scheme consistently outperforms the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
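To make the notion of lifting "steps" concrete, the following is a minimal sketch of one level of a conventional 1-D lifting transform. It is not the rearranged scheme the abstract proposes. The CDF 5/3 wavelet, the periodic boundary handling, and the function names are assumptions chosen for illustration, since the abstract does not name a specific wavelet.

```python
def lifting_forward(signal):
    """One level of a 1-D CDF 5/3 transform via lifting (periodic boundary).

    Assumption: the wavelet and boundary rule are illustrative choices,
    not taken from the abstract.
    """
    if len(signal) < 2 or len(signal) % 2:
        raise ValueError("signal length must be even and at least 2")
    even = list(signal[0::2])
    odd = list(signal[1::2])
    n = len(even)
    # Step 1 (predict): each odd sample needs its two even neighbours.
    detail = [odd[i] - 0.5 * (even[i] + even[(i + 1) % n]) for i in range(n)]
    # Step 2 (update): each even sample needs two results of step 1, so a
    # parallel version has to exchange data between the two steps.
    approx = [even[i] + 0.25 * (detail[i - 1] + detail[i]) for i in range(n)]
    return approx, detail


def lifting_inverse(approx, detail):
    """Undo lifting_forward by running the steps in reverse order."""
    n = len(approx)
    even = [approx[i] - 0.25 * (detail[i - 1] + detail[i]) for i in range(n)]
    odd = [detail[i] + 0.5 * (even[i] + even[(i + 1) % n]) for i in range(n)]
    out = [0.0] * (2 * n)
    out[0::2] = even
    out[1::2] = odd
    return out
```

In this sketch the update step cannot start until the predict step has produced the neighbouring detail coefficients. That dependency is the kind of data-exchange point the abstract identifies as a bottleneck on multi-core processors.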
A Programming Model Performance Study Using the NAS Parallel Benchmarks

DOE PAGES

Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...

2010-01-01

Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models (MPI, OpenMP and PGAS) to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an Infiniband cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve equal performance to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. We also compare performance differences between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.
Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

DTIC Science & Technology

2014-04-01

Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey...multicore PDSP platforms. The GPU-based capabilities of TDIF are currently oriented towards NVIDIA GPUs, based on the Compute Unified Device Architecture...CUDA) programming language [NVIDIA 2007], which can be viewed as an extension of C. The multicore PDSP capabilities currently in TDIF are oriented
Reducing Response Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms

DTIC Science & Technology

2016-01-01

synchronous parallel tasks on multicore platforms. In 25th ECRTS, 2013. [10] U. Devi. Soft Real-Time Scheduling on Multiprocessors. PhD thesis...report, Washington University in St Louis, 2014. [18] C. Liu and J. Anderson. Supporting soft real-time DAG-based systems on multiprocessors with...analysis for DAG-based real-time task systems implemented on heterogeneous multicore platforms. The specific analysis problem that is considered was

Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets.

PubMed

Scharfe, Michael; Pielot, Rainer; Schreiber, Falk

2010-01-11

Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. We evaluate the CBE-driven PlayStation 3 as a high-performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reduction and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low-cost CBE-based platform offers an efficient alternative to conventional hardware for solving computational problems in image processing and bioinformatics.
Thread mapping using system-level model for shared memory multicores

NASA Astrophysics Data System (ADS)

Mitra, Reshmi

Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread count. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications, so that users can explore the design space within reasonable machine hours without a thorough understanding of how the code interacts with the platform. Response time is related to the cycles per instruction retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on a Markov Chain Model (MCM) and a Model Tree (MT), for system-level steady-state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the performance at smaller AS. This aspect of the framework, along with the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady-state CPI results with 12 different MSs is less than 1%. The total run time of the model is of the order of minutes, whereas the actual application execution time is of the order of days.
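As a rough illustration of how a steady-state quantity can be read off a Markov chain of thread states, the sketch below computes the stationary distribution of a small chain by power iteration. The two-state layout (active, suspended), the transition probabilities, and the rule "system-level CPI = active-state CPI divided by the fraction of time spent active" are all assumptions made for this example. The abstract does not give the model's actual states or formulas.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of a row-stochastic matrix P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # One step of the chain: new_pi[j] = sum_i pi[i] * P[i][j]
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical two-state chain: state 0 = active, state 1 = suspended.
P = [
    [0.9, 0.1],  # active -> active, active -> suspended
    [0.3, 0.7],  # suspended -> active, suspended -> suspended
]
pi = stationary_distribution(P)

# Assumed simplification: instructions retire only in the active state,
# so stalled cycles inflate the system-level CPI.
cpi_active = 1.2  # hypothetical per-instruction cost while active
cpi_system = cpi_active / pi[0]
```

For a two-state chain the result can be checked by hand: the active fraction is 0.3 / (0.1 + 0.3) = 0.75, so under the stated assumption the system-level CPI is 1.2 / 0.75 = 1.6.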
Implementing an Affordable High-Performance Computing for Teaching-Oriented Computer Science Curriculum

ERIC Educational Resources Information Center

Abuzaghleh, Omar; Goldschmidt, Kathleen; Elleithy, Yasser; Lee, Jeongkyu

2013-01-01

With the advances in computing power, high-performance computing (HPC) platforms have had an impact on not only scientific research in advanced organizations but also computer science curriculum in the educational community. For example, multicore programming and parallel systems are highly desired courses in the computer science major. However,…
Nonlinear combining and compression in multicore fibers

DOE PAGES

Chekhovskoy, I. S.; Rubenchik, A. M.; Shtyrina, O. V.; ...

2016-10-25

In this paper, we demonstrate numerically light-pulse combining and pulse compression using wave-collapse (self-focusing) energy-localization dynamics in a continuous-discrete nonlinear system, as implemented in a multicore fiber (MCF) using one-dimensional (1D) and 2D core distribution designs. Large-scale numerical simulations were performed to determine the conditions of the most efficient coherent combining and compression of pulses injected into the considered MCFs. We demonstrate the possibility of combining in a single core 90% of the total energy of pulses initially injected into all cores of a 7-core MCF with a hexagonal lattice. Finally, a pulse compression factor of about 720 can be obtained with a 19-core ring MCF.
SiN-assisted polarization-insensitive multicore fiber to silicon photonics interface

NASA Astrophysics Data System (ADS)

Poulopoulos, Giannis N.; Kalavrouziotis, Dimitrios; Mitchell, Paul; Macdonald, John R.; Bakopoulos, Paraskevas; Avramopoulos, Hercules

2015-06-01

We demonstrate a polarization-insensitive coupler interfacing multicore-fiber (MCF) to silicon waveguides. It comprises a 3D glass fanout transforming the circular MCF core-arrangement to linear and performing initial tapering, followed by a Spot-Size-Converter on the silicon chip. Glass waveguides are formed of multiple overlapped modification elements and appropriate offsetting thereof yields tapers with symmetric cross-section. The Spot-Size-Converter is an inversely tapered silicon waveguide with a tapered polymer overcladding where light is initially coupled, whereas phase-matching gradually shifts it towards the silicon core. Co-design of the glass fanout and Spot-Size-Converter obtains theoretical loss below 1 dB for the overall Si-to-MCF transition in both polarizations.
Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multi-core architectures.

PubMed

Stamatakis, Alexandros; Ott, Michael

2008-12-27

The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on 'gappy' multi-gene alignments. By 'gappy' we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAXML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal

Modern hardware contains parallel execution resources that are well-suited for data parallelism (vector units) and task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently as multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.
Options for Parallelizing a Planning and Scheduling Algorithm

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.

2011-01-01

Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions, limited processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nagasaka, Y; Matsuoka, S; Azad, A

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms show significant speedups over libraries in the majority of the cases, while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for a target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
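The idea of a hash-table accumulator, and why skipping the per-row sort saves work, can be sketched in a few lines. This is a generic serial row-wise formulation that uses a Python dict as the hash table. It is not the authors' optimized KNL implementation, and the row-list matrix layout, the function name and the `sort_rows` flag are assumptions made for illustration.

```python
def spgemm_hash(A, B, sort_rows=False):
    """Multiply sparse matrices stored as lists of rows of (column, value) pairs.

    Each output row is accumulated in a hash table keyed by column index.
    With sort_rows=False the nonzeros are emitted in hash-table order,
    which avoids a sort per output row.
    """
    C = []
    for row in A:
        acc = {}  # column index -> accumulated value for this output row
        for k, a_val in row:
            # Scale row k of B by A[i][k] and merge it into the accumulator.
            for j, b_val in B[k]:
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        entries = list(acc.items())
        if sort_rows:
            entries.sort()  # extra per-row work, needed only if order matters
        C.append(entries)
    return C


# Hypothetical 2x3 times 3x2 example.
A = [[(0, 1.0), (2, 2.0)], [(1, 3.0)]]
B = [[(1, 4.0)], [(0, 5.0)], [(1, 6.0), (0, 7.0)]]
C = spgemm_hash(A, B, sort_rows=True)
```

Because the rows of the output are independent, a parallel version could assign rows to threads, each with its own accumulator. How the paper balances and schedules that work is not described in the abstract.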
Efficiently Scheduling Multi-core Guest Virtual Machines on Multi-core Hosts in Network Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoginath, Srikanth B; Perumalla, Kalyan S

2011-01-01

Virtual machine (VM)-based simulation is a method used by network simulators to incorporate realistic application behaviors by executing actual VMs as high-fidelity surrogates for simulated end-hosts. A critical requirement in such a method is the simulation time-ordered scheduling and execution of the VMs. Prior approaches such as time dilation are less efficient due to the high degree of multiplexing possible when multiple multi-core VMs are simulated on multi-core host systems. We present a new simulation time-ordered scheduler to efficiently schedule multi-core VMs on multi-core real hosts, with a virtual clock realized on each virtual core. The distinguishing features of our approach are: (1) customizable granularity of the VM scheduling time unit on the simulation time axis, (2) ability to take arbitrary leaps in virtual time by VMs to maximize the utilization of host (real) cores when guest virtual cores idle, and (3) empirically determinable optimality in the tradeoff between total execution (real) time and time-ordering accuracy levels. Experiments show that it is possible to get nearly perfect time-ordered execution, with a slight cost in total run time, relative to optimized non-simulation VM schedulers. Interestingly, with our time-ordered scheduler, it is also possible to reduce the time-ordering error from over 50% with a non-simulation scheduler to less than 1% with our scheduler, with almost the same run time efficiency as that of the highly efficient non-simulation VM schedulers.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

PubMed Central

2010-01-01

Background: Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results: We evaluate the CBE-driven PlayStation 3 as a high-performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reduction and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions: The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low-cost CBE-based platform offers an efficient alternative to conventional hardware for solving computational problems in image processing and bioinformatics. PMID:20064262
Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

NASA Astrophysics Data System (ADS)

Greynolds, Alan W.

2013-09-01

Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.
Printed freeform lens arrays on multi-core fibers for highly efficient coupling in astrophotonic systems.

PubMed

Dietrich, Philipp-Immanuel; Harris, Robert J; Blaicher, Matthias; Corrigan, Mark K; Morris, Tim M; Freude, Wolfgang; Quirrenbach, Andreas; Koos, Christian

2017-07-24

Coupling of light into multi-core fibers (MCF) for spatially resolved spectroscopy is of great importance to astronomical instrumentation. To achieve high coupling efficiencies along with fill-fractions close to unity, micro-optical elements are required to concentrate the incoming light to the individual cores of the MCF. In this paper we demonstrate facet-attached lens arrays (LA) fabricated by two-photon polymerization. The LA provide close to 100% fill-fraction along with efficiencies of up to 73% (down to 1.4 dB loss) for coupling of light from free space into an MCF core. We show the viability of the concept for astrophotonic applications by integrating an MCF-LA assembly in an adaptive-optics test bed and by assessing its performance as a tip/tilt sensor.
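The abstract quotes the same result both as a coupling efficiency (73%) and as a loss (1.4 dB). The short sketch below shows the standard conversion between the two, loss in dB = -10 log10(efficiency). The function names are chosen here for illustration.

```python
import math


def efficiency_to_loss_db(efficiency):
    """Convert a power coupling efficiency (0 < efficiency <= 1) to loss in dB."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return -10.0 * math.log10(efficiency)


def loss_db_to_efficiency(loss_db):
    """Convert a loss in dB back to a power coupling efficiency."""
    return 10.0 ** (-loss_db / 10.0)


# 73% efficiency corresponds to roughly 1.37 dB, i.e. the quoted 1.4 dB.
loss = efficiency_to_loss_db(0.73)
```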
Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores

NASA Astrophysics Data System (ADS)

Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei

We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores

PubMed Central

Kim, Youngmin; Lee, Chan-Gun

2017-01-01

In wireless sensor networks (WSNs), sensor nodes are deployed for collecting and analyzing data. These nodes use limited-energy batteries for easy deployment and low cost. The use of limited-energy batteries is closely related to the lifetime of the sensor nodes. Energy-efficient management is important for extending the lifetime of the sensor nodes. Most efforts to improve power efficiency in tiny sensor nodes have focused mainly on reducing the power consumed during data transmission. However, the recent emergence of sensor nodes equipped with multi-cores requires attention to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy-efficient scheduling method for sensor nodes supporting uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, a processor selection method for scheduling and a mapping method between the tasks and processors are proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods. PMID:29240695
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism

ERIC Educational Resources Information Center

Agarwal, Mayank

2009-01-01

The shift of the microprocessor industry towards multicore architectures has placed a huge burden on the programmers by requiring explicit parallelization for performance. Implicit Parallelization is an alternative that could ease the burden on programmers by parallelizing applications "under the covers" while maintaining sequential semantics…
Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4

DOE Office of Scientific and Technical Information (OSTI.GOV)

Computational Research Division, Lawrence Berkeley National Laboratory; NERSC, Lawrence Berkeley National Laboratory; Computer Science Department, University of California, Berkeley

2009-05-04

We apply auto-tuning to a hybrid MPI-pthreads lattice Boltzmann computation running on the Cray XT4 at National Energy Research Scientific Computing Center (NERSC). Previous work showed that multicore-specific auto-tuning can improve the performance of lattice Boltzmann magnetohydrodynamics (LBMHD) by a factor of 4 when running on dual- and quad-core Opteron dual-socket SMPs. We extend these studies to the distributed memory arena via a hybrid MPI/pthreads implementation. In addition to conventional auto-tuning at the local SMP node, we tune at the message-passing level to determine the optimal aspect ratio as well as the correct balance between MPI tasks and threads per MPI task. Our study presents a detailed performance analysis when moving along an isocurve of constant hardware usage: fixed total memory, total cores, and total nodes. Overall, our work points to approaches for improving intra- and inter-node efficiency on large-scale multicore systems for demanding scientific applications.
Behavior-aware cache hierarchy optimization for low-power multi-core embedded systems

NASA Astrophysics Data System (ADS)

Zhao, Huatao; Luo, Xiao; Zhu, Chen; Watanabe, Takahiro; Zhu, Tianbo

2017-07-01

In modern embedded systems, the increasing number of cores requires efficient cache hierarchies to ensure data throughput, but such cache hierarchies are restricted by their large size and by interfering accesses, which lead to both performance degradation and wasted energy. In this paper, we first propose a behavior-aware cache hierarchy (BACH) which can optimally allocate the multi-level cache resources to many cores and greatly improve the efficiency of the cache hierarchy, resulting in low energy consumption. The BACH takes full advantage of the explored application behaviors and runtime cache resource demands as the basis for cache allocation, so that we can optimally configure the cache hierarchy to meet the runtime demand. The BACH was implemented on the GEM5 simulator. The experimental results show that the energy consumption of a three-level cache hierarchy can be reduced by 5.29% to 27.94% compared with other key approaches, while the performance of the multi-core system even improves slightly when hardware overhead is counted.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.

PubMed

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-07-08

Energy efficiency is considered a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentation of the idle time, which is inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.
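Why fragmented idle time hurts under dynamic power management can be shown with a small energy model. The sketch below is a generic DPM illustration, not the T-L plane algorithm from the paper. The power levels, the transition energy and time, and the function name are hypothetical values chosen only to make the effect visible.

```python
def idle_energy(intervals, p_idle, p_sleep, e_transition, t_transition):
    """Energy spent over a list of idle intervals under a simple DPM policy.

    For each interval the core either stays idle at p_idle or, if the
    interval is long enough to enter and leave the sleep state, sleeps at
    p_sleep after paying a fixed transition energy. The cheaper option is
    chosen per interval.
    """
    total = 0.0
    for t in intervals:
        stay = p_idle * t
        if t >= t_transition:
            sleep = e_transition + p_sleep * (t - t_transition)
            total += min(stay, sleep)
        else:
            total += stay
    return total


# Hypothetical parameters: same 8 time units of idle time, split or merged.
params = dict(p_idle=1.0, p_sleep=0.1, e_transition=2.0, t_transition=1.0)
fragmented = idle_energy([2.0, 2.0, 2.0, 2.0], **params)
merged = idle_energy([8.0], **params)
```

With these assumed numbers, four short idle intervals never justify the transition cost, so the core stays idle for 8.0 energy units, while one merged interval lets it sleep for about 2.7. This is the motivation for reducing idle-time fragmentation that the abstract mentions.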
Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems

DTIC Science & Technology

2013-12-01

Loss Spectra” Proceedings of SPIE 8255, (2012) and in a journal publication: M. T. Crowley, D. Murrell, N. Patel, M. Breivik, C.-Y. Lin, Y. Li, B.-O...Crowley, D. Murrell, N. Patel, M. Breivik, C.-Y. Lin, Y. Li, B.-O. Fimland and L. F. Lester, "Analytical Modeling of the Temperature Performance of

Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K

2010-01-01

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
Electrosprayed Multi-Core Alginate Microcapsules as Novel Self-Healing Containers

NASA Astrophysics Data System (ADS)

Hia, Iee Lee; Pasbakhsh, Pooria; Chan, Eng-Seng; Chai, Siang-Piao

2016-10-01

Alginate microcapsules containing epoxy resin were developed through an electrospraying method and embedded into an epoxy matrix to produce a capsule-based self-healing composite system. These formaldehyde-free alginate/epoxy microcapsules were characterized via light microscope, field emission scanning electron microscope, Fourier transform infrared spectroscopy and thermogravimetric analysis. Results showed that epoxy resin was successfully encapsulated within the alginate matrix to form porous (multi-core) microcapsules with pore sizes ranging from 5-100 μm. The microcapsules had an average size of 320 ± 20 μm with decomposition temperature at 220 °C. The loading capacity of these capsules was estimated to be 79%. Under in situ healing test, impact specimens showed healing efficiency as high as 86% and the ability to heal up to 3 times due to the multi-core capsule structure and the high impact energy test that triggered the release of epoxy especially in the second and third healings. TDCB specimens showed one-time healing only with the highest healing efficiency of 76%. The single healing event was attributed to the constant crack propagation rate of the TDCB fracture test. For the first time, a cost-effective, environmentally benign and sustainable capsule-based self-healing system with multiple healing capabilities and high healing performance was developed.
Electrosprayed Multi-Core Alginate Microcapsules as Novel Self-Healing Containers.

PubMed

Hia, Iee Lee; Pasbakhsh, Pooria; Chan, Eng-Seng; Chai, Siang-Piao

2016-10-03

Alginate microcapsules containing epoxy resin were developed through an electrospraying method and embedded into an epoxy matrix to produce a capsule-based self-healing composite system. These formaldehyde-free alginate/epoxy microcapsules were characterized via light microscope, field emission scanning electron microscope, Fourier transform infrared spectroscopy and thermogravimetric analysis. Results showed that epoxy resin was successfully encapsulated within the alginate matrix to form porous (multi-core) microcapsules with pore sizes ranging from 5-100 μm. The microcapsules had an average size of 320 ± 20 μm with decomposition temperature at 220 °C. The loading capacity of these capsules was estimated to be 79%. Under in situ healing test, impact specimens showed healing efficiency as high as 86% and the ability to heal up to 3 times due to the multi-core capsule structure and the high impact energy test that triggered the release of epoxy especially in the second and third healings. TDCB specimens showed one-time healing only with the highest healing efficiency of 76%. The single healing event was attributed to the constant crack propagation rate of the TDCB fracture test. For the first time, a cost-effective, environmentally benign and sustainable capsule-based self-healing system with multiple healing capabilities and high healing performance was developed.
Electrosprayed Multi-Core Alginate Microcapsules as Novel Self-Healing Containers

PubMed Central

Hia, Iee Lee; Pasbakhsh, Pooria; Chan, Eng-Seng; Chai, Siang-Piao

2016-01-01

Alginate microcapsules containing epoxy resin were developed through an electrospraying method and embedded into an epoxy matrix to produce a capsule-based self-healing composite system. These formaldehyde-free alginate/epoxy microcapsules were characterized via light microscope, field emission scanning electron microscope, Fourier transform infrared spectroscopy and thermogravimetric analysis. Results showed that epoxy resin was successfully encapsulated within the alginate matrix to form porous (multi-core) microcapsules with pore sizes ranging from 5–100 μm. The microcapsules had an average size of 320 ± 20 μm with a decomposition temperature of 220 °C. The loading capacity of these capsules was estimated to be 79%. Under the in situ healing test, impact specimens showed healing efficiency as high as 86% and the ability to heal up to 3 times due to the multi-core capsule structure and the high impact energy test that triggered the release of epoxy, especially in the second and third healings. TDCB specimens showed one-time healing only, with the highest healing efficiency of 76%. The single healing event was attributed to the constant crack propagation rate of the TDCB fracture test. For the first time, a cost-effective, environmentally benign and sustainable capsule-based self-healing system with multiple healing capabilities and high healing performance was developed. PMID:27694922
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
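The abstract does not spell out the load-prediction dynamic scheduling algorithm, so the following is only a generic sketch of the idea: estimate each device's throughput from earlier timings and split the next batch of work in proportion. The function names, the smoothing factor `alpha`, and the example rates are all illustrative assumptions, not values from the paper.

```python
def split_work(n_items, rates):
    """Split n_items across devices in proportion to predicted rates (items/s)."""
    total = sum(rates.values())
    shares = {dev: int(n_items * r / total) for dev, r in rates.items()}
    # Hand any rounding remainder to the device predicted to be fastest.
    fastest = max(rates, key=rates.get)
    shares[fastest] += n_items - sum(shares.values())
    return shares


def update_rate(old_rate, items_done, seconds, alpha=0.5):
    """Blend the newly measured throughput into the running prediction."""
    measured = items_done / seconds
    return alpha * measured + (1 - alpha) * old_rate


# Hypothetical predicted throughputs for a CPU pool and a GPU.
rates = {"cpu": 100.0, "gpu": 400.0}
shares = split_work(1000, rates)
```

After each batch, `update_rate` would be called per device with the observed timing so that the next split tracks the actual load.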
Multiplexed fibre optic sensing in the distal lung (Conference Presentation)

NASA Astrophysics Data System (ADS)

Choudhary, Tushar R.; Tanner, Michael G.; Megia-Fernandez, Alicia; Harrington, Kerrianne; Wood, Harry A.; Chankeshwara, Sunay; Zhu, Patricia; Choudhury, Debaditya; Yu, Fei; Thomson, Robert R.; Duncan, Rory R.; Dhaliwal, Kevin; Bradley, Mark

2017-02-01

We present a toolkit for a multiplexed pH and oxygen sensing probe in the distal lung using multicore fibres. Measuring physiologically relevant parameters like pH and oxygen is of significant importance in understanding changes associated with disease pathology. We present here a single multicore fibre based pH and oxygen sensing probe which can be used with a standard bronchoscope to perform in vivo measurements in the distal lung. The multiplexed probe consists of fluorescent pH sensors (fluorescein based) and oxygen sensors (Palladium porphyrin complex based) covalently bonded to silica microspheres (10 µm) loaded on the distal facet of a 19 core (10 µm core diameter) multicore fibre (total diameter of 150 µm excluding coating). Pits are formed by selectively etching the cores using hydrofluoric acid, and multiplexing is achieved through the self-location of individual probes on differing cores. This architecture can be expanded to include probes for further parameters. Robust measurements are demonstrated of self-referencing fluorophores, not limited by photobleaching, with short (100 ms) measurement times at low ( 10µW) illumination powers. We have performed on-bench calibration and tests of in vitro tissue models and in an ovine whole lung model to validate our sensors. The pH sensor is demonstrated in the physiologically relevant range of pH 5 to pH 8.5 and with an accuracy of ± 0.05 pH units. The oxygen sensor is demonstrated in gas mixtures downwards from 20% oxygen and in liquid saturated with 20% oxygen mixtures ( 8mg/L) down to full depletion (0 mg/L) with 0.5 mg/L accuracy.
Multi-Threaded DNA Tag/Anti-Tag Library Generator for Multi-Core Platforms

DTIC Science & Technology

2009-05-01

base pair) Watson ‐ Crick strand pairs that bind perfectly within pairs, but poorly across pairs. A variety of DNA strand hybridization metrics...AFRL-RI-RS-TR-2009-131 Final Technical Report May 2009 MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE PLATFORMS...TYPE Final 3. DATES COVERED (From - To) Jun 08 – Feb 09 4. TITLE AND SUBTITLE MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE
Coupled-mode propagation in multicore fibers characterized by optical low-coherence reflectometry.

PubMed

Salathé, R P; Gilgen, H; Bodmer, G

1996-07-01

A fiber-optical low-coherence reflectometer has been used to probe a multicore fiber locally at a wavelength of 1.3 μm. This technique allows one to determine the group index of refraction of the modes in the multicore fiber with high accuracy. Light propagation that is due to noncoherent coupling of energy from one fiber core to adjacent cores through cladding modes can be distinguished quantitatively from light propagating in coherently coupled modes. Intercore coupling constants in the range of 0.6-2 mm⁻¹ have been evaluated for the coupled modes.
Core-to-core uniformity improvement in multi-core fiber Bragg gratings

NASA Astrophysics Data System (ADS)

Lindley, Emma; Min, Seong-Sik; Leon-Saval, Sergio; Cvetojevic, Nick; Jovanovic, Nemanja; Bland-Hawthorn, Joss; Lawrence, Jon; Gris-Sanchez, Itandehui; Birks, Tim; Haynes, Roger; Haynes, Dionne

2014-07-01

Multi-core fiber Bragg gratings (MCFBGs) will be a valuable tool not only in communications but also various astronomical, sensing and industry applications. In this paper we address some of the technical challenges of fabricating effective multi-core gratings by simulating improvements to the writing method. These methods allow a system designed for inscribing single-core fibers to cope with MCFBG fabrication with only minor, passive changes to the writing process. Using a capillary tube that was polished on one side, the field entering the fiber was flattened which improved the coverage and uniformity of all cores.
A Parallel Framework with Block Matrices of a Discrete Fourier Transform for Vector-Valued Discrete-Time Signals.

PubMed

Soto-Quiros, Pablo

2015-01-01

This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework with a set of block matrix operations. These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is an advantage in using multicore processors and a parallel computing environment to reduce the high execution time. Additionally, speedup increases as the number of logical processors and the length of the signal increase.
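The abstract does not define the vector-valued DFT, so the sketch below uses the ordinary scalar DFT instead and only illustrates the block-matrix idea: partitioning the transform matrix into row blocks yields independent products that can be handed to separate workers. The function names and block count are assumptions made for illustration.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft_rows(signal, rows):
    """Apply the given rows of the DFT matrix to the signal (one row block)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in rows
    ]


def block_dft(signal, n_blocks=2):
    """Compute the DFT by splitting the matrix into contiguous row blocks."""
    n = len(signal)
    bounds = [n * i // n_blocks for i in range(n_blocks + 1)]
    blocks = [range(bounds[i], bounds[i + 1]) for i in range(n_blocks)]
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        # Each block is an independent task; results come back in block order.
        parts = pool.map(lambda rows: dft_rows(signal, rows), blocks)
        return [value for part in parts for value in part]


spectrum = block_dft([1, 1, 1, 1], n_blocks=2)
```

In CPython, threads will not speed up this pure-Python arithmetic; the point is the decomposition, which maps directly onto process pools or a parallel toolbox.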
MDTM: Optimizing Data Transfer using Multicore-Aware I/O Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Liang; Demar, Phil; Wu, Wenji

2017-05-09

Bulk data transfer is facing significant challenges in the coming era of big data. There are multiple performance bottlenecks along the end-to-end path from the source to destination storage system. The limitations of current generation data transfer tools themselves can have a significant impact on end-to-end data transfer rates. In this paper, we identify the issues that lead to underperformance of these tools, and present a new data transfer tool with an innovative I/O scheduler called MDTM. The MDTM scheduler exploits underlying multicore layouts to optimize throughput by reducing delay and contention for I/O reading and writing operations. With our evaluations, we show how MDTM successfully avoids NUMA-based congestion and significantly improves end-to-end data transfer rates across high-speed wide area networks.
MDTM: Optimizing Data Transfer using Multicore-Aware I/O Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Liang; Demar, Phil; Wu, Wenji

2017-01-01

Bulk data transfer is facing significant challenges in the coming era of big data. There are multiple performance bottlenecks along the end-to-end path from the source to destination storage system. The limitations of current generation data transfer tools themselves can have a significant impact on end-to-end data transfer rates. In this paper, we identify the issues that lead to underperformance of these tools, and present a new data transfer tool with an innovative I/O scheduler called MDTM. The MDTM scheduler exploits underlying multicore layouts to optimize throughput by reducing delay and contention for I/O reading and writing operations. With our evaluations, we show how MDTM successfully avoids NUMA-based congestion and significantly improves end-to-end data transfer rates across high-speed wide area networks.
An MPI-1 Compliant Thread-Based Implementation

NASA Astrophysics Data System (ADS)

Díaz Martín, J. C.; Rico Gallego, J. A.; Álvarez Llorente, J. M.; Perogil Duque, J. F.

This work presents AzequiaMPI, the first fully compliant implementation of the MPI-1 standard where the MPI node is a thread. Performance comparisons with MPICH2-Nemesis show that thread-based implementations adequately exploit multicore architectures under oversubscription, which could make MPI competitive with OpenMP-like solutions.
Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments

DTIC Science & Technology

2013-05-01

consolidated at the sender side. At the receiver side, the messages are deconsolidated and delivered to the appropriate thread. This approach bears some...Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. Panda . Performance comparison of mpi implementations over infiniband, myrinet and quadrics
RTEMS SMP and MTAPI for Efficient Multi-Core Space Applications on LEON3/LEON4 Processors

NASA Astrophysics Data System (ADS)

Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco

2015-09-01

This paper presents the final result of a European Space Agency (ESA) activity aimed at improving the software support for LEON processors used in SMP configurations. One of the benefits of using a multicore system in an SMP configuration is that in many instances it is possible to better utilize the available processing resources by load balancing between cores. This, however, comes with the cost of having to synchronize operations between cores, leading to increased complexity. While in an AMP system one can use multiple instances of operating systems that are only uni-processor capable, an SMP system requires the operating system to be written to support multicore systems. In this activity we have improved and extended the SMP support of the RTEMS real-time operating system and ensured that it fully supports the multicore capable LEON processors. The targeted hardware in the activity has been the GR712RC, a dual-core LEON3FT processor, and the functional prototype of ESA's Next Generation Multiprocessor (NGMP), a quad core LEON4 processor. The final version of the NGMP is now available as a product under the name GR740. An implementation of the Multicore Task Management API (MTAPI) has been developed as part of this activity to aid in the parallelization of applications for RTEMS SMP. It allows for simplified development of parallel applications using the task-based programming model. An existing space application, the Gaia Video Processing Unit, has been ported to RTEMS SMP using the MTAPI implementation to demonstrate the feasibility and usefulness of multicore processors for space payload software. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.
Node Resource Manager: A Distributed Computing Software Framework Used for Solving Geophysical Problems

NASA Astrophysics Data System (ADS)

Lawry, B. J.; Encarnacao, A.; Hipp, J. R.; Chang, M.; Young, C. J.

2011-12-01

With the rapid growth of multi-core computing hardware, it is now possible for scientific researchers to run complex, computationally intensive software on affordable, in-house commodity hardware. Multi-core CPUs (Central Processing Units) and GPUs (Graphics Processing Units) are now commonplace in desktops and servers. Developers today have access to extremely powerful hardware that enables the execution of software that could previously only be run on expensive, massively-parallel systems. It is no longer cost-prohibitive for an institution to build a parallel computing cluster consisting of commodity multi-core servers. In recent years, our research team has developed a distributed, multi-core computing system and used it to construct global 3D earth models using seismic tomography. Traditionally, computational limitations forced certain assumptions and shortcuts in the calculation of tomographic models; however, with the recent rapid growth in computational hardware including faster CPUs, increased RAM, and the development of multi-core computers, we are now able to perform seismic tomography, 3D ray tracing and seismic event location using distributed parallel algorithms running on commodity hardware, thereby eliminating the need for many of these shortcuts. We describe Node Resource Manager (NRM), a system we developed that leverages the capabilities of a parallel computing cluster. NRM is a software-based parallel computing management framework that works in tandem with the Java Parallel Processing Framework (JPPF, http://www.jppf.org/), a third party library that provides a flexible and innovative way to take advantage of modern multi-core hardware. NRM enables multiple applications to use and share a common set of networked computers, regardless of their hardware platform or operating system.
Using NRM, algorithms can be parallelized to run on multiple processing cores of a distributed computing cluster of servers and desktops, which results in a dramatic speedup in execution time. NRM is sufficiently generic to support applications in any domain, as long as the application is parallelizable (i.e., can be subdivided into multiple individual processing tasks). At present, NRM has been effective in decreasing the overall runtime of several algorithms: 1) the generation of a global 3D model of the compressional velocity distribution in the Earth using tomographic inversion, 2) the calculation of the model resolution matrix, model covariance matrix, and travel time uncertainty for the aforementioned velocity model, and 3) the correlation of waveforms with archival data on a massive scale for seismic event detection. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors

PubMed Central

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-01-01

Energy efficiency is considered a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentation of the idle time, which is inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction. PMID:27399722
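To illustrate why fragmented idle time matters under DPM, here is a deliberately simplified energy model (not the paper's algorithm): for each idle interval the core either stays idle or pays a fixed transition energy to sleep, whichever is cheaper. All power and energy figures are hypothetical, and transition latency is ignored.

```python
def idle_energy(intervals, p_idle=1.0, p_sleep=0.1, e_transition=1.8):
    """Energy spent over idle intervals, choosing idle vs. sleep per interval."""
    total = 0.0
    for t in intervals:
        stay_idle = p_idle * t
        go_to_sleep = e_transition + p_sleep * t
        # Sleeping only pays off when the interval is long enough
        # to amortize the mode-transition cost.
        total += min(stay_idle, go_to_sleep)
    return total


fragmented = idle_energy([1, 1, 1, 1])  # four short idle gaps
merged = idle_energy([4])               # the same idle time as one gap
```

With these made-up numbers, the four short gaps never justify sleeping, while the single merged gap does, so the merged schedule uses less energy for the same total idle time.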
Writing Bragg Gratings in Multicore Fibers.

PubMed

Lindley, Emma Y; Min, Seong-Sik; Leon-Saval, Sergio G; Cvetojevic, Nick; Lawrence, Jon; Ellis, Simon C; Bland-Hawthorn, Joss

2016-04-20

Fiber Bragg gratings in multicore fibers can be used as compact and robust filters in astronomical and other research and commercial applications. Strong suppression at a single wavelength requires that all cores have matching transmission profiles. These gratings cannot be inscribed using the same method as for single-core fibers because the curved surface of the cladding acts as a lens, focusing the incoming UV laser beam and causing variations in exposure between cores. Therefore we use an additional optical element to ensure that the beam shape does not change while passing through the cross-section of the multicore fiber. This consists of a glass capillary tube which has been polished flat on one side, which is then placed over the section of the fiber to be inscribed. The laser beam enters the fiber through the flat surface of the capillary tube and hence maintains its original dimensions. This paper demonstrates the improvements in core-to-core uniformity for a 7-core fiber using this method. The technique can be generalized to larger multicore fibers.
Writing Bragg Gratings in Multicore Fibers

PubMed Central

Lindley, Emma Y.; Min, Seong-sik; Leon-Saval, Sergio G.; Cvetojevic, Nick; Lawrence, Jon; Ellis, Simon C.; Bland-Hawthorn, Joss

2016-01-01

Fiber Bragg gratings in multicore fibers can be used as compact and robust filters in astronomical and other research and commercial applications. Strong suppression at a single wavelength requires that all cores have matching transmission profiles. These gratings cannot be inscribed using the same method as for single-core fibers because the curved surface of the cladding acts as a lens, focusing the incoming UV laser beam and causing variations in exposure between cores. Therefore we use an additional optical element to ensure that the beam shape does not change while passing through the cross-section of the multicore fiber. This consists of a glass capillary tube which has been polished flat on one side, which is then placed over the section of the fiber to be inscribed. The laser beam enters the fiber through the flat surface of the capillary tube and hence maintains its original dimensions. This paper demonstrates the improvements in core-to-core uniformity for a 7-core fiber using this method. The technique can be generalized to larger multicore fibers. PMID:27167576
Single-step generation of metal-plasma polymer multicore@shell nanoparticles from the gas phase.

PubMed

Solař, Pavel; Polonskyi, Oleksandr; Olbricht, Ansgar; Hinz, Alexander; Shelemin, Artem; Kylián, Ondřej; Choukourov, Andrei; Faupel, Franz; Biederman, Hynek

2017-08-17

Nanoparticles composed of multiple silver cores and a plasma polymer shell (multicore@shell) were prepared in a single step with a gas aggregation cluster source operating with Ar/hexamethyldisiloxane mixtures and optionally oxygen. The size distribution of the metal inclusions as well as the chemical composition and the thickness of the shells were found to be controlled by the composition of the working gas mixture. Shell matrices ranging from organosilicon plasma polymer to nearly stoichiometric SiO₂ were obtained. The method allows facile fabrication of multicore@shell nanoparticles with tailored functional properties, as demonstrated here with the optical response.

Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

NASA Astrophysics Data System (ADS)

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

2014-06-01

This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10⁶ spatial cells and 1 × 10¹² DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
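As a quick consistency check on the quoted figures, the sustained rate and the fraction of peak (read here as 40.74%) together imply the node's peak performance. The per-core value below is just that peak divided evenly over the 32 cores, which is an assumption about how the peak is distributed.

```python
sustained_gflops = 235.0   # sustained sweep performance quoted in the abstract
fraction_of_peak = 0.4074  # 40.74% of the SMP node peak
cores = 32

# Peak performance implied by the two quoted numbers.
implied_peak_gflops = sustained_gflops / fraction_of_peak

# Even split across cores (an assumption, for a rough per-core figure).
implied_peak_per_core = implied_peak_gflops / cores

print(f"implied node peak: {implied_peak_gflops:.1f} GFlops")
print(f"implied per-core peak: {implied_peak_per_core:.1f} GFlops")
```

This puts the implied node peak at roughly 577 GFlops, or about 18 GFlops per core.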
On Designing Multicore-Aware Simulators for Systems Biology Endowed with OnLine Statistics

PubMed Central

Calcagno, Cristina; Coppo, Mario

2014-01-01

This paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, turning into big data that should be analysed by statistical and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed. PMID:25050327
On designing multicore-aware simulators for systems biology endowed with OnLine statistics.

PubMed

Aldinucci, Marco; Calcagno, Cristina; Coppo, Mario; Damiani, Ferruccio; Drocco, Maurizio; Sciacca, Eva; Spinella, Salvatore; Torquati, Massimo; Troina, Angelo

2014-01-01

This paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, turning into big data that should be analysed by statistical and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.
Tiled architecture of a CNN-mostly IP system

NASA Astrophysics Data System (ADS)

Spaanenburg, Lambert; Malki, Suleyman

2009-05-01

Multi-core architectures have been popularized with the advent of the IBM CELL. On a finer grain, the problems in scheduling multi-cores have already existed in tiled architectures, such as the EPIC and Da Vinci. It is not easy to evaluate the performance of a schedule on such an architecture as historical data are not available. One solution is to compile algorithms for which an optimal schedule is known by analysis. A typical example is an algorithm that is already defined in terms of many collaborating simple nodes, such as a Cellular Neural Network (CNN). A simple node with a local register stack together with a 'rotating wheel' internal communication mechanism has been proposed. Though the basic CNN allows for a tiled implementation of a tiled algorithm on a tiled structure, a practical CNN system will have to disturb this regularity by the additional need for arithmetical and logical operations. Arithmetic operations are needed, for instance, to accommodate low-level image processing, while logical operations are needed to fork and merge different data streams without use of the external memory. It is found that the 'rotating wheel' internal communication mechanism still handles such mechanisms without the need for global control. Overall the CNN system provides for a practical network size as implemented on an FPGA, can be easily used as embedded IP and provides a clear benchmark for a multi-core compiler.
pyPaSWAS: Python-based multi-core CPU and GPU sequence alignment.

PubMed

Warris, Sven; Timal, N Roshan N; Kempenaar, Marcel; Poortinga, Arne M; van de Geest, Henri; Varbanescu, Ana L; Nap, Jan-Peter

2018-01-01

Our previously published CUDA-only application PaSWAS for Smith-Waterman (SW) sequence alignment of any type of sequence on NVIDIA-based GPUs is platform-specific and therefore adopted less widely than it could be. The OpenCL language is supported more widely and allows use on a variety of hardware platforms. Moreover, there is a need to promote the adoption of parallel computing in bioinformatics by making its use and extension simpler through more and better application of high-level languages commonly used in bioinformatics, such as Python. The novel application pyPaSWAS presents the parallel SW sequence alignment code fully packed in Python. It is a generic SW implementation running on several hardware platforms with multi-core systems and/or GPUs that provides accurate sequence alignments that also can be inspected for alignment details. Additionally, pyPaSWAS supports the affine gap penalty. Python libraries are used for automated system configuration, I/O and logging. This way, the Python environment will stimulate further extension and use of pyPaSWAS. pyPaSWAS presents an easy Python-based environment for accurate and retrievable parallel SW sequence alignments on GPUs and multi-core systems. The strategy of integrating Python with high-performance parallel compute languages to create a developer- and user-friendly environment should be considered for other computationally intensive bioinformatics algorithms.
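For readers unfamiliar with Smith-Waterman scoring under an affine gap penalty, the following is a minimal serial sketch of the standard recurrence (often attributed to Gotoh). It is not pyPaSWAS code and does not use its API; the function name and the scoring parameters are illustrative assumptions. A gap of length k costs `gap_open + (k - 1) * gap_extend`.

```python
def sw_affine_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of a and b with affine gap penalties."""
    neg_inf = float("-inf")
    prev_h = [0] * (len(b) + 1)        # best score ending at (i-1, j)
    prev_f = [neg_inf] * (len(b) + 1)  # best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # best score ending in a horizontal gap
        for j in range(1, len(b) + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, cur_h[j - 1] - gap_open)
            cur_f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


score = sw_affine_score("ACGTACGT", "ACGTTTACGT")
```

The cells on each anti-diagonal of this table depend only on earlier anti-diagonals, which is the property that parallel implementations typically exploit.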
Scaling GDL for Multi-cores to Process Planck HFI Beams Monte Carlo on HPC

NASA Astrophysics Data System (ADS)

Coulais, A.; Schellens, M.; Duvert, G.; Park, J.; Arabas, S.; Erard, S.; Roudier, G.; Hivon, E.; Mottet, S.; Laurent, B.; Pinter, M.; Kasradze, N.; Ayad, M.

2014-05-01

After reviewing the major progress made in GDL (now at version 0.9.4) on performance and plotting capabilities since the ADASS XXI paper (Coulais et al. 2012), we detail how a large code for Planck HFI beams Monte Carlo was successfully ported from IDL to GDL on HPC.
An embedded multi-core parallel model for real-time stereo imaging

NASA Astrophysics Data System (ADS)

He, Wenjing; Hu, Jian; Niu, Jingyu; Li, Chuanrong; Liu, Guangyu

2018-04-01

Real-time processing on embedded systems will enhance the application capability of stereo imaging for LiDAR and hyperspectral sensors. Research on task partitioning and scheduling strategies for embedded multiprocessor systems started relatively late compared with that for PCs. In this paper, a parallel model for stereo imaging on an embedded multi-core processing platform is studied and verified. After analyzing the computational load, throughput capacity and buffering requirements, a two-stage pipeline parallel model based on message transmission is established. This model can be applied to fast stereo imaging for airborne sensors with various characteristics. To demonstrate the feasibility and effectiveness of the parallel model, parallel software was designed using test flight data, based on the 8-core DSP processor TMS320C6678. The results indicate that the design performed well in workload distribution and achieved a speed-up ratio of up to 6.4.
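A two-stage pipeline that communicates by message passing can be sketched in a few lines. This is a generic illustration using Python threads and a bounded queue, not the DSP implementation described in the abstract; the stage functions, the queue depth, and the name `run_pipeline` are assumptions.

```python
import queue
import threading


def run_pipeline(frames, stage1, stage2, depth=4):
    """Run stage1 and stage2 concurrently, passing messages through a queue."""
    messages = queue.Queue(maxsize=depth)  # bounded buffer between the stages
    results = []
    done = object()                        # sentinel marking end of stream

    def first_stage():
        for frame in frames:
            messages.put(stage1(frame))    # blocks when the buffer is full
        messages.put(done)

    def second_stage():
        while True:
            item = messages.get()
            if item is done:
                break
            results.append(stage2(item))

    workers = [threading.Thread(target=first_stage),
               threading.Thread(target=second_stage)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


# Hypothetical stages: a preprocessing step followed by a compute step.
outputs = run_pipeline(range(5), lambda x: x + 1, lambda y: y * 2)
```

The bounded queue plays the role of the buffering requirement mentioned in the abstract: it caps how far the first stage can run ahead of the second. For reference, a speed-up of 6.4 on 8 cores corresponds to a parallel efficiency of 0.8.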
Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages

NASA Astrophysics Data System (ADS)

Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel

2018-01-01

This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
Parallel transformation of K-SVD solar image denoising algorithm

NASA Astrophysics Data System (ADS)

Liang, Youwen; Tian, Yu; Li, Mei

2017-02-01

Images obtained by observing the Sun through a large telescope always suffer from noise due to the low SNR. The K-SVD denoising algorithm can effectively remove Gaussian white noise. Training dictionaries for sparse representations is a time-consuming task, due to the large size of the data involved and the complexity of the training algorithms. In this paper, OpenMP is used to transform the serial algorithm into a parallel version based on a data parallelism model. The biggest change is that multiple atoms, rather than a single atom, are updated simultaneously. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. The speedup of the program is 13.563 when using 16 cores. This parallel version can fully utilize multi-core CPU hardware resources, greatly reduce running time and be easily ported to multi-core platforms.
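The reported speedup can be put in context with two standard metrics: parallel efficiency (speedup divided by core count) and the Karp-Flatt experimentally determined serial fraction. Neither metric is computed in the abstract; the calculation below is a sketch using only the two quoted numbers.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1 / speedup - 1 / cores) / (1 - 1 / cores)


efficiency = parallel_efficiency(13.563, 16)
serial_fraction = karp_flatt(13.563, 16)

print(f"efficiency: {efficiency:.3f}")
print(f"serial fraction: {serial_fraction:.4f}")
```

These work out to an efficiency of about 0.85 and a serial fraction of roughly 1.2%, suggesting that nearly all of the measured runtime was parallelized.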
Parallelization of the preconditioned IDR solver for modern multicore computer systems

NASA Astrophysics Data System (ADS)

Bessonov, O. A.; Fedoseyev, A. I.

2012-10-01

This paper presents the analysis, parallelization, and optimization approach for the large sparse matrix solver CNSPACK on modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, and kinetic and quantum problems. It employs the iterative IDR algorithm with ILU preconditioning (with a user-chosen ILU preconditioning order). CNSPACK has been used successfully during the last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, processor architectures and computer system organization have changed dramatically in recent years. Because of this, performance criteria and methods have been revisited, and the solver and preconditioner have been parallelized using the OpenMP environment. Results of the successful implementation for efficient parallelization are presented for the most advanced computer systems (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

DOE PAGES

Dongarra, Jack; Gates, Mark; Haidar, Azzam; ...

2015-01-01

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library, that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
Power splitting of 1 × 16 in multicore photonic crystal fibers

NASA Astrophysics Data System (ADS)

Malka, Dror; Peled, Aaron

2017-09-01

A novel concept of a 1 × 16 power splitter based on a variable multicore photonic crystal fiber (PCF) structure is described. Numerical simulations showed how the optical signal can be split in a PCF structure having dimensions of 60 μm × 60 μm × 3.582 mm. Coupled mode analysis and the beam propagation method (BPM) were used for analyzing the multicore PCF based 1 × 16 splitter. The input optical signal at a wavelength of 1.55 μm inserted into the central core was divided into sixteen output cores, each with 6.25% of the total power. The full width half maximum (FWHM) bandwidth found for each core was 100 nm.
ASC-ATDM Performance Portability Requirements for 2015-2019

DOE Office of Scientific and Technical Information (OSTI.GOV)

Edwards, Harold C.; Trott, Christian Robert

This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a., Kokkos) project for 2015-2019. The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread-parallel programming model and standard C++ library-based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel patterns including parallel-for, parallel-reduce, and parallel-scan. Current R&D is focusing on hierarchical parallel patterns such as a directed acyclic graph (DAG) of asynchronous tasks where each task contains nested data parallel algorithms. This five-year plan includes R&D required to fully and performance portably exploit thread parallelism across current and anticipated next generation platforms (NGP). The Kokkos library is being evaluated by many projects exploring algorithms and code design for NGP. Some production libraries and applications such as Trilinos and LAMMPS have already committed to Kokkos as their foundation for manycore parallelism and performance portability. These five-year requirements include support required for current and anticipated ASC projects to be effective and productive in their use of Kokkos on NGP. The greatest risk to the success of Kokkos and ASC projects relying upon Kokkos is a lack of staffing resources to support Kokkos to the degree needed by these ASC projects. This support includes up-to-date tutorials, documentation, multi-platform (hardware and software stack) testing, minor feature enhancements, thread-scalable algorithm consulting, and managing collaborative R&D.
Novel Designs and Coupling Schemes for Affordable High Energy Laser Modules

DTIC Science & Technology

2007-09-28

possibility of single polarization operation of phase-locked multicore fiber lasers and amplifiers. 5.5. UV...transverse direction (propagation and polarization vectors shown as solid arrows and dashed lines, respectively) having a dipole-like wave front from an...31 5.4. Phase Locking in Monolithic Multicore Fiber Laser..................................................... 38 5.5. UV
Implications of Multi-Core Architectures on the Development of Multiple Independent Levels of Security (MILS) Compliant Systems

DTIC Science & Technology

2012-10-01

REPORT 3. DATES COVERED (From - To) MAR 2010 – APR 2012 4 . TITLE AND SUBTITLE IMPLICATIONS OF MULT-CORE ARCHITECTURES ON THE DEVELOPMENT OF...Framework for Multicore Information Flow Analysis ...................................... 23 4 4.1 A Hypothetical Reference Architecture... 4 Figure 2: Pentium II Block Diagram
"Photonic lantern" spectral filters in multi-core Fiber.

PubMed

Birks, T A; Mangan, B J; Díez, A; Cruz, J L; Murphy, D F

2012-06-18

Fiber Bragg gratings are written across all 120 single-mode cores of a multi-core optical fiber. The fiber is interfaced to multimode ports by tapering it within a depressed-index glass jacket. The result is a compact multimode "photonic lantern" filter with astrophotonic applications. The tapered structure is also an effective mode scrambler.
Secure and Resilient Functional Modeling for Navy Cyber-Physical Systems

DTIC Science & Technology

2017-05-24

Functional Modeling Compiler (SCCT) FM Compiler and Key Performance Indicators (KPI) May 2018 Pending. Model Management Backbone (SCCT) MMB Demonstration...implement the agent- based distributed runtime. - KPIs for single/multicore controllers and temporal/spatial domains. - Integration of the model management ...Distributed Runtime (UCI) Not started. Model Management Backbone (SCCT) Not started. Siemens Corporation Corporate Technology Unrestricted
Analysis of scalability of high-performance 3D image processing platform for virtual colonoscopy

NASA Astrophysics Data System (ADS)

Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli

2014-03-01

One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time that is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that needs to be processed. For this purpose, we previously developed a software platform for high-performance 3D medical image processing, called the HPC 3D-MIP platform, which employs increasingly available and affordable commodity computing systems such as multicore, cluster, and cloud computing systems. To achieve scalable high-performance computing, the platform employed size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D-MIP algorithms, supported task scheduling for efficient load distribution and balancing, and consisted of layered parallel software libraries that allow image processing applications to share common functionalities. We evaluated the performance of the HPC 3D-MIP platform by applying it to computationally intensive processes in virtual colonoscopy. Experimental results showed a 12-fold performance improvement on a workstation with 12-core CPUs over the original sequential implementation of the processes, indicating the efficiency of the platform. Analysis of performance scalability based on Amdahl's law for symmetric multicore chips showed the potential for high performance scalability of the HPC 3D-MIP platform when a larger number of cores is available.
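The scalability analysis mentioned above rests on Amdahl's law. The following is a minimal sketch, not the authors' analysis code: it evaluates the classic law and a symmetric-multicore variant in which a chip with a budget of n base-core equivalents is built from cores that each use r of them. The core performance model perf(r) = sqrt(r) is an assumption commonly made in this kind of model, and all function and parameter names are hypothetical.

```python
import math


def amdahl_speedup(parallel_fraction, cores):
    """Classic Amdahl's law: speedup on `cores` identical cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def symmetric_multicore_speedup(parallel_fraction, budget, r, perf=math.sqrt):
    """Amdahl's law for a symmetric chip with budget/r cores of r resources each.

    perf(r) is the sequential performance of one core built from r base-core
    equivalents; sqrt(r) is an assumed model, not a measured one.
    """
    core_perf = perf(r)
    serial_time = (1.0 - parallel_fraction) / core_perf
    parallel_time = parallel_fraction * r / (core_perf * budget)
    return 1.0 / (serial_time + parallel_time)


# Example: a workload that is 99% parallel on 12, 64, and 256 base cores.
for n in (12, 64, 256):
    print(n, round(amdahl_speedup(0.99, n), 2))
```

With r = 1 the symmetric variant reduces to the classic law, which makes the two functions easy to cross-check.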
Ultrathin endoscopes based on multicore fibers and adaptive optics: a status review and perspectives.

PubMed

Andresen, Esben Ravn; Sivankutty, Siddharth; Tsvirkun, Viktor; Bouwmans, Géraud; Rigneault, Hervé

2016-12-01

We take stock of the progress that has been made into developing ultrathin endoscopes assisted by wave front shaping. We focus our review on multicore fiber-based lensless endoscopes intended for multiphoton imaging applications. We put the work into perspective by comparing with alternative approaches and by outlining the challenges that lie ahead.
Amplification and noise properties of an erbium-doped multicore fiber amplifier.

PubMed

Abedin, K S; Taunay, T F; Fishteyn, M; Yan, M F; Zhu, B; Fini, J M; Monberg, E M; Dimarcello, F V; Wisk, P W

2011-08-15

A multicore erbium-doped fiber (MC-EDF) amplifier for simultaneous amplification in seven cores has been developed, and the gain and noise properties of individual cores have been studied. The pump and signal radiation were coupled to individual cores of the MC-EDF using two tapered fiber bundled (TFB) couplers with low insertion loss. For a pump power of 146 mW, the average gain achieved in the MC-EDF was 30 dB, and the noise figure was less than 4 dB. The net useful gain from the multicore amplifier, after taking all passive losses into consideration, was about 23-27 dB. Pump-induced ASE noise transfer between neighboring channels was negligible. © 2011 Optical Society of America

802.11ac WLAN MIMO radio-over-fiber distributed antenna system for in-building networks based on multicore fiber

NASA Astrophysics Data System (ADS)

Morant, Maria; Llorente, Roberto

2017-01-01

In this work we propose and experimentally evaluate the performance of IEEE 802.11ac WLAN standard signals in radio-over-fiber (RoF) distributed-antenna systems based on multicore fiber (MCF) for in-building WLAN connectivity. The RoF performance of WLAN signals with different bandwidths is investigated, considering up to the IEEE 802.11ac maximum of 160 MHz per user. We evaluate experimentally the performance of WLAN signals employing different modulation and coding schemes achieving bitrates from 78 Mbps to 1404 Mbps per user over distances up to 300 m in a 4-core MCF. The performance of the wireless standard multiple-input multiple-output (MIMO) processing algorithms included in WLAN signals applied to the RoF transmission in MCF optical systems is also evaluated. The impact on the quality of the signal from one of the cores in the MIMO processing is investigated and compared with the results achieved with single-input single-output (SISO) transmission in each core. We measured the error vector magnitude (EVM) and the OFDM data burst information of the received WLAN signals after RoF transmission for different distributed-antenna systems with uni- and bi-directional MCF communication. Finally, we compare the received EVM of a single-antenna system (SISO arrangement) with WLAN systems using two antennas (2×2 MIMO) and four antennas (4×4 MIMO).
A Real-Time Linux for Multicore Platforms

DTIC Science & Technology

2013-12-20

under ARO support) to obtain a fully-functional OS for supporting real-time workloads on multicore platforms. This system, called LITMUS-RT...to be specified as plugin components. LITMUS-RT is open-source software (available at The views, opinions and/or findings contained in this report... LITMUS-RT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems), allows different multiprocessor real-time scheduling and
Nonlinear Light Dynamics in Multi-Core Structures

DTIC Science & Technology

2017-02-27

be generated in continuous-discrete optical media such as multi-core optical fiber or waveguide arrays; localisation dynamics in a continuous... discrete nonlinear system. Detailed theoretical analysis is presented of the existence and stability of the discrete-continuous light bullets using a very...and pulse compression using wave collapse (self-focusing) energy localisation dynamics in a continuous-discrete nonlinear system, as implemented in a
Myalgia as the revealing symptom of multicore disease and fibre type disproportion myopathy

PubMed Central

Sobreira, C; Marques, W; Barreira, A

2003-01-01

Objective: To report the occurrence of myalgia as the revealing symptom of multicore disease and fibre type disproportion myopathy. Methods: The clinical cases of three patients with fibre type disproportion myopathy and one with multicore disease are described. Skeletal muscle biopsies were processed for routine histological and histochemical studies. Results: The clinical picture was unusual in that the symptoms were of late onset and the predominant complaint was muscle pain exacerbated by exercise. Muscle weakness was found in only a single patient, the mother of a patient with fibre type disproportion myopathy. Physical examination was unremarkable in the other patients. Muscle biopsies from patients 1 and 2 contained type I fibres that were considerably smaller than the type II fibres, supporting the diagnosis of fibre type disproportion myopathy. Skeletal muscle of patient 4 showed multiple areas, predominantly but not exclusively in the type I fibres, from which oxidative enzyme activities were absent, as seen in multicore disease. Conclusions: Muscle pain was the main clinical manifestation in our patients. Recognition of the broader clinical expression of these myopathies is important for prognostic reasons and for genetic counselling of the family members. PMID:12933945
A Review of High-Performance Computational Strategies for Modeling and Imaging of Electromagnetic Induction Data

NASA Astrophysics Data System (ADS)

Newman, Gregory A.

2014-01-01

Many geoscientific applications exploit electrostatic and electromagnetic fields to interrogate and map subsurface electrical resistivity—an important geophysical attribute for characterizing mineral, energy, and water resources. In complex three-dimensional geologies, where many of these resources remain to be found, resistivity mapping requires large-scale modeling and imaging capabilities, as well as the ability to treat significant data volumes, which can easily overwhelm single-core and modest multicore computing hardware. To treat such problems requires large-scale parallel computational resources, necessary for reducing the time to solution to a time frame acceptable to the exploration process. The recognition that significant parallel computing processes must be brought to bear on these problems gives rise to choices that must be made in parallel computing hardware and software. In this review, some of these choices are presented, along with the resulting trade-offs. We also discuss future trends in high-performance computing and the anticipated impact on electromagnetic (EM) geophysics. Topics discussed in this review article include a survey of parallel computing platforms, graphics processing units to multicore CPUs with a fast interconnect, along with effective parallel solvers and associated solver libraries effective for inductive EM modeling and imaging.
Efficient parallel linear scaling construction of the density matrix for Born-Oppenheimer molecular dynamics.

PubMed

Mniszewski, S M; Cawkwell, M J; Wall, M E; Mohd-Yusof, J; Bock, N; Germann, T C; Niklasson, A M N

2015-10-13

We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) architecture are also presented. The accuracy and stability of the algorithm are illustrated with long duration Born-Oppenheimer molecular dynamics simulations of 1000 water molecules and a 303 atom Trp cage protein solvated by 2682 water molecules.
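The SP2 recursion named above can be illustrated on a toy problem. The following is a minimal dense-matrix sketch under stated assumptions, not the authors' implementation: it uses plain Python lists instead of the sparse ELLPACK-R format, Gershgorin circles for the spectral bounds, one orbital per electron (no spin factor), and a trace-based rule to choose between the two second-order projections. The function names and the 4 × 4 example Hamiltonian are hypothetical.

```python
def matmul(a, b):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def sp2_density_matrix(h, n_occ, max_iter=100, tol=1e-12):
    """Purify a symmetric Hamiltonian h into a projector onto n_occ states."""
    n = len(h)
    # Gershgorin estimates of the smallest and largest eigenvalues.
    radii = [sum(abs(h[i][j]) for j in range(n) if j != i) for i in range(n)]
    e_min = min(h[i][i] - radii[i] for i in range(n))
    e_max = max(h[i][i] + radii[i] for i in range(n))
    span = e_max - e_min
    # Map the spectrum into [0, 1] with occupied states near 1.
    x = [[((e_max if i == j else 0.0) - h[i][j]) / span for j in range(n)]
         for i in range(n)]
    for _ in range(max_iter):
        x2 = matmul(x, x)
        tr_x, tr_x2 = trace(x), trace(x2)
        if abs(tr_x - tr_x2) < tol:  # tr(X - X^2) vanishes for a projector
            break
        # Pick the projection whose trace lands closer to the occupation.
        if abs(tr_x2 - n_occ) < abs(2.0 * tr_x - tr_x2 - n_occ):
            x = x2
        else:
            x = [[2.0 * x[i][j] - x2[i][j] for j in range(n)] for i in range(n)]
    return x


# Toy Hamiltonian: two decoupled dimers, the lower one fully occupied.
H = [[-1.0, 0.5, 0.0, 0.0],
     [0.5, -1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
P = sp2_density_matrix(H, n_occ=2)
```

Each iteration needs only one matrix-matrix product, which is why the cost is dominated by sparse matrix multiplication when the matrices are stored in a sparse format.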
A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems

NASA Astrophysics Data System (ADS)

McClure, J. E.; Prins, J. F.; Miller, C. T.

2014-07-01

Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven-velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide a speedup of 2.6× compared to a multi-core CPU solution and 1.8× compared to a GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
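The two lattices named above differ only in which discrete velocities they include. The following is a minimal sketch, not code from the paper, that enumerates both velocity sets and attaches the standard D3Q19 weights; the function names are illustrative, and the collision and color-gradient steps of the actual method are not shown.

```python
from fractions import Fraction
from itertools import product


def velocity_set(max_sq_norm):
    """Integer lattice velocities in {-1, 0, 1}^3 with |c|^2 <= max_sq_norm."""
    return [c for c in product((-1, 0, 1), repeat=3)
            if sum(x * x for x in c) <= max_sq_norm]


# D3Q19: rest vector, 6 axis neighbours, 12 face-diagonal neighbours.
D3Q19 = velocity_set(2)
# D3Q7: rest vector and the 6 axis neighbours only.
D3Q7 = velocity_set(1)


def d3q19_weight(c):
    """Standard D3Q19 weights: 1/3 (rest), 1/18 (axis), 1/36 (diagonal)."""
    table = {0: Fraction(1, 3), 1: Fraction(1, 18), 2: Fraction(1, 36)}
    return table[sum(x * x for x in c)]


print(len(D3Q19), len(D3Q7))
```

Because D3Q7 is a subset of D3Q19, the mass transport update touches fewer neighbours per site, which is consistent with assigning it to the CPU cores while the GPU handles the larger momentum update.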
Novel magnetic multicore nanoparticles designed for MPI and other biomedical applications: From synthesis to first in vivo studies

PubMed Central

Taupitz, Matthias; Ariza de Schellenberger, Angela; Kosch, Olaf; Eberbeck, Dietmar; Wagner, Susanne; Trahms, Lutz; Hamm, Bernd; Schnorr, Jörg

2018-01-01

Synthesis of novel magnetic multicore particles (MCP) in the nano range involves alkaline precipitation of iron(II) chloride in the presence of atmospheric oxygen. This step yields green rust, which is oxidized to obtain magnetic nanoparticles, which probably consist of a magnetite/maghemite mixed-phase. Final growth and annealing at 90°C in the presence of a large excess of carboxymethyl dextran gives MCP very promising magnetic properties for magnetic particle imaging (MPI), an emerging medical imaging modality, and magnetic resonance imaging (MRI). The magnetic nanoparticles are biocompatible and thus potential candidates for future biomedical applications such as cardiovascular imaging, sentinel lymph node mapping in cancer patients, and stem cell tracking. The new MCP that we introduce here have three times higher magnetic particle spectroscopy (MPS) performance at lower and middle harmonics and five times higher MPS signal strength at higher harmonics compared with Resovist®. In addition, the new MCP also have improved in vivo MPI performance compared to Resovist®, and we here report the first in vivo MPI investigation of this new generation of magnetic nanoparticles. PMID:29300729
Recent progress in InP/polymer-based devices for telecom and data center applications

NASA Astrophysics Data System (ADS)

Kleinert, Moritz; Zhang, Ziyang; de Felipe, David; Zawadzki, Crispin; Maese Novo, Alejandro; Brinker, Walter; Möhrle, Martin; Keil, Norbert

2015-02-01

Recent progress on polymer-based photonic devices and hybrid photonic integration technology using InP-based active components is presented. High performance thermo-optic components, including compact polymer variable optical attenuators and switches are powerful tools to regulate and control the light flow in the optical backbone. Polymer arrayed waveguide gratings integrated with InP laser and detector arrays function as low-cost optical line terminals (OLTs) in the WDM-PON network. External cavity tunable lasers combined with C/L band thin-film filter, on-chip U-groove and 45° mirrors construct a compact, bi-directional and color-less optical network unit (ONU). A tunable laser integrated with VOAs, TFEs and two 90° hybrids builds the optical front-end of a colorless, dual-polarization coherent receiver. Multicore polymer waveguides and multi-step 45° mirrors are demonstrated as bridging devices between the spatial-division-multiplexing transmission technology using multi-core fibers and the conventional PLC-based photonic platforms, appealing to the fast development of dense 3D photonic integration.
Multiplexed single-mode wavelength-to-time mapping of multimode light

PubMed Central

Chandrasekharan, Harikumar K; Izdebski, Frauke; Gris-Sánchez, Itandehui; Krstajić, Nikola; Walker, Richard; Bridle, Helen L.; Dalgarno, Paul A.; MacPherson, William N.; Henderson, Robert K.; Birks, Tim A.; Thomson, Robert R.

2017-01-01

When an optical pulse propagates along an optical fibre, different wavelengths travel at different group velocities. As a result, wavelength information is converted into arrival-time information, a process known as wavelength-to-time mapping. This phenomenon is most cleanly observed using a single-mode fibre transmission line, where spatial mode dispersion is not present, but the use of such fibres restricts possible applications. Here we demonstrate that photonic lanterns based on tapered single-mode multicore fibres provide an efficient way to couple multimode light to an array of single-photon avalanche detectors, each of which has its own time-to-digital converter for time-correlated single-photon counting. Exploiting this capability, we demonstrate the multiplexed single-mode wavelength-to-time mapping of multimode light using a multicore fibre photonic lantern with 121 single-mode cores, coupled to 121 detectors on a 32 × 32 detector array. This work paves the way to efficient multimode wavelength-to-time mapping systems with the spectral performance of single-mode systems. PMID:28120822
MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.

PubMed

Gonzalez-Dominguez, Jorge; Martin, Maria J

2017-10-10

In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
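The first step described above, pairwise Pearson correlation between gene expression profiles, can be sketched in a few lines. This is a minimal serial illustration, not MPIGeneNet's code: the function names and the tiny expression table are hypothetical, and the Random Matrix Theory thresholding and the MPI distribution of gene pairs are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant profile has zero variance and raises ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)


def correlation_pairs(expression):
    """Correlation for every unordered gene pair; each pair is independent work."""
    genes = sorted(expression)
    return {(g1, g2): pearson(expression[g1], expression[g2])
            for i, g1 in enumerate(genes) for g2 in genes[i + 1:]}


# Hypothetical expression levels of three genes across four samples.
example = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pairs = correlation_pairs(example)
```

Since every pair can be computed independently, the number of pairs grows quadratically with the number of genes, which is what makes this step a natural target for distribution across cluster nodes and cores.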
ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers

PubMed Central

Besnier, Francois; Glover, Kevin A.

2013-01-01

This software package provides an R-based framework to make use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to those users of STRUCTURE dealing with numerous and repeated data analyses, and who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also includes additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as is currently the case in STRUCTURE. The package consists of two main functions: MPI_structure() and parallel_structure() as well as an example data file. We compared the performance in computing time for this example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in terms of computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/. PMID:23923012
Parallel k-means++

DOE Office of Scientific and Technical Information (OSTI.GOV)

A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
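The seed selection step described above can be sketched as follows. This is a minimal serial Python illustration of k-means++ seeding, not the C++ code referred to in the record: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance from the nearest seed already chosen. The function names and sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Choose k initial seeds from `points` using squared-distance weighting."""
    seeds = [rng.choice(points)]
    # Squared distance from every point to its nearest seed so far.
    nearest = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(nearest) == 0.0:
            # Every point coincides with a seed; fall back to a uniform draw.
            candidate = rng.choice(points)
        else:
            candidate = rng.choices(points, weights=nearest, k=1)[0]
        seeds.append(candidate)
        # This per-point update is independent for each point (data parallel).
        nearest = [min(d, squared_distance(p, candidate))
                   for p, d in zip(points, nearest)]
    return seeds


sample_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0),
                 (5.1, 5.0), (10.0, 0.0), (10.0, 0.2)]
seeds = kmeanspp_seeds(sample_points, 3, random.Random(0))
```

The distance update inside the loop is the part that maps naturally onto GPU threads or OpenMP loops, while the weighted draw requires a reduction over all points.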
Hardware design and implementation of fast DOA estimation method based on multicore DSP

NASA Astrophysics Data System (ADS)

Guo, Rui; Zhao, Yingxiao; Zhang, Yue; Lin, Qianqiang; Chen, Zengping

2016-10-01

In this paper, we present a high-speed real-time signal processing hardware platform based on multicore digital signal processor (DSP). The real-time signal processing platform shows several excellent characteristics including high performance computing, low power consumption, large-capacity data storage and high speed data transmission, which make it able to meet the constraint of real-time direction of arrival (DOA) estimation. To reduce the high computational complexity of DOA estimation algorithm, a novel real-valued MUSIC estimator is used. The algorithm is decomposed into several independent steps and the time consumption of each step is counted. Based on the statistics of the time consumption, we present a new parallel processing strategy to distribute the task of DOA estimation to different cores of the real-time signal processing hardware platform. Experimental results demonstrate that the high processing capability of the signal processing platform meets the constraint of real-time direction of arrival (DOA) estimation.
Real-Time Spatio-Temporal Twice Whitening for MIMO Energy Detector

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humble, Travis S; Mitra, Pramita; Barhen, Jacob

2010-01-01

While many techniques exist for local spectrum sensing of a primary user, each represents a computationally demanding task to secondary user receivers. In software-defined radio, computational complexity lengthens the time for a cognitive radio to recognize changes in the transmission environment. This complexity is even more significant for spatially multiplexed receivers, e.g., in SIMO and MIMO, where the spatio-temporal data sets grow in size with the number of antennae. Limits on power and space for the processor hardware further constrain SDR performance. In this report, we discuss improvements in spatio-temporal twice whitening (STTW) for real-time local spectrum sensing by demonstrating a form of STTW well suited for MIMO environments. We implement STTW on the Coherent Logix hx3100 processor, a multicore processor intended for low-power, high-throughput software-defined signal processing. These results demonstrate how coupling the novel capabilities of emerging multicore processors with algorithmic advances can enable real-time, software-defined processing of large spatio-temporal data sets.
Document Image Parsing and Understanding using Neuromorphic Architecture

DTIC Science & Technology

2015-03-01

processing speed at different layers. In the pattern matching layer, the computing power of multicore processors is explored to reduce the processing...developed to reduce the processing speed at different layers. In the pattern matching layer, the computing power of multicore processors is explored... cortex where the complex data is reduced to abstract representations. The abstract representation is compared to stored patterns in massively parallel
Photonic Crystal Fibers

DTIC Science & Technology

2005-12-01

passive and active versions of each fiber designed under this task. Crystal Fibre shall provide characteristics of the fiber fabricated to include core...passive version of multicore fiber iteration 2. 15. SUBJECT TERMS EOARD, Laser physics, Fibre Lasers, Photonic Crystal, Multicore, Fiber Laser 16...CRYSTAL FIBRE INTRODUCTION This report describes the photonic crystal fibers developed under agreement No FA8655-o5-a- 3046. All
Multi-Core Processors: An Enabling Technology for Embedded Distributed Model-Based Control (Postprint)

DTIC Science & Technology

2008-07-01

generation of process partitioning, a thread pipelining becomes possible. In this paper we briefly summarize the requirements and trends for FADEC based... FADEC environment, presenting a hypothetical realization of an example application. Finally we discuss the application of Time-Triggered...based control applications of the future. 15. SUBJECT TERMS Gas turbine, FADEC, Multi-core processing technology, disturbed based control
GoCxx: a tool to easily leverage C++ legacy code for multicore-friendly Go libraries and frameworks

NASA Astrophysics Data System (ADS)

Binet, Sébastien

2012-12-01

Current HENP libraries and frameworks were written before multicore systems became widely deployed and used. From this environment, a ‘single-thread’ processing model naturally emerged but the implicit assumptions it encouraged are greatly impairing our abilities to scale in a multicore/manycore world. Writing scalable code in C++ for multicore architectures, while doable, is no panacea. Sure, C++11 will improve on the current situation (by standardizing on std::thread, introducing lambda functions and defining a memory model) but it will do so at the price of complicating further an already quite sophisticated language. This level of sophistication has probably already strongly motivated analysis groups to migrate to CPython, hoping for its current limitations with respect to multicore scalability to be either lifted (Global Interpreter Lock removal) or for the advent of a new Python VM better tailored for this kind of environment (PyPy, Jython, …). Could HENP migrate to a language with none of the deficiencies of C++ (build time, deployment, low level tools for concurrency) and with the fast turn-around time, simplicity and ease of coding of Python? This paper will try to make the case for Go - a young open source language with built-in facilities to easily express and expose concurrency - being such a language. We introduce GoCxx, a tool leveraging gcc-xml's output to automate the tedious work of creating Go wrappers for foreign languages, a critical task for any language wishing to leverage legacy and field-tested code. We will conclude with the first results of applying GoCxx to real C++ code.
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

PubMed Central

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-01-01

This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a task scheduling algorithm oriented to heterogeneous multi-core systems and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, because the power consumption of most WSN applications is data dependent, we introduce a branch handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real time on a lightweight embedded processor (MSP430), and that it enables a system to do more valuable work and to use more than 99.9% of the power budget. PMID:28208730

Experimental investigation of inter-core crosstalk tolerance of MIMO-OFDM/OQAM radio over multicore fiber system.

PubMed

He, Jiale; Li, Borui; Deng, Lei; Tang, Ming; Gan, Lin; Fu, Songnian; Shum, Perry Ping; Liu, Deming

2016-06-13

In this paper, the feasibility of space division multiplexing for optical wireless fronthaul systems is experimentally demonstrated by implementing high speed MIMO-OFDM/OQAM radio signals over 20 km of 7-core fiber and a 0.4 m wireless link. Moreover, the impact of optical inter-core crosstalk in multicore fibers on the proposed MIMO-OFDM/OQAM radio over fiber system is experimentally evaluated in both SISO and MIMO configurations for comparison. The experimental results show that the inter-core crosstalk tolerance of the proposed radio over fiber system can be relaxed to -10 dB by using the proposed MIMO-OFDM/OQAM processing. These results could guide high density multicore fiber design to support a large number of antenna modules and a higher density of radio-access points for potential applications in 5G cellular systems.
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC.

PubMed

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-02-08

This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a task scheduling algorithm oriented to heterogeneous multi-core systems and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, because the power consumption of most WSN applications is data dependent, we introduce a branch handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real time on a lightweight embedded processor (MSP430), and that it enables a system to do more valuable work and to use more than 99.9% of the power budget.
A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R

As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
Multicore Architectures for Multiple Independent Levels of Security Applications

DTIC Science & Technology

2012-09-01

to bolster the MILS effort. However, current MILS operating systems are not designed for multi-core platforms. They do not have the hardware support...current MILS operating systems are not designed for multi-core platforms. They do not have the hardware support to ensure that the separation...the availability of information at different security classification levels while increasing the overall security of the computing system. Due to the
Polytopol computing for multi-core and distributed systems

NASA Astrophysics Data System (ADS)

Spaanenburg, Henk; Spaanenburg, Lambert; Ranefors, Johan

2009-05-01

Multi-core computing provides new challenges to software engineering. The paper addresses such issues in the general setting of polytopol computing, that takes multi-core problems in such widely differing areas as ambient intelligence sensor networks and cloud computing into account. It argues that the essence lies in a suitable allocation of free moving tasks. Where hardware is ubiquitous and pervasive, the network is virtualized into a connection of software snippets judiciously injected to such hardware that a system function looks as one again. The concept of polytopol computing provides a further formalization in terms of the partitioning of labor between collector and sensor nodes. Collectors provide functions such as a knowledge integrator, awareness collector, situation displayer/reporter, communicator of clues and an inquiry-interface provider. Sensors provide functions such as anomaly detection (only communicating singularities, not continuous observation), they are generally powered or self-powered, amorphous (not on a grid) with generation-and-attrition, field re-programmable, and sensor plug-and-play-able. Together the collector and the sensor are part of the skeleton injector mechanism, added to every node, and give the network the ability to organize itself into some of many topologies. Finally we will discuss a number of applications and indicate how a multi-core architecture supports the security aspects of the skeleton injector.
Performance analysis of distributed symmetric sparse matrix vector multiplication algorithm for multi-core architectures

DOE PAGES

Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; ...

2015-07-14

Sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in the literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use a hybrid message passing interface (MPI) + open multi-processing (OpenMP) programming model for parallelism. We analyze the performance of a recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important features of this implementation and compare it with previously reported implementations that do not exploit the underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of a topology-aware mapping heuristic using a simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topologies. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.
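As a rough illustration of what exploiting matrix symmetry means for this kernel, the sketch below stores only the upper triangle and lets each off-diagonal entry update two output elements. It is a serial toy, not the distributed MPI + OpenMP implementation the abstract analyzes; the function name `symmetric_spmv` and the dictionary storage format are assumptions made for this example.

```python
def symmetric_spmv(n, upper, x):
    """Compute y = A x for a symmetric n x n matrix A.

    `upper` maps (row, col) with row <= col to the stored value, so each
    off-diagonal entry is kept once and used twice (hypothetical format).
    """
    y = [0.0] * n
    for (i, j), a in upper.items():
        y[i] += a * x[j]          # contribution of A[i][j]
        if i != j:
            y[j] += a * x[i]      # mirrored contribution of A[j][i]
    return y


# A = [[2, 1, 0],
#      [1, 3, 4],
#      [0, 4, 5]]  stored as its upper triangle only.
upper = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0, (1, 2): 4.0, (2, 2): 5.0}
y = symmetric_spmv(3, upper, [1.0, 2.0, 3.0])
print(y)
```

Storing half of the off-diagonal entries roughly halves the matrix data that has to be read, which matters for a kernel with low arithmetic intensity. The price is the scattered update `y[j] += ...`, which needs care once rows are split across threads or processes.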
Micromagnetics on high-performance workstation and mobile computational platforms

NASA Astrophysics Data System (ADS)

Fu, S.; Chang, R.; Couture, S.; Menarini, M.; Escobar, M. A.; Kuteifan, M.; Lubarda, M.; Gabay, D.; Lomakin, V.

2015-05-01

The feasibility of using high-performance desktop and embedded mobile computational platforms is presented, including multi-core Intel central processing unit, Nvidia desktop graphics processing units, and Nvidia Jetson TK1 Platform. FastMag finite element method-based micromagnetic simulator is used as a testbed, showing high efficiency on all the platforms. Optimization aspects of improving the performance of the mobile systems are discussed. The high performance, low cost, low power consumption, and rapid performance increase of the embedded mobile systems make them a promising candidate for micromagnetic simulations. Such architectures can be used as standalone systems or can be built as low-power computing clusters.
Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU

NASA Astrophysics Data System (ADS)

Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang

2017-10-01

An imaging spectrometer acquires a two-dimensional spatial image and a one-dimensional spectrum at the same time, which makes it highly useful in color and spectral measurement, true-color image synthesis, military reconnaissance and so on. To achieve fast reconstruction of Fourier transform imaging spectrometer data, the paper designs an optimized reconstruction algorithm using OpenMP parallel computing technology, which was further applied to optimizing the processing for the HyperSpectral Imager of the Chinese 'HJ-1' satellite. The results show that the method based on multi-core parallel computing can use multi-core CPU hardware resources effectively and significantly improve the efficiency of spectrum reconstruction processing. If the technology is applied to workstations with more cores, it will be possible to complete real-time processing of Fourier transform imaging spectrometer data with a single computer.
Discrimination of Temperature and Strain in Brillouin Optical Time Domain Analysis Using a Multicore Optical Fiber

PubMed Central

Zaghloul, Mohamed A. S.; Wang, Mohan; Milione, Giovanni; Li, Ming-Jun; Li, Shenping; Huang, Yue-Kai; Wang, Ting; Chen, Kevin P.

2018-01-01

Brillouin optical time domain analysis is the sensing of temperature and strain changes along an optical fiber by measuring the frequency shift changes of Brillouin backscattering. Because frequency shift changes are a linear combination of temperature and strain changes, their discrimination is a challenge. Here, a multicore optical fiber that has two cores is fabricated. The differences between the cores’ temperature and strain coefficients are such that temperature (strain) changes can be discriminated with error amplification factors of 4.57 °C/MHz (69.11 μϵ/MHz), which is 2.63 (3.67) times lower than previously demonstrated. As proof of principle, using the multicore optical fiber and a commercial Brillouin optical time domain analyzer, the temperature (strain) changes of a thermally expanding metal cylinder are discriminated with an error of 0.24% (3.7%). PMID:29649148
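The discrimination step described above can be written as a small linear system. The following is a sketch under assumed notation: Δν₁ and Δν₂ are the Brillouin frequency shift changes in the two cores, and C_{T,i} and C_{ε,i} are the temperature and strain coefficients of core i. These symbols are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\Delta\nu_1 &= C_{T,1}\,\Delta T + C_{\varepsilon,1}\,\Delta\varepsilon,\\
\Delta\nu_2 &= C_{T,2}\,\Delta T + C_{\varepsilon,2}\,\Delta\varepsilon,
\end{aligned}
\qquad\Longrightarrow\qquad
\begin{pmatrix}\Delta T\\ \Delta\varepsilon\end{pmatrix}
= \frac{1}{D}
\begin{pmatrix} C_{\varepsilon,2} & -C_{\varepsilon,1}\\ -C_{T,2} & C_{T,1}\end{pmatrix}
\begin{pmatrix}\Delta\nu_1\\ \Delta\nu_2\end{pmatrix},
\qquad
D = C_{T,1}C_{\varepsilon,2} - C_{T,2}C_{\varepsilon,1}.
```

The entries of the inverse matrix scale as 1/D, so the more the two cores' coefficients differ, the larger |D| becomes and the less a given frequency measurement error is amplified in the recovered ΔT and Δε. This is consistent with the error amplification factors being quoted per MHz.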
Optimization of multicore-shell Fe3O4-SiO2 magnetic nanocomposites synthesis and retention in cellulose pulp

NASA Astrophysics Data System (ADS)

Buteica, Dan; Borbath, Istvan; Nicolae, Ionel Valentin; Turcu, Rodica; Marinica, Oana; Socoliuc, Vlad

2017-12-01

The use of magnetite nanoparticles to produce magnetic paper has a severe effect on the color of the paper, so it is worth searching for means to alleviate it. Multicore-shell Fe3O4-SiO2 magnetic nanocomposites were synthesized. The nanocomposite powder was dispersed in cellulose pulp and paper was produced by dehydration on a Rapid Kothen machine. The nanocomposite retention efficiency was investigated in correlation with the nanocomposite shell thickness, the resinous vs. deciduous fiber content of the cellulose pulp, the grinding degree of the long and short fibers, and the cationic starch and polymeric retention agent content of the pulp. The whiteness and magnetization were measured for all paper samples. It was shown that the use of multicore-shell magnetic nanocomposites leads to weaker paper coloring. This effect is enhanced by increasing the polymeric retention agent content of the pulp, in spite of the higher composite content.
Discrimination of Temperature and Strain in Brillouin Optical Time Domain Analysis Using a Multicore Optical Fiber.

PubMed

Zaghloul, Mohamed A S; Wang, Mohan; Milione, Giovanni; Li, Ming-Jun; Li, Shenping; Huang, Yue-Kai; Wang, Ting; Chen, Kevin P

2018-04-12

Brillouin optical time domain analysis is the sensing of temperature and strain changes along an optical fiber by measuring the frequency shift changes of Brillouin backscattering. Because frequency shift changes are a linear combination of temperature and strain changes, their discrimination is a challenge. Here, a multicore optical fiber that has two cores is fabricated. The differences between the cores' temperature and strain coefficients are such that temperature (strain) changes can be discriminated with error amplification factors of 4.57 °C/MHz (69.11 μϵ/MHz), which is 2.63 (3.67) times lower than previously demonstrated. As proof of principle, using the multicore optical fiber and a commercial Brillouin optical time domain analyzer, the temperature (strain) changes of a thermally expanding metal cylinder are discriminated with an error of 0.24% (3.7%).
BilKristal 2.0: A tool for pattern information extraction from crystal structures

NASA Astrophysics Data System (ADS)

Okuyan, Erhan; Güdükbay, Uğur

2014-01-01

We present a revised version of the BilKristal tool of Okuyan et al. (2007). We converted the development environment to Microsoft Visual Studio 2005 in order to resolve compatibility issues. We added multi-core CPU support and improved the graphics functions in order to improve performance. Discovered bugs were fixed and functionality for exporting to a material visualization tool was added.
Mobile Thread Task Manager

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.

2013-01-01

The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTM architecture wraps access to data with thread management to move threads to the home processor for that data, so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
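The two ideas in this abstract, worker threads drawing jobs from a global queue and a user-defined rule mapping data identifiers to a home processor, can be mocked up in a few lines. The sketch below is not the MTTM code or its API: `home_processor`, `NUM_PROCESSORS`, and `run` are hypothetical names, and the migration of a thread to the home processor is only simulated by recording the processor ID.

```python
import queue
import threading

NUM_PROCESSORS = 4  # hypothetical processor count, set independently of thread count


def home_processor(data_id):
    """Hypothetical user-defined homing rule: identifier -> processor ID."""
    return data_id % NUM_PROCESSORS


def worker(tasks, results, lock):
    """Draw jobs from the global queue until it is empty."""
    while True:
        try:
            data_id, value = tasks.get_nowait()
        except queue.Empty:
            return
        # A real system would migrate this thread to `cpu` before touching
        # the data; here the home processor is only recorded.
        cpu = home_processor(data_id)
        with lock:
            results.setdefault(cpu, []).append(value * value)  # "map" step


def run(values, num_threads=3):
    tasks = queue.Queue()
    for data_id, value in enumerate(values):
        tasks.put((data_id, value))
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


by_home = run(list(range(8)))
total = sum(sum(v) for v in by_home.values())  # "reduce" step
print(by_home, total)
```

Keeping the thread count and the processor count as separate parameters mirrors the tuning knob the abstract mentions. Which thread handles which job is decided at run time by the shared queue, which is what provides the load balancing.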
LIBS data analysis using a predictor-corrector based digital signal processor algorithm

NASA Astrophysics Data System (ADS)

Sanders, Alex; Griffin, Steven T.; Robinson, Aaron

2012-06-01

There are many accepted sensor technologies for generating spectra for material classification. Once the spectra are generated, communication bandwidth limitations favor local material classification with its attendant reduction in data transfer rates and power consumption. Transferring sensor technologies such as Cavity Ring-Down Spectroscopy (CRDS) and Laser Induced Breakdown Spectroscopy (LIBS) requires effective material classifiers. A result of recent efforts has been an emphasis on Partial Least Squares - Discriminant Analysis (PLS-DA) and Principal Component Analysis (PCA). Implementation of these via general purpose computers is difficult in small portable sensor configurations. This paper addresses the creation of a low mass, low power, robust hardware spectra classifier for a limited set of predetermined materials in an atmospheric matrix. Crucial to this is the incorporation of PCA or PLS-DA classifiers into a predictor-corrector style implementation. The system configuration guarantees rapid convergence. Software running on multi-core Digital Signal Processors (DSPs) simulates a streamlined plasma physics model estimator, reducing Analog-to-Digital Converter (ADC) power requirements. This paper presents the results of a predictor-corrector model implemented on a low power multi-core DSP to perform substance classification. This configuration emphasizes the hardware system and software design via a predictor-corrector model that simultaneously decreases the sample rate while performing the classification.
QR-decomposition based SENSE reconstruction using parallel architecture.

PubMed

Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad

2018-04-01

Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advanced MRI algorithms on a parallel architecture (to exploit inherent parallelism) has great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies the inversion of a rectangular encoding matrix. This work presents a novel implementation of a GPU-based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU-based SENSE reconstruction is evaluated against single-core and multi-core CPU implementations using OpenMP. Several experiments with various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that the GPU significantly reduces the computation time of SENSE reconstruction as compared to a multi-core CPU (approximately 12x speedup) and a single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
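To make "inversion of a rectangular encoding matrix by QR decomposition" concrete, the sketch below solves a small over-determined system in the least-squares sense: factor E = QR, then back-substitute R x = Qᵀb. It is a serial, real-valued toy using modified Gram-Schmidt, chosen here for brevity. The paper's GPU implementation, its QR variant, and its complex-valued coil data are not reproduced, and all function names and the example matrix are assumptions.

```python
def qr_gram_schmidt(a):
    """Thin QR factorization of an m x n matrix (list of rows) with full
    column rank, via modified Gram-Schmidt. Returns (columns of Q, R)."""
    m, n = len(a), len(a[0])
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(m)]
        for k, q in enumerate(q_cols):
            r[k][j] = sum(q[i] * v[i] for i in range(m))
            v = [v[i] - r[k][j] * q[i] for i in range(m)]
        r[j][j] = sum(x * x for x in v) ** 0.5
        q_cols.append([x / r[j][j] for x in v])
    return q_cols, r


def solve_least_squares(a, b):
    """Minimize ||a x - b|| by solving R x = Q^T b with back substitution."""
    q_cols, r = qr_gram_schmidt(a)
    n = len(r)
    qtb = [sum(q[i] * b[i] for i in range(len(b))) for q in q_cols]
    x = [0.0] * n
    for j in reversed(range(n)):
        s = sum(r[j][k] * x[k] for k in range(j + 1, n))
        x[j] = (qtb[j] - s) / r[j][j]
    return x


# Hypothetical 3-coil, 2-pixel encoding matrix (more rows than columns).
encoding = [[1.0, 0.5], [0.8, 1.0], [0.3, 0.9]]
truth = [2.0, -1.0]
measured = [sum(e * t for e, t in zip(row, truth)) for row in encoding]
estimate = solve_least_squares(encoding, measured)
print(estimate)
```

In a SENSE-style reconstruction, many such small systems arise independently (one per group of aliased pixels), which is why the problem maps naturally onto parallel hardware.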
Scalable and Power Efficient Data Analytics for Hybrid Exascale Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Choudhary, Alok; Samatova, Nagiza; Wu, Kesheng

This project developed a generic and optimized set of core data analytics functions. These functions organically consolidate a broad constellation of high performance analytical pipelines. As the architectures of emerging HPC systems become inherently heterogeneous, there is a need to design algorithms for data analysis kernels accelerated on hybrid multi-node, multi-core HPC architectures comprised of a mix of CPUs, GPUs, and SSDs. Furthermore, the power-aware trend drives the advances in our performance-energy tradeoff analysis framework which enables our data analysis kernels algorithms and software to be parameterized so that users can choose the right power-performance optimizations.
Spectral efficiency in crosstalk-impaired multi-core fiber links

NASA Astrophysics Data System (ADS)

Luís, Ruben S.; Puttnam, Benjamin J.; Rademacher, Georg; Klaus, Werner; Agrell, Erik; Awaji, Yoshinari; Wada, Naoya

2018-02-01

We review the latest advances on ultra-high throughput transmission using crosstalk-limited single-mode multicore fibers and compare these with the theoretical spectral efficiency of such systems. We relate the crosstalk-imposed spectral efficiency limits with fiber parameters, such as core diameter, core pitch, and trench design. Furthermore, we investigate the potential of techniques such as direction interleaving and high-order MIMO to improve the throughput or reach of these systems when using various modulation formats.
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

NASA Technical Reports Server (NTRS)

Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

2015-01-01

This paper evaluates the potential of the embedded Graphics Processing Unit in NVIDIA's Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) algorithm. The Tegra K1 achieved 51% for the ACCA algorithm and 20% for the dimension reduction algorithm, as compared to the performance of a high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
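The closing comparison can be turned into a rough performance-per-watt estimate. The sketch below assumes that the figures 51 and 20 are percentages of the Xeon's performance and that "13.5 times higher power consumption" means the Xeon draws 13.5 times the power of the Tegra K1. Both readings are interpretations of the abstract, not statements from the paper, and the function name is made up for this example.

```python
XEON_POWER_RATIO = 13.5  # assumed: Xeon power / Tegra K1 power


def perf_per_watt_ratio(relative_performance, power_ratio=XEON_POWER_RATIO):
    """Tegra K1 performance per watt divided by Xeon performance per watt.

    relative_performance is Tegra K1 performance as a fraction of the Xeon's.
    """
    return relative_performance * power_ratio


acca = perf_per_watt_ratio(0.51)                 # ACCA algorithm
dimension_reduction = perf_per_watt_ratio(0.20)  # wavelet dimension reduction
print(round(acca, 3), round(dimension_reduction, 3))
```

Under these assumptions the embedded GPU would deliver roughly 6.9 times (ACCA) and 2.7 times (dimension reduction) the work per unit of power, even though its absolute throughput is lower.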
Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing

PubMed Central

Kang, Mikyung; Kang, Dong-In; Crago, Stephen P.; Park, Gyung-Leen; Lee, Junghoon

2011-01-01

Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM), a piece of system software that monitors application behavior at run-time, analyzes the collected information, and optimizes cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as the underlying hardware through a performance counter, optimizing its computing configuration based on the analyzed data. PMID:22163811
Real-Time Agent-Based Modeling Simulation with in-situ Visualization of Complex Biological Systems: A Case Study on Vocal Fold Inflammation and Healing.

PubMed

Seekhao, Nuttiiya; Shung, Caroline; JaJa, Joseph; Mongeau, Luc; Li-Jessen, Nicole Y K

2016-05-01

We present an efficient and scalable scheme for implementing agent-based modeling (ABM) simulation with In Situ visualization of large complex systems on heterogeneous computing platforms. The scheme is designed to make optimal use of the resources available on a heterogeneous platform consisting of a multicore CPU and a GPU, resulting in minimal to no resource idle time. Furthermore, the scheme was implemented under a client-server paradigm that enables remote users to visualize and analyze simulation data as it is being generated at each time step of the model. Performance of a simulation case study of vocal fold inflammation and wound healing with 3.8 million agents shows 35× and 7× speedup in execution time over single-core and multi-core CPU respectively. Each iteration of the model took less than 200 ms to simulate, visualize and send the results to the client. This enables users to monitor the simulation in real-time and modify its course as needed.

Design and development of a run-time monitor for multi-core architectures in cloud computing.

PubMed

Kang, Mikyung; Kang, Dong-In; Crago, Stephen P; Park, Gyung-Leen; Lee, Junghoon

2011-01-01

Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM), a piece of system software that monitors application behavior at run-time, analyzes the collected information, and optimizes cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as the underlying hardware through a performance counter, optimizing its computing configuration based on the analyzed data.
Evaluation of a Multicore-Optimized Implementation for Tomographic Reconstruction

PubMed Central

Agulleiro, Jose-Ignacio; Fernández, José Jesús

2012-01-01

Tomography allows elucidation of the three-dimensional structure of an object from a set of projection images. In life sciences, electron microscope tomography is providing invaluable information about the cell structure at a resolution of a few nanometres. Here, large images are required to combine wide fields of view with high resolution requirements. The computational complexity of the algorithms along with the large image size then turns tomographic reconstruction into a computationally demanding problem. Traditionally, high-performance computing techniques have been applied to cope with such demands on supercomputers, distributed systems and computer clusters. In the last few years, the trend has turned towards graphics processing units (GPUs). Here we present a detailed description and a thorough evaluation of an alternative approach that relies on exploitation of the power available in modern multicore computers. The combination of single-core code optimization, vector processing, multithreading and efficient disk I/O operations succeeds in providing fast tomographic reconstructions on standard computers. The approach turns out to be competitive with the fastest GPU-based solutions thus far. PMID:23139768
Self-powered information measuring wireless networks using the distribution of tasks within multicore processors

NASA Astrophysics Data System (ADS)

Zhuravska, Iryna M.; Koretska, Oleksandra O.; Musiyenko, Maksym P.; Surtel, Wojciech; Assembay, Azat; Kovalev, Vladimir; Tleshova, Akmaral

2017-08-01

The article presents basic approaches to developing self-powered information measuring wireless networks (SPIM-WN) for critical applications using the distribution of tasks within multicore processors, based on the interaction of movable components, both for data transmission and for the wireless transfer of energy coming from polymetric sensors. The base mathematical model for scheduling tasks within multiprocessor systems was modernized to schedule and allocate tasks between the cores of a one-crystal computer (SoC) in order to increase the energy efficiency of SPIM-WN objects.
A multicore compound glass optical fiber for neutron imaging

NASA Astrophysics Data System (ADS)

Moore, Michael; Zhang, Xiaodong; Feng, Xian; Brambilla, Gilberto; Hayward, Jason

2017-04-01

Optical fibers have been successfully utilized for point sensors targeting physical quantities (stress, strain, rotation, acceleration), chemical compounds (humidity, oil, nitrates, alcohols, DNA) or radiation fields (X-rays, β particles, γ-rays). Similarly, bundles of fibers have been extremely successful in imaging visible wavelengths for medical endoscopy and industrial borescopy. This work presents the progress in the fabrication and experimental evaluation of multicore fiber as neutron scattering instrumentation designed to detect and image neutrons with micron level spatial resolution.
Fully-elastic multi-granular network with space/frequency/time switching using multi-core fibres and programmable optical nodes.

PubMed

Amaya, N; Irfan, M; Zervas, G; Nejabati, R; Simeonidou, D; Sakaguchi, J; Klaus, W; Puttnam, B J; Miyazawa, T; Awaji, Y; Wada, N; Henning, I

2013-04-08

We present the first elastic, space division multiplexing, and multi-granular network based on two 7-core MCF links and four programmable optical nodes able to switch traffic utilising the space, frequency and time dimensions with over 6000-fold bandwidth granularity. Results show good end-to-end performance on all channels with power penalties between 0.75 dB and 3.7 dB.
Optimal Configuration and Deployment of Software on Multi-Core Processing Architectures

DTIC Science & Technology

2008-07-01

between the event generating threads and the collector thread is implemented through semaphores. The Perseus data logger is designed to minimize the...performance counters (through the PAPI API) and opens up access to the shared memory logger through a semaphore and Remote Procedure Call (RPC) buffer... synchronization events. Using this rich data, the TMAM is able to output all of the information necessary to identify precisely which pairs of thread
First experience with particle-in-cell plasma physics code on ARM-based HPC systems

NASA Astrophysics Data System (ADS)

Sáez, Xavier; Soba, Alejandro; Sánchez, Edilberto; Mantsinen, Mervi; Mateo, Sergi; Cela, José M.; Castejón, Francisco

2015-09-01

In this work, we explore the feasibility of porting a particle-in-cell code (EUTERPE) to an ARM multi-core platform from the Mont-Blanc project. The prototype used is based on a Samsung Exynos 5 system-on-chip with an integrated GPU. It is the first prototype that could be used for High-Performance Computing (HPC), since it supports double precision and parallel programming languages.
Cyber-Physical Multi-Core Optimization for Resource and Cache Effects (C2ORES)

DTIC Science & Technology

2014-03-01

DoD-sponsored ATAACK mobile cloud testbed funded through the DURIP program, which is deployed at Virginia Tech and Vanderbilt University to conduct...0.9.2. Jug was configured to use a filesystem (network file system (nfs)) backend for locking and task synchronization. 4.1.7.2 Experiment 1...and performance-aware virtual machine placement technique that is realized as cloud infrastructure middleware. The key contributions of iPlace include
Extreme-scale Algorithms and Solver Resilience

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dongarra, Jack

A widening gap exists between the peak performance of high-performance computers and the performance achieved by complex applications running on these platforms. Over the next decade, extreme-scale systems will present major new challenges to algorithm development that could amplify this mismatch in such a way that it prevents the productive use of future DOE Leadership computers due to the following: extreme levels of parallelism due to multicore processors; an increase in system fault rates requiring algorithms to be resilient beyond just checkpoint/restart; complex memory hierarchies and costly data movement in both energy and performance; heterogeneous system architectures (mixing CPUs, GPUs, etc.); and conflicting goals of performance, resilience, and power requirements.
A New Generation of Real-Time Systems in the JET Tokamak

NASA Astrophysics Data System (ADS)

Alves, Diogo; Neto, Andre C.; Valcarcel, Daniel F.; Felton, Robert; Lopez, Juan M.; Barbalace, Antonio; Boncagni, Luca; Card, Peter; De Tommasi, Gianmaria; Goodyear, Alex; Jachmich, Stefan; Lomas, Peter J.; Maviglia, Francesco; McCullen, Paul; Murari, Andrea; Rainford, Mark; Reux, Cedric; Rimini, Fernanda; Sartori, Filippo; Stephen, Adam V.; Vega, Jesus; Vitelli, Riccardo; Zabeo, Luca; Zastrow, Klaus-Dieter

2014-04-01

Recently, a new recipe for developing and deploying real-time systems has become increasingly adopted in the JET tokamak. Powered by the advent of x86 multi-core technology and the reliability of JET's well established Real-Time Data Network (RTDN) to handle all real-time I/O, an official Linux vanilla kernel has been demonstrated to be able to provide real-time performance to user-space applications that are required to meet stringent timing constraints. In particular, a careful rearrangement of the Interrupt ReQuests' (IRQs) affinities together with the kernel's CPU isolation mechanism allows one to obtain either soft or hard real-time behavior depending on the synchronization mechanism adopted. Finally, the Multithreaded Application Real-Time executor (MARTe) framework is used for building applications particularly optimised for exploring multi-core architectures. In the past year, four new systems based on this philosophy have been installed and are now part of JET's routine operation. The focus of the present work is on the configuration aspects that enable these new systems' real-time capability. Details are given about the common real-time configuration of these systems, followed by a brief description of each system together with results regarding their real-time performance. A cycle time jitter analysis of a user-space MARTe based application synchronizing over a network is also presented. The goal is to compare its deterministic performance while running on a vanilla and on a Messaging Real time Grid (MRG) Linux kernel.
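As an illustration of what a cycle time jitter analysis involves, the following minimal sketch computes basic jitter statistics from a list of cycle start timestamps. It is not taken from the MARTe framework; the function name `cycle_jitter`, the dictionary keys, and the choice of statistics are assumptions made for this example.

```python
from statistics import mean, pstdev


def cycle_jitter(timestamps, nominal_period):
    """Summarize cycle-time jitter for a periodic task.

    timestamps: cycle start times in seconds, in increasing order.
    nominal_period: the intended cycle time in seconds.
    """
    # Measured cycle time is the gap between consecutive cycle starts.
    cycles = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    # Jitter is the deviation of each measured cycle from the nominal period.
    deviations = [cycle - nominal_period for cycle in cycles]
    return {
        "mean_cycle": mean(cycles),
        "std_cycle": pstdev(cycles),
        "max_abs_jitter": max(abs(d) for d in deviations),
    }


if __name__ == "__main__":
    # Hypothetical timestamps for a task that should run once per second.
    print(cycle_jitter([0.0, 1.0, 2.5, 3.5], nominal_period=1.0))
```

Running the same analysis on timestamps recorded under two kernels would give one simple way to compare their determinism, although a real study would normally look at the full distribution of cycle times rather than three summary numbers.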
A multi-core fiber based interferometer for high temperature sensing

NASA Astrophysics Data System (ADS)

Zhou, Song; Huang, Bo; Shu, Xuewen

2017-04-01

In this paper, we have verified and implemented a Mach-Zehnder interferometer based on seven-core fiber for high temperature sensing applications. The proposed structure is a multi-mode-multi-core-multi-mode fiber structure sandwiched by single-mode fiber. Between the single-mode and multi-core fiber, a 3 mm long multi-mode fiber is formed to lead light in and out. The device operates mainly through multi-core modes; single-mode and multi-mode interference coupling is also utilized. Experimental results indicate that this interferometer sensor is capable of accurate measurements of temperatures up to 800 °C, and the temperature sensitivity of the proposed sensor is as high as 170.2 pm/°C, which is much higher than that of existing MZI based temperature sensors (109 pm/°C). This type of sensor is promising for practical high temperature applications due to its advantages including high sensitivity, simple fabrication process, low cost and compactness.
Group delay spread analysis of coupled-multicore fibers: A comparison between weak and tight bending conditions

NASA Astrophysics Data System (ADS)

Fujisawa, Takeshi; Saitoh, Kunimasa

2017-06-01

Group delay spread (GDS) of coupled three-core fiber is investigated based on coupled-wave theory. The differences between supermode and discrete core mode models are thoroughly investigated to reveal the applicability of both models to specific fiber bending conditions. Macrobending with random twisting is taken into account for random modal mixing in the fiber. It is found that for weakly bent conditions, both supermode and discrete core mode models are applicable. On the other hand, for strongly bent conditions, the discrete core mode model should be used to account for the increased differential modal group delay for the fiber without twisting and with short correlation length, which was experimentally observed recently. Results presented in this paper indicate that the discrete core mode model is superior to the supermode model for the analysis of coupled-multicore fibers under various bending conditions. Also, for estimating the GDS of coupled-multicore fiber, it is critically important to take into account the fiber bending condition.
Neural networks within multi-core optic fibers

PubMed Central

Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

2016-01-01

Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks. PMID:27383911
Neural networks within multi-core optic fibers.

PubMed

Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

2016-07-07

Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.
Multicore and GPU algorithms for Nussinov RNA folding

PubMed Central

2014-01-01

Background: One segment of an RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Results: We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. Conclusions: Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. PMID:25082539
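For reference, the Nussinov recurrence that such implementations accelerate can be sketched in a few lines. The following is a minimal, unoptimized single-core sketch and not the authors' cache efficient code; the function name `nussinov_max_pairs`, the `min_loop` parameter, and the restriction to Watson-Crick pairs are assumptions made for illustration.

```python
# Watson-Crick base pairs only (an assumption; wobble G-U pairs are omitted).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov_max_pairs(seq, min_loop=0):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases required between
    the two partners of a pair (0 allows adjacent bases to pair).
    """
    n = len(seq)
    if n == 0:
        return 0
    # score[i][j] holds the best pair count for the subsequence seq[i..j].
    score = [[0] * n for _ in range(n)]
    # Fill by increasing span so that shorter subsequences are ready first.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j], score[i][j - 1])  # i or j unpaired
            if (seq[i], seq[j]) in PAIRS and j - i > min_loop:
                inner = score[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)  # i pairs with j
            for k in range(i + 1, j):  # split into two independent parts
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAAUCC"))
```

All cells with the same span depend only on cells with shorter spans, so they can be computed independently of each other; this dependency structure is what parallel versions of the algorithm typically exploit.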
High performance ultrasonic field simulation on complex geometries

NASA Astrophysics Data System (ADS)

Chouh, H.; Rougeron, G.; Chatillon, S.; Iehl, J. C.; Farrugia, J. P.; Ostromoukhov, V.

2016-02-01

Ultrasonic field simulation is a key ingredient for the design of new testing methods as well as a crucial step for NDT inspection simulation. As presented in a previous paper [1], CEA-LIST has worked on the acceleration of these simulations focusing on simple geometries (planar interfaces, isotropic materials). In this context, significant accelerations were achieved on multicore processors and GPUs (Graphics Processing Units), bringing the execution time of realistic computations into the 0.1 s range. In this paper, we present recent work that aims at similar performance on a wider range of configurations. We adapted the physical model used by the CIVA platform to design and implement a new algorithm providing a fast ultrasonic field simulation that yields nearly interactive results for complex cases. The improvements over the CIVA pencil-tracing method include adaptive strategies for pencil subdivisions to achieve a good refinement of the sensor geometry while keeping a reasonable number of ray-tracing operations. Also, interpolation of the times of flight was used to avoid time consuming computations in the impulse response reconstruction stage. To achieve the best performance, our algorithm runs on multi-core superscalar CPUs and uses high performance specialized libraries such as Intel Embree for ray-tracing, Intel MKL for signal processing and Intel TBB for parallelization. We validated the simulation results by comparing them to the ones produced by CIVA on identical test configurations including mono-element and multiple-element transducers, homogeneous, meshed 3D CAD specimens, isotropic and anisotropic materials and wave paths that can involve several interactions with interfaces. We show performance results on complete simulations that achieve computation times in the 1 s range.
MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware.

PubMed

Lommen, Arjen; Kools, Harrie J

2012-08-01

A new, multi-threaded version of the GC-MS and LC-MS data processing software, metAlign, has been developed which is able to utilize multiple cores on one PC. This new version was tested using three different multi-core PCs with different operating systems. The performance of noise reduction, baseline correction and peak-picking was 8-19 fold faster compared to the previous version on a single core machine from 2008. The alignment was 5-10 fold faster. Factors influencing the performance enhancement are discussed. Our observations show that performance scales with the increase in processor core numbers we currently see in consumer PC hardware development.
Few Mode Multicore Photonic Lantern Multiplexer

DTIC Science & Technology

2016-01-01

2015, Valencia (2015). [6] S. G. Leon-Saval, T. A. Birks, J. Bland-Hawthorn, and M. Englund, “Multimode fiber devices with single-mode performance...Opt. Lett. 30, 2545–2547 (2005). [7] D. Noordegraaf, P. M. W. Skovgaard, M. D. Nielsen, and J. Bland-Hawthorn, “Efficient multi-mode to single mode...coupling in a photonic lantern,” Opt. Express 17, 1988–1994 (2009). [8] S. G. Leon-Saval, A. Argyros, and J. Bland-Hawthorn, “Photonic lanterns: a
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
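To make the notion of a triad census concrete, here is a minimal brute-force sketch that classifies every three-node subgraph of a small directed graph by its edge configuration. It is a naive cubic-time illustration, not the optimized algorithm described in the record; the names `canonical_triad` and `triad_census` and the canonical-form trick (taking the smallest edge code over all node orderings) are assumptions of this example.

```python
from collections import Counter
from itertools import combinations, permutations


def canonical_triad(edges, a, b, c):
    """Return a label that is identical for all isomorphic triads.

    The label is the smallest 6-tuple of edge-presence flags over the
    six possible orderings of the three nodes.
    """
    best = None
    for x, y, z in permutations((a, b, c)):
        code = tuple(
            (u, v) in edges
            for u, v in ((x, y), (y, x), (x, z), (z, x), (y, z), (z, y))
        )
        if best is None or code < best:
            best = code
    return best


def triad_census(nodes, edges):
    """Count the triads of each edge configuration in a directed graph."""
    edge_set = set(edges)
    return Counter(
        canonical_triad(edge_set, a, b, c) for a, b, c in combinations(nodes, 3)
    )


if __name__ == "__main__":
    # Hypothetical four-node graph: a mutual edge 0<->1 and a single edge 1->2.
    census = triad_census([0, 1, 2, 3], [(0, 1), (1, 0), (1, 2)])
    for label, count in sorted(census.items()):
        print(label, count)
```

Enumerating all 64 possible edge sets on three nodes with `canonical_triad` yields 16 distinct labels, matching the 16 triad types of the standard census. Each triple is classified independently of the others, which is the property that makes the census amenable to shared-memory parallelism.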
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

PubMed Central

Manolakos, Elias S.

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
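The efficiency figure quoted above follows from the usual definition of parallel efficiency as speedup divided by the number of cores. The short check below is only an arithmetic illustration; the function name `parallel_efficiency` is an assumption of this example.

```python
def parallel_efficiency(speedup, cores):
    """Parallel efficiency: achieved speedup as a fraction of the core count."""
    return speedup / cores


if __name__ == "__main__":
    # Speedup of 42 on the 48-core SCC, as reported in the abstract.
    print(parallel_efficiency(42, 48))  # 0.875, which rounds to the quoted 0.9
```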

Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

PubMed

Sharma, Anuj; Manolakos, Elias S

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
Comparison of multihardware parallel implementations for a phase unwrapping algorithm

NASA Astrophysics Data System (ADS)

Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

2018-04-01

Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and more accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm for a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that minimizes a cost function by means of a serial Gauss-Seidel type algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, it is a parallel Jacobi-class algorithm with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel in the same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architectures such as multicore CPU, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all cases, our parallel algorithm achieves superior performance compared with the original serial version. In addition, we present a detailed performance comparison of the developed parallel versions.
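The chessboard (red-black) update order can be sketched on a much simpler problem. The example below relaxes a grid toward the solution of Laplace's equation, which stands in for the phase unwrapping cost function and is an assumption of this sketch, as are the names `red_black_sweep` and `relax`. The loops run serially here; the point is only to show why same-colored pixels are independent.

```python
def red_black_sweep(u, color):
    """Update all interior cells of one color (0 = red, 1 = black) in place.

    A cell (i, j) has color (i + j) % 2, so its four neighbors always have
    the opposite color. Cells of one color therefore never read each other
    and could be updated in parallel.
    """
    rows, cols = len(u), len(u[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if (i + j) % 2 == color:
                u[i][j] = 0.25 * (
                    u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                )


def relax(u, iterations):
    """Alternate red and black sweeps for a fixed number of iterations."""
    for _ in range(iterations):
        red_black_sweep(u, 0)
        red_black_sweep(u, 1)
    return u


if __name__ == "__main__":
    size = 6
    # Boundary cells fixed at 1.0, interior cells start at 0.0.
    grid = [
        [1.0 if i in (0, size - 1) or j in (0, size - 1) else 0.0 for j in range(size)]
        for i in range(size)
    ]
    relax(grid, 200)
    print(grid[2][2])
```

On a multicore CPU or GPU, each call to `red_black_sweep` would map to one parallel loop or kernel launch, with a synchronization point between the red and the black half-iterations.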
Research of real-time video processing system based on 6678 multi-core DSP

NASA Astrophysics Data System (ADS)

Li, Xiangzhen; Xie, Xiaodan; Yin, Xiaoqiang

2017-10-01

In the information age, the rapid development of intelligent video processing and increasingly complex algorithms pose a serious challenge to processor performance. In this article, an FPGA + TMS320C6678 architecture integrates image defogging, fusion, stabilization, and enhancement into a single system with good real-time behavior and superior performance. The system overcomes the limited functionality and single-purpose design of traditional video processing systems and addresses video applications such as security monitoring. It can make video monitoring more effective and improve enterprise economic benefits.
Improvement of Speckle Contrast Image Processing by an Efficient Algorithm.

PubMed

Steimers, A; Farnung, W; Kohl-Bareis, M

2016-01-01

We demonstrate an efficient algorithm for the temporal and spatial based calculation of speckle contrast for the imaging of blood flow by laser speckle contrast analysis (LASCA). It reduces the numerical complexity of necessary calculations, facilitates a multi-core and many-core implementation of the speckle analysis and enables an independence of temporal or spatial resolution and SNR. The new algorithm was evaluated for both spatial and temporal based analysis of speckle patterns with different image sizes and amounts of recruited pixels as sequential, multi-core and many-core code.
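Speckle contrast is conventionally defined as the ratio of the standard deviation of the intensity to its mean, K = σ/⟨I⟩, taken over a temporal or spatial window. The sketch below computes it for one window from running sums of I and I², so that each sample is visited once. This is a generic illustration and not the algorithm from the record; the function name `speckle_contrast` and the use of the population variance are assumptions.

```python
import math


def speckle_contrast(samples):
    """Speckle contrast K = std / mean for one window of intensity samples.

    samples may be one pixel over several frames (temporal analysis) or
    the pixels of a small neighborhood in one frame (spatial analysis).
    """
    n = len(samples)
    total = sum(samples)
    total_sq = sum(x * x for x in samples)
    mean = total / n
    if mean == 0:
        return 0.0
    # Population variance from the two sums; clamp tiny negative rounding errors.
    variance = max(total_sq / n - mean * mean, 0.0)
    return math.sqrt(variance) / mean


if __name__ == "__main__":
    print(speckle_contrast([0.0, 2.0]))       # fully developed contrast: 1.0
    print(speckle_contrast([1.0, 1.0, 1.0]))  # no fluctuation: 0.0
```

Because every window is processed independently, the per-pixel computation maps directly onto multi-core or many-core hardware.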
Fiber Bragg grating inscription in optical multicore fibers

NASA Astrophysics Data System (ADS)

Becker, Martin; Elsmann, Tino; Lorenz, Adrian; Spittel, Ron; Kobelke, Jens; Schuster, Kay; Rothhardt, Manfred; Latka, Ines; Dochow, Sebastian; Bartelt, Hartmut

2015-09-01

Fiber Bragg gratings as key components in telecommunication, fiber lasers, and sensing systems usually rely on the Bragg condition for single mode fibers. In special applications, such as in biophotonics and astrophysics, high light coupling efficiency is of great importance and therefore, multimode fibers are often preferred. The wavelength filtering effect of Bragg gratings in multimode fibers, however, is spectrally blurred over a wide modal spectrum of the fiber. With a well-designed all-solid multicore microstructured fiber, good light guiding efficiency in combination with a narrow spectral filtering effect by Bragg gratings becomes possible.
WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code

NASA Astrophysics Data System (ADS)

Mendygral, P. J.; Radcliffe, N.; Kandalla, K.; Porter, D.; O'Neill, B. J.; Nolting, C.; Edmon, P.; Donnert, J. M. F.; Jones, T. W.

2017-02-01

We present a new code for astrophysical magnetohydrodynamics specifically designed and optimized for high performance and scaling on modern and future supercomputers. We describe a novel hybrid OpenMP/MPI programming model that emerged from a collaboration between Cray, Inc. and the University of Minnesota. This design utilizes MPI-RMA optimized for thread scaling, which allows the code to run extremely efficiently at very high thread counts ideal for the latest generation of multi-core and many-core architectures. Such performance characteristics are needed in the era of “exascale” computing. We describe and demonstrate our high-performance design in detail with the intent that it may be used as a model for other, future astrophysical codes intended for applications demanding exceptional performance.
Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

PubMed

Kalantzis, Georgios; Tachibana, Hidenobu

2014-01-01

For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
A History-based Estimation for LHCb job requirements

NASA Astrophysics Data System (ADS)

Rauschmayr, Nathalie

2015-12-01

The main goal of a Workload Management System (WMS) is to find and allocate resources for the given tasks. The more and better job information the WMS receives, the easier it is to accomplish its task, which directly translates into higher utilization of resources. Traditionally, the information associated with each job, like expected runtime, is defined beforehand by the Production Manager in the best case and set to fixed arbitrary values by default. In the case of LHCb's Workload Management System no mechanisms are provided which automate the estimation of job requirements. As a result, much more CPU time is normally requested than actually needed. Particularly, in the context of multicore jobs this presents a major problem, since single- and multicore jobs shall share the same resources. Consequently, grid sites need to rely on estimations given by the VOs in order to not decrease the utilization of their worker nodes when making multicore job slots available. The main reason for going to multicore jobs is the reduction of the overall memory footprint. Therefore, it also needs to be studied how memory consumption of jobs can be estimated. A detailed workload analysis of past LHCb jobs is presented. It includes a study of job features and their correlation with runtime and memory consumption. Based on these features, a supervised learning algorithm is developed using history-based prediction. The aim is to learn over time how jobs’ runtime and memory evolve due to changes in experiment conditions and software versions. It will be shown that estimation can be notably improved if experiment conditions are taken into account.
Re-Form: FPGA-Powered True Codesign Flow for High-Performance Computing In The Post-Moore Era

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cappello, Franck; Yoshii, Kazutomo; Finkel, Hal

Multicore scaling will end soon because of practical power limits. Dark silicon is becoming a major issue even more than the end of Moore’s law. In the post-Moore era, the energy efficiency of computing will be a major concern. FPGAs could be a key to maximizing the energy efficiency. In this paper we address severe challenges in the adoption of FPGA in HPC and describe “Re-form,” an FPGA-powered codesign flow.
Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

DTIC Science & Technology

2014-05-01

processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general...deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA...algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel.27 The vast growth of digital video content has been a
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

PubMed Central

Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.

PubMed

Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.
MIMO signal progressing with RLSCMA algorithm for multi-mode multi-core optical transmission system

NASA Astrophysics Data System (ADS)

Bi, Yuan; Liu, Bo; Zhang, Li-jia; Xin, Xiang-jun; Zhang, Qi; Wang, Yong-jun; Tian, Qing-hua; Tian, Feng; Mao, Ya-ya

2018-01-01

When signals are transmitted over multi-mode multi-core fiber, coupling occurs between modes. Mode dispersion also occurs because each mode has a different transmission speed in the link. Mode coupling and mode dispersion damage the useful signal in the transmission link, so the receiver needs to process the received signal with digital signal processing and compensate for the damage in the link. We first analyze the influence of mode coupling and mode dispersion on signal transmission in multi-mode multi-core fiber, then present the relationship between the coupling coefficient and the dispersion coefficient. Then we carry out adaptive signal processing with MIMO equalizers based on the recursive least squares constant modulus algorithm (RLSCMA). The MIMO equalization algorithm offers adaptive equalization taps according to the degree of crosstalk in cores or modes, which eliminates the interference among different modes and cores in a space division multiplexing (SDM) transmission system. The simulation results show that the distorted signals are restored efficiently with fast convergence speed.
Parallel k-means++ for Multiple Shared-Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mackey, Patrick S.; Lewis, Robert R.

2016-09-22

In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
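For readers unfamiliar with the seeding step that this record parallelizes, the following is a minimal serial sketch of exact k-means++ initialization (each new center is drawn with probability proportional to its squared distance from the nearest center already chosen). It is not the authors' parallel implementation; the function names and the tuple-based point representation are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers with exact k-means++ (D^2) sampling."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = rng or random.Random()
    # First center: uniform random choice.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0.0:
            # Every point coincides with a center; fall back to a uniform draw.
            nxt = points[rng.randrange(len(points))]
        else:
            # Draw index i with probability d2[i] / total.
            r = rng.random() * total
            acc = 0.0
            idx = len(points) - 1  # guard against floating-point round-off
            for i, w in enumerate(d2):
                acc += w
                if r < acc:
                    idx = i
                    break
            nxt = points[idx]
        centers.append(nxt)
        # Only the new center can shrink a point's nearest-center distance.
        d2 = [min(old, squared_distance(p, nxt)) for p, old in zip(points, d2)]
    return centers
```

The two list comprehensions over `points` are the data-parallel parts of the loop; the weighted draw is the step that needs a reduction or prefix sum when the work is spread across threads.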
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Multicore Hardware Experiments in Software Producibility

DTIC Science & Technology

2009-06-01

processors. 15. SUBJECT TERMS Multi-core, Real-time Systems, Testing, Software Modernization 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF... real-time systems. The inputs to the dgclocalnav component are the path plan (received from highlevelplanner, discussed next), the drivable grid... time systems, robotics, and software. As frequently observed in cyber-physical systems, the system designers may need experience in multiple
MC3: Multi-core Markov-chain Monte Carlo code

NASA Astrophysics Data System (ADS)

Cubillos, Patricio; Harrington, Joseph; Lust, Nate; Foster, AJ; Stemm, Madison; Loredo, Tom; Stevenson, Kevin; Campo, Chris; Hardin, Matt; Hardy, Ryan

2016-10-01

MC3 (Multi-core Markov-chain Monte Carlo) is a Bayesian statistics tool that can be executed from the shell prompt or interactively through the Python interpreter with single- or multiple-CPU parallel computing. It offers Markov-chain Monte Carlo (MCMC) posterior-distribution sampling for several algorithms, Levenberg-Marquardt least-squares optimization, and uniform non-informative, Jeffreys non-informative, or Gaussian-informative priors. MC3 can share the same value among multiple parameters and fix the value of parameters to constant values, and offers Gelman-Rubin convergence testing and correlated-noise estimation with time-averaging or wavelet-based likelihood estimation methods.
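As background on the Gelman-Rubin convergence test mentioned above, here is a minimal sketch of the classic potential scale reduction factor for one parameter sampled by several chains. It is written independently of MC3 and does not use or reproduce its API; the function name and the list-of-lists input format are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter.

    chains: list of m equal-length sequences of samples (m >= 2, n >= 2).
    Values close to 1 suggest the chains sample the same distribution.
    """
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    # W: average within-chain sample variance.
    w = mean(variance(c) for c in chains)
    # B: between-chain variance, scaled by the chain length.
    b = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b / n
    return sqrt(var_hat / w)  # raises ZeroDivisionError if all chains are constant
```

Chains that agree give a value near 1, while chains stuck in different regions give a much larger value; for example, `gelman_rubin([[0, 1, 0, 1], [10, 11, 10, 11]])` is well above 10.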
Experimental demonstration of large capacity WSDM optical access network with multicore fibers and advanced modulation formats.

PubMed

Li, Borui; Feng, Zhenhua; Tang, Ming; Xu, Zhilin; Fu, Songnian; Wu, Qiong; Deng, Lei; Tong, Weijun; Liu, Shuang; Shum, Perry Ping

2015-05-04

Towards the next generation optical access network supporting large capacity data transmission to enormous number of users covering a wider area, we proposed a hybrid wavelength-space division multiplexing (WSDM) optical access network architecture utilizing multicore fibers with advanced modulation formats. As a proof of concept, we experimentally demonstrated a WSDM optical access network with duplex transmission using our developed and fabricated multicore (7-core) fibers with 58.7km distance. As a cost-effective modulation scheme for access network, the optical OFDM-QPSK signal has been intensity modulated on the downstream transmission in the optical line terminal (OLT) and it was directly detected in the optical network unit (ONU) after MCF transmission. 10 wavelengths with 25GHz channel spacing from an optical comb generator are employed and each wavelength is loaded with 5Gb/s OFDM-QPSK signal. After amplification, power splitting, and fan-in multiplexer, 10-wavelength downstream signal was injected into six outer layer cores simultaneously and the aggregation downstream capacity reaches 300 Gb/s. -16 dBm sensitivity has been achieved for 3.8 × 10^-3 bit error ratio (BER) with 7% Forward Error Correction (FEC) limit for all wavelengths in every core. Upstream signal from ONU side has also been generated and the bidirectional transmission in the same core causes negligible performance degradation to the downstream signal. As a universal platform for wired/wireless data access, our proposed architecture provides additional dimension for high speed mobile signal transmission and we hence demonstrated an upstream delivery of 20Gb/s per wavelength with QPSK modulation formats using the inner core of MCF emulating a mobile backhaul service. The IQ modulated data was coherently detected in the OLT side. -19 dBm sensitivity has been achieved under the FEC limit and more than 18 dB power budget is guaranteed.
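The 300 Gb/s downstream figure follows directly from the channel plan stated in the abstract (10 wavelengths, 5 Gb/s each, six outer cores). A small arithmetic check, with a function name and parameter names chosen here purely for illustration:

```python
def aggregate_capacity_gbps(wavelengths, rate_per_wavelength_gbps, cores):
    """Total capacity when every core carries the same set of wavelengths."""
    return wavelengths * rate_per_wavelength_gbps * cores


# Figures taken from the abstract: 10 wavelengths x 5 Gb/s x 6 outer cores.
downstream = aggregate_capacity_gbps(
    wavelengths=10, rate_per_wavelength_gbps=5, cores=6
)
print(downstream)  # 300
```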
Software Graphics Processing Unit (sGPU) for Deep Space Applications

NASA Technical Reports Server (NTRS)

McCabe, Mary; Salazar, George; Steele, Glen

2015-01-01

A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggests they cannot operate in the deep space radiation environment. Investigation into a Software Graphics Processing Unit (sGPU) comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mendygral, P. J.; Radcliffe, N.; Kandalla, K.

2017-02-01

We present a new code for astrophysical magnetohydrodynamics specifically designed and optimized for high performance and scaling on modern and future supercomputers. We describe a novel hybrid OpenMP/MPI programming model that emerged from a collaboration between Cray, Inc. and the University of Minnesota. This design utilizes MPI-RMA optimized for thread scaling, which allows the code to run extremely efficiently at very high thread counts ideal for the latest generation of multi-core and many-core architectures. Such performance characteristics are needed in the era of “exascale” computing. We describe and demonstrate our high-performance design in detail with the intent that it may be used as a model for other, future astrophysical codes intended for applications demanding exceptional performance.

Towards large dynamic range and ultrahigh measurement resolution in distributed fiber sensing based on multicore fiber.

PubMed

Dang, Yunli; Zhao, Zhiyong; Tang, Ming; Zhao, Can; Gan, Lin; Fu, Songnian; Liu, Tongqing; Tong, Weijun; Shum, Perry Ping; Liu, Deming

2017-08-21

Brillouin distributed optical fiber sensors exploit the dependence of the Brillouin frequency shift (BFS) on temperature and strain over a wide range, but they suffer from relatively poor temperature/strain measurement resolution. On the other hand, phase-sensitive optical time-domain reflectometry (Φ-OTDR) offers ultrahigh temperature/strain measurement resolution, but its available frequency scanning range is normally narrow, which severely restricts its measurement dynamic range. In order to achieve large dynamic range and high measurement resolution simultaneously, we propose to employ both Brillouin optical time domain analysis (BOTDA) and Φ-OTDR through a space-division multiplexed (SDM) configuration based on multicore fiber (MCF), in which the two sensors are implemented separately in the central core and a side core, respectively. As a proof of concept, temperature sensing has been performed for validation with 2.5 m spatial resolution over 1.565 km of MCF. A large temperature range (10 °C) has been measured by BOTDA, and a small temperature variation of 0.1 °C is successfully identified by Φ-OTDR with ~0.001 °C resolution. Moreover, the temperature changing process has been recorded by continuously performing the Φ-OTDR measurement with an 80 s frequency scanning period, showing about 0.02 °C temperature spacing at the monitored profile. The proposed system makes it possible to see finer and/or farther as required in distributed optical fiber sensing.
Rapid Onboard Data Product Generation with Multicore Processors and FPGA

NASA Astrophysics Data System (ADS)

Mandl, D.; Sohlberg, R. A.; Cappelaere, P. G.; Frye, S. W.; Ly, V.; Handy, M.; Ambrosia, V. G.; Sullivan, D. V.; Bland, G.; Pastor, E.; Crago, S.; Flatley, C.; Shah, N.; Bronston, J.; Creech, T.

2012-12-01

The Intelligent Payload Module (IPM) is an experimental testbed with multicore processors and a Field Programmable Gate Array (FPGA). This effort is being funded by the NASA Earth Science Technology Office as part of an Advanced Information Systems Technology (AIST) 2011 research grant to investigate the use of high performance onboard processing to create an onboard data processing pipeline that can rapidly process a subset of onboard imaging spectrometer data through (1) radiance to reflectance conversion, (2) atmospheric correction, (3) geolocation and co-registration, and (4) level 2 data product generation. The requirements are driven by the mission concept for the HyspIRI NASA Decadal mission, although other NASA Decadal missions could use the same concept. The system is being set up to make use of the same ground and flight software being used by other satellites at NASA/GSFC. Furthermore, a Web Coverage Processing Service (WCPS) is installed as part of the flight software, which enables a user on the ground to specify the desired algorithm to run onboard against the data in real time. Benchmark demonstrations are being run and will be run throughout the three-year effort on various platforms, including a helicopter and various airplane platforms with various instruments, to demonstrate various configurations that would be compatible with the HyspIRI mission and other similar missions. This presentation will lay out the demonstrations conducted to date along with any benchmark performance metrics and future demonstration efforts and objectives.
[Figure: Initial IPM Test Box]
Center for Technology for Advanced Scientific Component Software (TASCS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Govindaraju, Madhusudhan

Advanced Scientific Computing Research, Computer Science, FY 2010 Report. Center for Technology for Advanced Scientific Component Software: Distributed CCA. State University of New York, Binghamton, NY, 13902.
Summary: The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are working on re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB. We also worked on the design and implementation of modules that will be required for the emerging MapReduce model to be effective for scientific applications. Finally, we focused on optimizing the processing of scientific datasets on multi-core processors.
Research Details: We worked on the following research projects, which we are applying to CCA-based scientific applications.
1. Non-Hydrostatic Hydrodynamics: Non-hydrostatic hydrodynamics are significantly more accurate at modeling internal waves that may be important in lake ecosystems. Non-hydrostatic codes, however, are significantly more computationally expensive, often prohibitively so. We have worked with Chin Wu at the University of Wisconsin to parallelize non-hydrostatic code. We have obtained a maximum speedup of about 26 times. Although this is significant progress, we hope to improve the performance further, such that it becomes a practical alternative to hydrostatic codes.
2. Model-coupling for water-based ecosystems: To answer pressing questions about water resources requires that physical models (hydrodynamics) be coupled with biological and chemical models. Most hydrodynamics codes are written in Fortran, however, while most ecologists work in MATLAB. This disconnect creates a great barrier. To address this, we are working on a model coupling interface that will allow biogeochemical computations written in MATLAB to couple with Fortran codes. This will greatly improve the productivity of ecosystem scientists.
3. Low overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications: Since its inception, MapReduce has frequently been associated with Hadoop and large-scale datasets. Its deployment at Amazon in the cloud, and its applications at Yahoo! for large-scale distributed document indexing and database building, among other tasks, have thrust MapReduce to the forefront of the data processing application domain. The applicability of the paradigm, however, extends far beyond its use with data-intensive applications and disk-based systems, and can also be brought to bear in processing small but CPU-intensive distributed applications. MapReduce, however, carries its own burdens. Through experiments using Hadoop in the context of diverse applications, we uncovered latencies and delay conditions potentially inhibiting the expected performance of a parallel execution in CPU-intensive applications. Furthermore, as it currently stands, MapReduce is favored for data-centric applications, and as such tends to be applied solely to disk-based applications. The paradigm falls short in bringing its novelty to diskless systems dedicated to in-memory applications, and to compute-intensive programs that process much smaller data but require intensive computations. In this project, we focused both on the performance of processing large-scale hierarchical data in distributed scientific applications and on the processing of smaller but demanding input sizes primarily used in diskless and memory-resident I/O systems. We designed LEMO-MR [1], an optimized MapReduce implementation that is low-overhead, elastic, configurable for in-memory applications, and offers on-demand fault tolerance, for both on-disk and in-memory applications. We conducted experiments to identify not only the necessary components of this model, but also trade-offs and factors to be considered. We have initial results to show the efficacy of our implementation in terms of potential speedup that can be achieved for representative data sets used by cloud applications. We have quantified the performance gains exhibited by our MapReduce implementation over Apache Hadoop in a compute-intensive environment.
4. Cache Performance Optimization for Processing XML and HDF-based Application Data on Multi-core Processors: It is important to design and develop scientific middleware libraries to harness the opportunities presented by emerging multi-core processors. Implementations of scientific middleware and applications that do not adapt to the programming paradigm when executing on emerging processors can severely impact the overall performance. In this project, we focused on the utilization of the L2 cache, which is a critical shared resource on chip multiprocessors (CMP). The access pattern of the shared L2 cache, which is dependent on how the application schedules and assigns processing work to each thread, can either enhance or hurt the ability to hide memory latency on a multi-core processor. Therefore, while processing scientific datasets such as HDF5, it is essential to conduct fine-grained analysis of cache utilization, to inform scheduling decisions in multi-threaded programming. In this project, using the TAU toolkit for performance feedback from dual- and quad-core machines, we conducted performance analysis and made recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of scientific applications that requires processing of HDF5 data. In particular, we quantified the gains associated with the use of the adaptations we have made to the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time [2].
References: 1. Zacharia Fadika, Madhusudhan Govindaraju, "MapReduce Implementation for Memory-Based and Processing Intensive Applications", accepted in 2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, USA, Nov 30 - Dec 3, 2010. 2. Rajdeep Bhowmik, Madhusudhan Govindaraju, "Cache Performance Optimization for Processing XML-based Application Data on Multi-core Processors", in proceedings of The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 17-20, 2010, Melbourne, Victoria, Australia.
Contact Information: Madhusudhan Govindaraju, Binghamton University, State University of New York (SUNY), mgovinda@cs.binghamton.edu, Phone: 607-777-4904
A multicore optical fiber for distributed sensing

NASA Astrophysics Data System (ADS)

Sun, Xiaoguang; Li, Jie; Burgess, David T.; Hines, Mike; Zhu, Beyuan

2014-06-01

With advancements in optical fiber technology, the incorporation of multiple sensing functionalities within a single fiber structure opens the possibility to deploy dielectric, fully distributed, long-length optical sensors in an extremely small cross section. To illustrate the concept, we designed and manufactured a multicore optical fiber with three graded-index (GI) multimode (MM) cores and one single mode (SM) core. The fiber was coated with both a silicone primary layer and an ETFE buffer for high temperature applications. The fiber properties such as geometry, crosstalk and attenuation are described. A method for coupling the signal from the individual cores into separate optical fibers is also presented.
Stack-and-Draw Manufacture Process of a Seven-Core Optical Fiber for Fluorescence Measurements

NASA Astrophysics Data System (ADS)

Samir, Ahmed; Batagelj, Bostjan

2018-01-01

Multi-core, optical-fiber technology is expected to be used in telecommunications and sensory systems in a relatively short amount of time. However, a successful transition from research laboratories to industry applications will only be possible with an optimized design and manufacturing process. The fabrication process is an important aspect in designing and developing new multi-applicable, multi-core fibers, where the best candidate is a seven-core fiber. Here, the basics for designing and manufacturing a single-mode, seven-core fiber using the stack-and-draw process are described for the example of a fluorescence sensory system.
Design of Multi-core Fiber Patch Panel for Space Division Multiplexing Implementations

NASA Astrophysics Data System (ADS)

González, Luz E.; Morales, Alvaro; Rommel, Simon; Jørgensen, Bo F.; Porras-Montenegro, N.; Tafur Monroy, Idelfonso

2018-03-01

A multi-core fiber (MCF) patch panel was designed, allowing easy coupling of individual signals to and from a 7-core MCF. The device was characterized, measuring insertion loss and cross talk, finding highest insertion loss and lowest crosstalk at 1300 nm with values of 9.7 dB and -36.5 dB respectively, while at 1600 nm insertion loss drops to 4.8 dB and crosstalk increases to -24.1 dB. Two MCF splices between the fan-in module, the MCF, and the fan-out module are included in the characterization, and splicing parameters are discussed.
New multicore low mode noise scrambling fiber for applications in high-resolution spectroscopy

NASA Astrophysics Data System (ADS)

Haynes, Dionne M.; Gris-Sanchez, Itandehui; Ehrlich, Katjana; Birks, Tim A.; Giannone, Domenico; Haynes, Roger

2014-07-01

We present a new type of multicore fiber (MCF) and photonic lantern that consists of 511 individual cores designed to operate over a broadband visible wavelength range (380-860nm). It combines the coupling efficiency of a multimode fiber with modal stability intrinsic to a single mode fibre. It is designed to provide phase and amplitude scrambling to achieve a stable near field and far field illumination pattern during input coupling variations; it also has low modal noise for increased photometric stability. Preliminary results are presented for the new MCF as well as current state of the art octagonal fiber for comparison.
Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liao, C; Quinlan, D J; Willcock, J J

2008-12-12

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.
Optimization of the coherence function estimation for multi-core central processing unit

NASA Astrophysics Data System (ADS)

Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

2017-02-01

The paper considers the use of parallel processing on multi-core central processing units to optimize evaluation of the coherence function, which arises in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for evaluating the function for signals represented with digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, and their results are assessed for multi-core processors from different manufacturers. The speed-up of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing a high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
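To make the quantity being optimized concrete, here is a minimal sketch of a magnitude-squared coherence estimate for two sampled signals: the cross-spectrum and both auto-spectra are averaged over non-overlapping segments, then combined as |S_xy|^2 / (S_xx S_yy). This is a textbook estimator written in plain Python with a naive DFT, not the algorithm or the optimized code from the paper; the function names and the segment-length parameter are illustrative assumptions.

```python
import cmath


def dft(segment):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(segment)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(segment))
        for k in range(n)
    ]


def coherence(x, y, seg_len):
    """Magnitude-squared coherence per frequency bin, averaged over segments."""
    n_seg = min(len(x), len(y)) // seg_len
    if n_seg < 2:
        raise ValueError("need at least two full segments to average")
    sxx = [0.0] * seg_len
    syy = [0.0] * seg_len
    sxy = [0j] * seg_len
    # Each segment is independent, so this loop is the natural unit of parallel work.
    for s in range(n_seg):
        fx = dft(x[s * seg_len:(s + 1) * seg_len])
        fy = dft(y[s * seg_len:(s + 1) * seg_len])
        for k in range(seg_len):
            sxx[k] += abs(fx[k]) ** 2
            syy[k] += abs(fy[k]) ** 2
            sxy[k] += fx[k].conjugate() * fy[k]
    # Bins with no power in either signal are reported as 0.0.
    return [
        abs(sxy[k]) ** 2 / (sxx[k] * syy[k]) if sxx[k] * syy[k] > 0 else 0.0
        for k in range(seg_len)
    ]
```

A signal compared with a scaled copy of itself gives a coherence of 1 in every bin, while unrelated signals give values between 0 and 1. A production version would replace the naive DFT with an FFT and usually apply a window to each segment.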
24 CFR 203.605 - Loss mitigation performance.

Code of Federal Regulations, 2012 CFR

2012-04-01

... 24 Housing and Urban Development 2 2012-04-01 2012-04-01 false Loss mitigation performance. 203....605 Loss mitigation performance. (a) Duty to mitigate. Before four full monthly installments due on... mitigation techniques provided at § 203.501 to determine which is appropriate. Based upon such evaluations...
24 CFR 203.605 - Loss mitigation performance.

Code of Federal Regulations, 2014 CFR

2014-04-01

... 24 Housing and Urban Development 2 2014-04-01 2014-04-01 false Loss mitigation performance. 203....605 Loss mitigation performance. (a) Duty to mitigate. Before four full monthly installments due on... mitigation techniques provided at § 203.501 to determine which is appropriate. Based upon such evaluations...
24 CFR 203.605 - Loss mitigation performance.

Code of Federal Regulations, 2011 CFR

2011-04-01

... 24 Housing and Urban Development 2 2011-04-01 2011-04-01 false Loss mitigation performance. 203....605 Loss mitigation performance. (a) Duty to mitigate. Before four full monthly installments due on... mitigation techniques provided at § 203.501 to determine which is appropriate. Based upon such evaluations...
24 CFR 203.605 - Loss mitigation performance.

Code of Federal Regulations, 2013 CFR

2013-04-01

... 24 Housing and Urban Development 2 2013-04-01 2013-04-01 false Loss mitigation performance. 203....605 Loss mitigation performance. (a) Duty to mitigate. Before four full monthly installments due on... mitigation techniques provided at § 203.501 to determine which is appropriate. Based upon such evaluations...
Maximal clique enumeration with data-parallel primitives

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lessley, Brenton; Perciano, Talita; Mathai, Manish

The enumeration of all maximal cliques in an undirected graph is a fundamental problem arising in several research areas. We consider maximal clique enumeration on shared-memory, multi-core architectures and introduce an approach consisting entirely of data-parallel operations, in an effort to achieve efficient and portable performance across different architectures. We study the performance of the algorithm via experiments varying over benchmark graphs and architectures. Overall, we observe that our algorithm achieves up to a 33-fold and a 9-fold speedup over state-of-the-art distributed and serial algorithms, respectively, for graphs with higher ratios of maximal cliques to total cliques. Further, we attain additional speedups on a GPU architecture, demonstrating the portable performance of our data-parallel design.
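To illustrate the problem itself, the sketch below enumerates maximal cliques with the classic serial Bron-Kerbosch algorithm with pivoting. This is a swapped-in reference technique, not the data-parallel approach the record describes; the function name and the adjacency-dictionary input format are assumptions for illustration.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already-explored vertices.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# A triangle (1, 2, 3) with a pendant vertex 4 attached to vertex 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
found = maximal_cliques(graph)
```

For this small graph the result contains exactly two maximal cliques, `{1, 2, 3}` and `{3, 4}`.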
3D environment modeling and location tracking using off-the-shelf components

NASA Astrophysics Data System (ADS)

Luke, Robert H.

2016-05-01

The remarkable popularity of smartphones over the past decade has led to a technological race for dominance in market share. This has resulted in a flood of new processors and sensors that are inexpensive, low power and high performance. These sensors include accelerometers, gyroscope, barometers and most importantly cameras. This sensor suite, coupled with multicore processors, allows a new community of researchers to build small, high performance platforms for low cost. This paper describes a system using off-the-shelf components to perform position tracking as well as environment modeling. The system relies on tracking using stereo vision and inertial navigation to determine movement of the system as well as create a model of the environment sensed by the system.
Geospatial Applications on Different Parallel and Distributed Systems in enviroGRIDS Project

NASA Astrophysics Data System (ADS)

Rodila, D.; Bacu, V.; Gorgan, D.

2012-04-01

The execution of Earth Science applications and services on parallel and distributed systems has become a necessity, especially due to the large amounts of Geospatial data these applications require and the large geographical areas they cover. Parallelizing these applications solves important performance issues and can range from task parallelism to data parallelism. Parallel and distributed architectures such as Grid, Cloud, Multicore, etc. seem to offer the necessary functionalities to solve important problems in the Earth Science domain: storing, distribution, management, processing and security of Geospatial data, execution of complex processing through task and data parallelism, etc. A main goal of the FP7-funded project enviroGRIDS (Black Sea Catchment Observation and Assessment System supporting Sustainable Development) [1] is the development of a Spatial Data Infrastructure targeting this catchment region, but also the development of standardized and specialized tools for storing, analyzing, processing and visualizing the Geospatial data concerning this area. To achieve these objectives, enviroGRIDS deals with the execution of different Earth Science applications, such as hydrological models, Geospatial Web services standardized by the Open Geospatial Consortium (OGC) and others, on parallel and distributed architectures to maximize the obtained performance. This presentation analyzes the integration and execution of Geospatial applications on different parallel and distributed architectures and the possibility of choosing among these architectures based on application characteristics and user requirements through a specialized component. Versions of the proposed platform have been used in the enviroGRIDS project on different use cases such as: the execution of Geospatial Web services both on Web and Grid infrastructures [2] and the execution of SWAT hydrological models both on Grid and Multicore architectures [3].
The current focus is to integrate the Cloud infrastructure into the proposed platform; the Cloud is still a paradigm with critical problems to be solved despite the great efforts and investments. Cloud computing comes as a new way of delivering resources while using a large set of old as well as new technologies and tools for providing the necessary functionalities. The main challenges in Cloud computing, most of them also identified in the Open Cloud Manifesto 2009, address resource management and monitoring, data and application interoperability and portability, security, scalability, software licensing, etc. We propose a platform able to execute different Geospatial applications on different parallel and distributed architectures such as Grid, Cloud, Multicore, etc. with the possibility of choosing among these architectures based on application characteristics and complexity, user requirements, necessary performance, cost support, etc. The execution redirection to a selected architecture is realized through a specialized component and has the purpose of offering a flexible way of achieving the best performance considering the existing restrictions.
Using the cloud to speed-up calibration of watershed-scale hydrologic models (Invited)

NASA Astrophysics Data System (ADS)

Goodall, J. L.; Ercan, M. B.; Castronova, A. M.; Humphrey, M.; Beekwilder, N.; Steele, J.; Kim, I.

2013-12-01

This research focuses on using the cloud to address computational challenges associated with hydrologic modeling. One example is calibration of a watershed-scale hydrologic model, which can take days of execution time on typical computers. While parallel algorithms for model calibration exist and some researchers have used multi-core computers or clusters to run these algorithms, these solutions do not fully address the challenge because (i) calibration can still be too time consuming even on multicore personal computers and (ii) few in the community have the time and expertise needed to manage a compute cluster. Given this, another option that we are exploring through this work is the use of the cloud for speeding up calibration of watershed-scale hydrologic models. Used in this capacity, the cloud provides a means for renting a specific number and type of machines for only the time needed to perform a calibration model run. The cloud allows one to balance the duration of the calibration against the financial cost so that, if the budget allows, the calibration can be performed more quickly by renting more machines. Focusing specifically on the SWAT hydrologic model and a parallel version of the DDS calibration algorithm, we show significant speed-up across a range of watershed sizes using up to 256 cores to perform a model calibration. The tool provides a simple web-based user interface and the ability to monitor the submitted job during the calibration process. Finally, this talk concludes with initial work to leverage the cloud for other tasks associated with hydrologic modeling, including tasks related to preparing inputs for constructing place-based hydrologic models.
Benchmarking GPU and CPU codes for Heisenberg spin glass over-relaxation

NASA Astrophysics Data System (ADS)

Bernaschi, M.; Parisi, G.; Parisi, L.

2011-06-01

We present a set of possible implementations for Graphics Processing Units (GPU) of the Over-relaxation technique applied to the 3D Heisenberg spin glass model. The results show that a carefully tuned code can achieve more than 100 GFlops/s of sustained performance and update a single spin in about 0.6 nanoseconds. A multi-hit technique that exploits the GPU shared memory further reduces this time. Such results are compared with those obtained by means of a highly-tuned vector-parallel code on latest generation multi-core CPUs.
1001 Ways to run AutoDock Vina for virtual screening

NASA Astrophysics Data System (ADS)

Jaghoori, Mohammad Mahdi; Bleijlevens, Boris; Olabarriaga, Silvia D.

2016-03-01

Large-scale computing technologies have enabled high-throughput virtual screening involving thousands to millions of drug candidates. It is not trivial, however, for biochemical scientists to evaluate the technical alternatives and their implications for running such large experiments. Besides experience with the molecular docking tool itself, the scientist needs to learn how to run it on high-performance computing (HPC) infrastructures, and understand the impact of the choices made. Here, we review such considerations for a specific tool, AutoDock Vina, and use experimental data to illustrate the following points: (1) an additional level of parallelization increases virtual screening throughput on a multi-core machine; (2) capturing of the random seed is not enough (though necessary) for reproducibility on heterogeneous distributed computing systems; (3) the overall time spent on the screening of a ligand library can be improved by analysis of factors affecting execution time per ligand, including number of active torsions, heavy atoms and exhaustiveness. We also illustrate differences among four common HPC infrastructures: grid, Hadoop, small cluster and multi-core (virtual machine on the cloud). Our analysis shows that these platforms are suitable for screening experiments of different sizes. These considerations can guide scientists when choosing the best computing platform and set-up for their future large virtual screening experiments.
Genetic mapping of 15 human X chromosomal forensic short tandem repeat (STR) loci by means of multi-core parallelization.

PubMed

Diegoli, Toni Marie; Rohde, Heinrich; Borowski, Stefan; Krawczak, Michael; Coble, Michael D; Nothnagel, Michael

2016-11-01

Typing of X chromosomal short tandem repeat (X STR) markers has become a standard element of human forensic genetic analysis. Joint consideration of many X STR markers at a time increases their discriminatory power but, owing to physical linkage, requires inter-marker recombination rates to be accurately known. We estimated the recombination rates between 15 well established X STR markers using genotype data from 158 families (1041 individuals) and following a previously proposed likelihood-based approach that allows for single-step mutations. To meet the computational requirements of this family-based type of analysis, we modified a previous implementation so as to allow multi-core parallelization on a high-performance computing system. While we obtained recombination rate estimates larger than zero for all but one pair of adjacent markers within the four previously proposed linkage groups, none of the three X STR pairs defining the junctions of these groups yielded a recombination rate estimate of 0.50. Corroborating previous studies, our results therefore argue against a simple model of independent X chromosomal linkage groups. Moreover, the refined recombination fraction estimates obtained in our study will facilitate the appropriate joint consideration of all 15 investigated markers in forensic analysis. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
We introduce an algorithm for the simultaneous reconstruction of faults and slip fields. We prove that the minimum of a related regularized functional converges to the unique solution of the fault inverse problem. We consider a Bayesian approach. We use a parallel multi-core platform and we discuss techniques to save on computational time.

NASA Astrophysics Data System (ADS)

Volkov, D.

2017-12-01

We introduce an algorithm for the simultaneous reconstruction of faults and slip fields on those faults. We define a regularized functional to be minimized for the reconstruction. We prove that the minimum of that functional converges to the unique solution of the related fault inverse problem. Due to inherent uncertainties in measurements, rather than seeking a deterministic solution to the fault inverse problem, we consider a Bayesian approach. The advantage of such an approach is that we obtain a way of quantifying uncertainties as part of our final answer. On the downside, this Bayesian approach leads to a very large computation. To contend with the size of this computation we developed an algorithm for the numerical solution to the stochastic minimization problem which can be easily implemented on a parallel multi-core platform and we discuss techniques to save on computational time. After showing how this algorithm performs on simulated data and assessing the effect of noise, we apply it to measured data. The data was recorded during a slow slip event in Guerrero, Mexico.
Bristol Ridge: A 28-nm x86 Performance-Enhanced Microprocessor Through System Power Management

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sundaram, Sriram; Grenat, Aaron; Naffziger, Samuel

Power management techniques can be effective at extracting more performance and energy efficiency out of mature systems on chip (SoCs). For instance, the peak performance of microprocessors is often limited by worst case technology (Vmax), infrastructure (thermal/electrical), and microprocessor usage assumptions. Performance/watt of microprocessors also typically suffers from guard bands associated with the test and binning processes as well as worst case aging/lifetime degradation. Similarly, on multicore processors, shared voltage rails tend to limit the peak performance achievable in low thread count workloads. In this paper, we describe five power management techniques that maximize the per-part performance under the aforementioned constraints. Using these techniques, we demonstrate a net performance increase of up to 15% depending on the application and TDP of the SoC, implemented on 'Bristol Ridge,' a 28-nm CMOS, dual-core x86 accelerated processing unit.
Spiking neural networks on high performance computer clusters

NASA Astrophysics Data System (ADS)

Chen, Chong; Taha, Tarek M.

2011-09-01

In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster consists of 352 dual-core AMD Opterons, the Cell cluster consists of 320 Sony PlayStation 3s, and the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, the GPGPU is more expensive in terms of neuron/s throughput. If the cost of GPGPUs goes down in the future, this platform will become very cost effective for these models.
An accuracy aware low power wireless EEG unit with information content based adaptive data compression.

PubMed

Tolbert, Jeremy R; Kabali, Pratik; Brar, Simeranjit; Mukhopadhyay, Saibal

2009-01-01

We present a digital system for adaptive data compression for low power wireless transmission of Electroencephalography (EEG) data. The proposed system acts as a base-band processor between the EEG analog-to-digital front-end and RF transceiver. It performs a real-time accuracy energy trade-off for multi-channel EEG signal transmission by controlling the volume of transmitted data. We propose a multi-core digital signal processor for on-chip processing of EEG signals, to detect signal information of each channel and perform real-time adaptive compression. Our analysis shows that the proposed approach can provide significant savings in transmitter power with minimal impact on the overall signal accuracy.
Vascular system modeling in parallel environment - distributed and shared memory approaches

PubMed Central

Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

2011-01-01

The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Parallel processing implementation for the coupled transport of photons and electrons using OpenMP

NASA Astrophysics Data System (ADS)

Doerner, Edgardo

2016-05-01

In this work the use of OpenMP to implement the parallel processing of the Monte Carlo (MC) simulation of the coupled transport of photons and electrons is presented. This implementation was carried out using a modified EGSnrc platform which enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the development tools available in the Intel Parallel Studio XE 2015 (XE2015). The performance study of this new implementation was carried out on a desktop PC with a multi-core CPU, taking as a reference the performance of the original platform. The results were satisfactory in terms of both scalability and parallelization efficiency.
Boosting the FM-Index on the GPU: Effective Techniques to Mitigate Random Memory Access.

PubMed

Chacón, Alejandro; Marco-Sola, Santiago; Espinosa, Antonio; Ribeca, Paolo; Moure, Juan Carlos

2015-01-01

The recent advent of high-throughput sequencing machines producing big amounts of short reads has boosted the interest in efficient string searching techniques. As of today, many mainstream sequence alignment software tools rely on a special data structure, called the FM-index, which allows for fast exact searches in large genomic references. However, such searches translate into a pseudo-random memory access pattern, thus making memory access the limiting factor of all computation-efficient implementations, both on CPUs and GPUs. Here, we show that several strategies can be put in place to remove the memory bottleneck on the GPU: more compact indexes can be implemented by having more threads work cooperatively on larger memory blocks, and a k-step FM-index can be used to further reduce the number of memory accesses. The combination of those and other optimisations yields an implementation that is able to process about two Gbases of queries per second on our test platform, being about 8 × faster than a comparable multi-core CPU version, and about 3 × to 5 × faster than the FM-index implementation on the GPU provided by the recently announced Nvidia NVBIO bioinformatics library.
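The memory-access pattern described in the abstract above can be illustrated with a minimal sketch of FM-index backward search. This is a textbook version written for clarity, not the authors' GPU implementation: the function names are hypothetical, the index is built by naively sorting rotations, and `occ` scans the BWT linearly, whereas practical indexes sample the occurrence table. Each pattern character triggers two `occ` lookups at unrelated positions, which is the source of the pseudo-random memory accesses.

```python
def build_fm_index(text):
    """Build the BWT and the C table for text (naive, for illustration only)."""
    text += "$"  # unique terminator, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return bwt, c_table


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[0:i]; a real index would use sampled counters."""
    return bwt[:i].count(ch)


def count_occurrences(bwt, c_table, pattern):
    """Count exact matches of pattern by backward search over the BWT."""
    lo, hi = 0, len(bwt)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c_table = build_fm_index("ACGTACGTGA")
print(count_occurrences(bwt, c_table, "ACG"))
```

A k-step FM-index, as mentioned in the abstract, would consume k pattern characters per iteration and so reduce the number of such lookups; that variant is not shown here.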
Experimental observation of spontaneous depolarized guided acoustic-wave Brillouin scattering in side cores of a multicore fiber

NASA Astrophysics Data System (ADS)

Hayashi, Neisei; Mizuno, Yosuke; Nakamura, Kentaro; Set, Sze Yun; Yamashita, Shinji

2018-06-01

Spontaneous depolarized guided acoustic-wave Brillouin scattering (GAWBS) was experimentally observed in one of the side cores of an uncoated multicore fiber (MCF). The frequency bandwidth in the side core was up to ∼400 MHz, which is 0.5 times that in the central core. The GAWBS spectrum of the side core of the MCF included intrinsic peaks, which had different acoustic resonance frequencies from those of the central core. In addition, the spontaneous depolarized GAWBS in the central/side core was unaffected by that in the other core. These results will lead to the development of polarization/phase modulators using an MCF.
Single-shot polarimetry imaging of multicore fiber.

PubMed

Sivankutty, Siddharth; Andresen, Esben Ravn; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

2016-05-01

We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element, together with an analyzer, and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining fiber bundles. We demonstrate also that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay, and diattenuation.
Shape sensing using multi-core fiber optic cable and parametric curve solutions.

PubMed

Moore, Jason P; Rogge, Matthew D

2012-01-30

The shape of a multi-core optical fiber is calculated by numerically solving a set of Frenet-Serret equations describing the path of the fiber in three dimensions. Included in the Frenet-Serret equations are curvature and bending direction functions derived from distributed fiber Bragg grating strain measurements in each core. The method offers advantages over prior art in that it determines complex three-dimensional fiber shape as a continuous parametric solution rather than an integrated series of discrete planar bends. Results and error analysis of the method using a tri-core optical fiber are presented. Maximum error expressed as a percentage of fiber length was found to be 7.2%.
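As a rough illustration of what numerically solving the Frenet-Serret equations involves, the sketch below integrates r' = T, T' = κN, N' = -κT + τB, B' = -τN with a classical fourth-order Runge-Kutta step. It is an assumption-laden example, not the authors' method: the function name, the choice of RK4, and the use of caller-supplied curvature and torsion functions are all hypothetical (in the paper, curvature and bending direction come from strain measurements in each core).

```python
import math


def integrate_frenet_serret(curvature, torsion, length, steps=1000):
    """Return the end position of a curve of given arc length.

    curvature(s) and torsion(s) are hypothetical callables of arc length s.
    The curve starts at the origin with tangent x, normal y, binormal z.
    """

    def deriv(s, y):
        tangent, normal, binormal = y[3:6], y[6:9], y[9:12]
        k, t = curvature(s), torsion(s)
        d_pos = tangent                                   # r' = T
        d_tan = [k * n for n in normal]                   # T' = kappa N
        d_nor = [-k * a + t * b for a, b in zip(tangent, binormal)]
        d_bin = [-t * n for n in normal]                  # B' = -tau N
        return d_pos + d_tan + d_nor + d_bin

    h = length / steps
    y = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
    for i in range(steps):
        s = i * h
        k1 = deriv(s, y)
        k2 = deriv(s + h / 2, [a + h / 2 * b for a, b in zip(y, k1)])
        k3 = deriv(s + h / 2, [a + h / 2 * b for a, b in zip(y, k2)])
        k4 = deriv(s + h, [a + h * b for a, b in zip(y, k3)])
        y = [a + h / 6 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(y, k1, k2, k3, k4)]
    return y[0:3]


# Constant curvature 2 and zero torsion trace a circle of radius 0.5;
# half the circumference (arc length pi/2) ends diametrically opposite the start.
end = integrate_frenet_serret(lambda s: 2.0, lambda s: 0.0, math.pi / 2)
print(end)
```

Because curvature and torsion enter as continuous functions of arc length, the result is a continuous parametric curve rather than a chain of discrete planar bends, which is the distinction the abstract draws.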
Tunable arbitrary unitary transformer based on multiple sections of multicore fibers with phase control.

PubMed

Zhou, Junhe; Wu, Jianjie; Hu, Qinsong

2018-02-05

In this paper, we propose a novel tunable unitary transformer, which can achieve arbitrary discrete unitary transforms. The unitary transformer is composed of multiple sections of multi-core fibers with closely aligned coupled cores. Phase shifters are inserted before and after the sections to control the phases of the waves in the cores. A simple algorithm is proposed to find the optimal phase setup for the phase shifters to realize the desired unitary transforms. The proposed device is fiber based and is particularly suitable for the mode division multiplexing systems. A tunable mode MUX/DEMUX for a three-mode fiber is designed based on the proposed structure.
Classification of Magnetic Nanoparticle Systems—Synthesis, Standardization and Analysis Methods in the NanoMag Project

PubMed Central

Bogren, Sara; Fornara, Andrea; Ludwig, Frank; del Puerto Morales, Maria; Steinhoff, Uwe; Fougt Hansen, Mikkel; Kazakova, Olga; Johansson, Christer

2015-01-01

This study presents classification of different magnetic single- and multi-core particle systems using their measured dynamic magnetic properties together with their nanocrystal and particle sizes. The dynamic magnetic properties are measured with AC (dynamical) susceptometry and magnetorelaxometry and the size parameters are determined from electron microscopy and dynamic light scattering. Using these methods, we also show that the nanocrystal size and particle morphology determines the dynamic magnetic properties for both single- and multi-core particles. The presented results are obtained from the four year EU NMP FP7 project, NanoMag, which is focused on standardization of analysis methods for magnetic nanoparticles. PMID:26343639
Fluorosilicate and fluorophosphate superfluorescent multicore optical fibers co-doped with Nd3+/Yb3+

NASA Astrophysics Data System (ADS)

Kochanowicz, M.; Zmojda, J.; Dorosz, D.

2014-06-01

In this paper, the spectroscopic properties of two glass systems, fluorosilicate and fluorophosphate, co-doped with Nd3+/Yb3+ ions are investigated. Under optical excitation at a wavelength of 808 nm, strong and wide emission is observed in the 1 μm region, corresponding to the superposition of the optical transitions 4F3/2 → 4I11/2 (Nd3+) and 2F5/2 → 2F7/2 (Yb3+). The optimization of Nd3+ → Yb3+ energy transfer in both glasses allows the manufacture of multicore optical fibers with narrowing and red-shifting of amplified spontaneous emission (ASE) at 1.1 μm.
Seven-core multicore fiber transmissions for passive optical network.

PubMed

Zhu, B; Taunay, T F; Yan, M F; Fini, J M; Fishteyn, M; Monberg, E M; Dimarcello, F V

2010-05-24

We design and fabricate a novel multicore fiber (MCF), with seven cores arranged in a hexagonal array. The fiber properties of MCF including low crosstalk, attenuation and splice loss are described. A new tapered MCF connector (TMC), showing ultra-low crosstalk and losses, is also designed and fabricated for coupling the individual signals in-and-out of the MCF. We further propose a novel network configuration using parallel transmissions with the MCF and TMC for passive optical network (PON). To the best of our knowledge, we demonstrate the first bi-directional parallel transmissions of 1310 nm and 1490 nm signals over 11.3-km of seven-core MCF with 64-way splitter for PON.
Fast parallel algorithm for slicing STL based on pipeline

NASA Astrophysics Data System (ADS)

Ma, Xulong; Lin, Feng; Yao, Bo

2016-05-01

In the additive manufacturing field, current research on data processing mainly focuses on slicing large STL files or complicated CAD models. A parallel algorithm offers great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that both have a marked effect on the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. A second parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of multi-core CPU hardware and accelerates the slicing process; compared with the data-parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
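The two scaling laws named in the abstract above can be written down in a few lines. The sketch below is only an illustration of the formulas: the function names are made up here, and the parallel fraction of 0.9 is an arbitrary example value, not a figure from the paper.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: fixed problem size, serial part limits the speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law: problem size grows with the number of workers."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers


# Illustrative only: 90% of the work is assumed to be parallelizable.
for workers in (1, 2, 4, 8, 16):
    print(workers,
          round(amdahl_speedup(0.9, workers), 2),
          round(gustafson_speedup(0.9, workers), 2))
```

Under Amdahl's law the speedup saturates at 1 / (1 - p) as threads are added, whereas under Gustafson's law it keeps growing when the workload (here, the number of layers) grows with the resources, which matches the two trends the abstract reports.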
SAXS analysis of single- and multi-core iron oxide magnetic nanoparticles

PubMed Central

Szczerba, Wojciech; Costo, Rocio; Morales, Maria del Puerto; Thünemann, Andreas F.

2017-01-01

This article reports on the characterization of four superparamagnetic iron oxide nanoparticles stabilized with dimercaptosuccinic acid, which are suitable candidates for reference materials for magnetic properties. Particles p1 and p2 are single-core particles, while p3 and p4 are multi-core particles. Small-angle X-ray scattering analysis reveals a lognormal type of size distribution for the iron oxide cores of the particles. Their mean radii are 6.9 nm (p1), 10.6 nm (p2), 5.5 nm (p3) and 4.1 nm (p4), with narrow relative distribution widths of 0.08, 0.13, 0.08 and 0.12. The cores are arranged as a clustered network in the form of dense mass fractals with a fractal dimension of 2.9 in the multi-core particles p3 and p4, but the cores are well separated from each other by a protecting organic shell. The radii of gyration of the mass fractals are 48 and 44 nm, and each network contains 117 and 186 primary particles, respectively. The radius distributions of the primary particle were confirmed with transmission electron microscopy. All particles contain purely maghemite, as shown by X-ray absorption fine structure spectroscopy. PMID:28381973
Structural and magnetic properties of multi-core nanoparticles analysed using a generalised numerical inversion method

PubMed Central

Bender, P.; Bogart, L. K.; Posth, O.; Szczerba, W.; Rogers, S. E.; Castro, A.; Nilsson, L.; Zeng, L. J.; Sugunan, A.; Sommertune, J.; Fornara, A.; González-Alonso, D.; Barquín, L. Fernández; Johansson, C.

2017-01-01

The structural and magnetic properties of magnetic multi-core particles were determined by numerical inversion of small angle scattering and isothermal magnetisation data. The investigated particles consist of iron oxide nanoparticle cores (9 nm) embedded in poly(styrene) spheres (160 nm). A thorough physical characterisation of the particles included transmission electron microscopy, X-ray diffraction and asymmetrical flow field-flow fractionation. Their structure was ultimately disclosed by an indirect Fourier transform of static light scattering, small angle X-ray scattering and small angle neutron scattering data of the colloidal dispersion. The extracted pair distance distribution functions clearly indicated that the cores were mostly accumulated in the outer surface layers of the poly(styrene) spheres. To investigate the magnetic properties, the isothermal magnetisation curves of the multi-core particles (immobilised and dispersed in water) were analysed. The study stands out by applying the same numerical approach to extract the apparent moment distributions of the particles as for the indirect Fourier transform. It could be shown that the main peak of the apparent moment distributions correlated to the expected intrinsic moment distribution of the cores. Additional peaks were observed which signaled deviations of the isothermal magnetisation behavior from the non-interacting case, indicating weak dipolar interactions. PMID:28397851
Geocomputation over Hybrid Computer Architecture and Systems: Prior Works and On-going Initiatives at UARK

NASA Astrophysics Data System (ADS)

Shi, X.

2015-12-01

As NSF indicated, "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high-performance computing for big data analytics has become urgent because many research activities are constrained by software or tools that cannot even complete the computation. Heterogeneous geodata integration and analytics magnify the complexity and the operational time frame. Many large-scale geospatial problems may not be processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and the Graphics Processing Unit (GPU), together with advanced computing technologies, provide promising ways to employ massive parallelism and hardware resources to achieve scalability and high performance for data-intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environments to achieve scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise of scalability and high performance by exploiting task and data levels of parallelism that are not supported by conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data, as shown by our prior work, while the potential of such advanced infrastructure remains largely unexplored in this domain. In this presentation, our prior and ongoing initiatives are summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs, and MICs, to accelerate geocomputation in different applications.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

PubMed Central

Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.

2014-01-01

The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545

Efficient Geometric Sound Propagation Using Visibility Culling

NASA Astrophysics Data System (ADS)

Chandak, Anish

2011-07-01

Simulating propagation of sound can improve the sense of realism in interactive applications such as video games and can lead to better designs in engineering applications such as architectural acoustics. In this thesis, we present geometric sound propagation techniques which are faster than prior methods and map well to upcoming parallel multi-core CPUs. We model specular reflections by using the image-source method and model finite-edge diffraction by using the well-known Biot-Tolstoy-Medwin (BTM) model. We accelerate the computation of specular reflections by applying novel visibility algorithms, FastV and AD-Frustum, which compute visibility from a point. We accelerate finite-edge diffraction modeling by applying a novel visibility algorithm which computes visibility from a region. Our visibility algorithms are based on frustum tracing and exploit recent advances in fast ray-hierarchy intersections, data-parallel computations, and scalable, multi-core algorithms. The AD-Frustum algorithm adapts its computation to the scene complexity and allows small errors in computing specular reflection paths for higher computational efficiency. FastV and our visibility algorithm from a region are general, object-space, conservative visibility algorithms that together significantly reduce the number of image sources compared to other techniques while preserving the same accuracy. Our geometric propagation algorithms are an order of magnitude faster than prior approaches for modeling specular reflections and two to ten times faster for modeling finite-edge diffraction. Our algorithms are interactive, scale almost linearly on multi-core CPUs, and can handle large, complex, and dynamic scenes. We also compare the accuracy of our sound propagation algorithms with other methods. Once sound propagation is performed, it is desirable to listen to the propagated sound in interactive and engineering applications. We can generate smooth, artifact-free output audio signals by applying efficient audio-processing algorithms. We also present the first efficient audio-processing algorithm for scenarios with simultaneously moving source and moving receiver (MS-MR) which incurs less than 25% overhead compared to static source and moving receiver (SS-MR) or moving source and static receiver (MS-SR) scenario.
High-accuracy fiber-optic shape sensing

NASA Astrophysics Data System (ADS)

Duncan, Roger G.; Froggatt, Mark E.; Kreger, Stephen T.; Seeley, Ryan J.; Gifford, Dawn K.; Sang, Alexander K.; Wolfe, Matthew S.

2007-04-01

We describe the results of a study of the performance characteristics of a monolithic fiber-optic shape sensor array. Distributed strain measurements in a multi-core optical fiber interrogated with the optical frequency domain reflectometry technique are used to deduce the shape of the optical fiber; referencing to a coordinate system yields position information. Two sensing techniques are discussed herein: the first employing fiber Bragg gratings and the second employing the intrinsic Rayleigh backscatter of the optical fiber. We have measured shape and position under a variety of circumstances and report the accuracy and precision of these measurements. A discussion of error sources is included.
Multicore fibre photonic lanterns for precision radial velocity Science

NASA Astrophysics Data System (ADS)

Gris-Sánchez, Itandehui; Haynes, Dionne M.; Ehrlich, Katjana; Haynes, Roger; Birks, Tim A.

2018-04-01

Incomplete fibre scrambling and fibre modal noise can degrade high-precision spectroscopic applications (typically high spectral resolution and high signal to noise). For example, it can be the dominating error source for exoplanet finding spectrographs, limiting the maximum measurement precision possible with such facilities. This limitation is exacerbated in the next generation of infra-red based systems, as the number of modes supported by the fibre scales inversely with the wavelength squared and more modes typically equates to better scrambling. Substantial effort has been made by major research groups in this area to improve the fibre link performance by employing non-circular fibres, double scramblers, fibre shakers, and fibre stretchers. We present an original design of a multicore fibre (MCF) terminated with multimode photonic lantern ports. It is designed to act as a relay fibre with the coupling efficiency of a multimode fibre (MMF), modal stability similar to a single-mode fibre and low loss in a wide range of wavelengths (380 nm to 860 nm). It provides phase and amplitude scrambling to achieve a stable near field and far-field output illumination pattern despite input coupling variations, and low modal noise for increased stability for high signal-to-noise applications such as precision radial velocity (PRV) science. Preliminary results are presented for a 511-core MCF and compared with current state of the art octagonal fibre.
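The inverse-square dependence of mode count on wavelength mentioned in the abstract above can be checked with the standard step-index approximation M ≈ V²/2, where V = 2πa·NA/λ. The sketch below is a back-of-the-envelope illustration only: the approximation holds for V much greater than 1, and the core radius and numerical aperture used are arbitrary example values, not parameters of the fibre described in the abstract.

```python
import math


def v_number(core_radius_um, numerical_aperture, wavelength_um):
    """Normalized frequency V of a step-index fibre."""
    return 2.0 * math.pi * core_radius_um * numerical_aperture / wavelength_um


def approx_mode_count(core_radius_um, numerical_aperture, wavelength_um):
    """Approximate number of guided modes (both polarizations), valid for V >> 1."""
    v = v_number(core_radius_um, numerical_aperture, wavelength_um)
    return v * v / 2.0


# Illustrative values (assumed): 25 um core radius, NA = 0.22.
visible = approx_mode_count(25.0, 0.22, 0.5)
infrared = approx_mode_count(25.0, 0.22, 1.0)
print(visible / infrared)  # doubling the wavelength quarters the mode count
```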
Static and Dynamic Frequency Scaling on Multicore CPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer

2016-12-28

Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not accounting for the processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach that automatically adapts the frequency based on the CPU workload and is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming the runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor, while improving overall performance.
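The a priori selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' framework: the power table, the Amdahl-style time model, and the names `POWER_PER_CORE_W`, `STATIC_POWER_W`, `run_time_s` and `best_config` are all hypothetical, chosen only to show how a one-time characterization could be searched for the (frequency, core count) pair that minimizes energy or energy-delay product (EDP).

```python
from itertools import product

# Hypothetical one-time characterization: dynamic power (watts) drawn by one
# active core at each available frequency (GHz). Illustrative values only.
POWER_PER_CORE_W = {1.2: 4.0, 2.0: 8.0, 2.8: 15.0}
STATIC_POWER_W = 10.0  # assumed package power that is always drawn


def run_time_s(work_gcycles, freq_ghz, cores, parallel_fraction):
    """Amdahl-style time estimate for a compute-bound code segment."""
    serial = work_gcycles * (1.0 - parallel_fraction) / freq_ghz
    parallel = work_gcycles * parallel_fraction / (freq_ghz * cores)
    return serial + parallel


def best_config(work_gcycles, parallel_fraction, max_cores, objective="edp"):
    """Return the (frequency, cores) pair minimizing energy or EDP."""
    best = None
    for freq, cores in product(POWER_PER_CORE_W, range(1, max_cores + 1)):
        t = run_time_s(work_gcycles, freq, cores, parallel_fraction)
        energy = (STATIC_POWER_W + cores * POWER_PER_CORE_W[freq]) * t
        score = energy * t if objective == "edp" else energy
        if best is None or score < best[0]:
            best = (score, freq, cores)
    return best[1], best[2]


if __name__ == "__main__":
    print(best_config(100.0, 0.9, 4, objective="edp"))
    print(best_config(100.0, 0.9, 4, objective="energy"))
```

Under this toy model a fully serial segment never benefits from extra cores, because they add power without reducing time; a real characterization would also have to capture memory-bound behavior, which the model above ignores.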
Multicore Programming Challenges

NASA Astrophysics Data System (ADS)

Perrone, Michael

The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.
Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Sourouri, Mohammed; Birger Raknes, Espen

2017-04-01

In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI) the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite these architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory and also require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor, based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (number of cores and tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh based interconnect. In contrast to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required.
However, like most accelerators, KNL sports a memory subsystem consisting of low-level caches and 16 GB of high-bandwidth MCDRAM memory. For capacity computing, up to 400 GB of conventional DDR4 memory is provided. Such a strict hierarchical memory layout means that data locality is imperative if the true potential of this product is to be harnessed. In this work, we study a series of optimizations specifically targeting KNL for our EWE based application to reduce the time to solution for the following 3D model sizes in grid points: 128^3, 256^3 and 512^3. We compare the results with an optimized version for multi-core CPUs running on a dual-socket Xeon E5 2680v3 system using OpenMP. Our initial naive implementation on the KNL is roughly 20% faster than the multi-core version, but by using only one thread per core and careful memory placement using the memkind library, we could achieve higher speedups. Additionally, by using the MCDRAM as cache for problem sizes that are smaller than 16 GB further performance improvements were unlocked. Depending on the problem size, our overall results indicate that the KNL based system is approximately 2.2x faster than the 24-core Xeon E5 2680v3 system, with only modest changes to the code.
Parallel Evolutionary Optimization for Neuromorphic Network Training

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schuman, Catherine D; Disney, Adam; Singh, Susheela

One of the key impediments to the success of current neuromorphic computing architectures is the issue of how best to program them. Evolutionary optimization (EO) is one promising programming technique; in particular, its wide applicability makes it especially attractive for neuromorphic architectures, which can have many different characteristics. In this paper, we explore different facets of EO on a spiking neuromorphic computing model called DANNA. We focus on the performance of EO in the design of our DANNA simulator, and on how to structure EO on both multicore and massively parallel computing systems. We evaluate how our parallel methods impact the performance of EO on Titan, the U.S.'s largest open science supercomputer, and BOB, a Beowulf-style cluster of Raspberry Pi's. We also focus on how to improve the EO by evaluating commonality in higher performing neural networks, and present the result of a study that evaluates the EO performed by Titan.
mdtmFTP and its evaluation on ESNET SDN testbed

DOE PAGES

Zhang, Liang; Wu, Wenji; DeMar, Phil; ...

2017-04-21

In this paper, to address the high-performance challenges of data transfer in the big data era, we are developing and implementing mdtmFTP: a high-performance data transfer tool for big data. mdtmFTP has four salient features. First, it adopts an I/O centric architecture to execute data transfer tasks. Second, it more efficiently utilizes the underlying multicore platform through optimized thread scheduling. Third, it implements a large virtual file mechanism to address the lots-of-small-files (LOSF) problem. Fourth, mdtmFTP integrates multiple optimization mechanisms, including zero copy, asynchronous I/O, pipelining, batch processing, and pre-allocated buffer pools, to enhance performance. mdtmFTP has been extensively tested and evaluated within the ESNET 100G testbed. Evaluations show that mdtmFTP can achieve higher performance than existing data transfer tools, such as GridFTP, FDT, and BBCP.
Multicore job scheduling in the Worldwide LHC Computing Grid

NASA Astrophysics Data System (ADS)

Forti, A.; Pérez-Calero Yzquierdo, A.; Hartmann, T.; Alef, M.; Lahiff, A.; Templon, J.; Dal Pra, S.; Gila, M.; Skipsey, S.; Acosta-Silva, C.; Filipcic, A.; Walker, R.; Walker, C. J.; Traynor, D.; Gadrat, S.

2015-12-01

After the successful first run of the LHC, data taking is scheduled to restart in Summer 2015 with experimental conditions leading to increased data volumes and event complexity. In order to process the data generated in such scenario and exploit the multicore architectures of current CPUs, the LHC experiments have developed parallelized software for data reconstruction and simulation. However, a good fraction of their computing effort is still expected to be executed as single-core tasks. Therefore, jobs with diverse resources requirements will be distributed across the Worldwide LHC Computing Grid (WLCG), making workload scheduling a complex problem in itself. In response to this challenge, the WLCG Multicore Deployment Task Force has been created in order to coordinate the joint effort from experiments and WLCG sites. The main objective is to ensure the convergence of approaches from the different LHC Virtual Organizations (VOs) to make the best use of the shared resources in order to satisfy their new computing needs, minimizing any inefficiency originated from the scheduling mechanisms, and without imposing unnecessary complexities in the way sites manage their resources. This paper describes the activities and progress of the Task Force related to the aforementioned topics, including experiences from key sites on how to best use different batch system technologies, the evolution of workload submission tools by the experiments and the knowledge gained from scale tests of the different proposed job submission strategies.
Performance of adaptive DD-OFDM multicore fiber links and its relation with intercore crosstalk.

PubMed

Alves, Tiago M F; Luís, Ruben S; Puttnam, Benjamin J; Cartaxo, Adolfo V T; Awaji, Yoshinari; Wada, Naoya

2017-07-10

Adaptive direct-detection (DD) orthogonal frequency-division multiplexing (OFDM) is proposed to guarantee signal quality over time in weakly-coupled homogeneous multicore fiber (MCF) links impaired by stochastic intercore crosstalk (ICXT). For the first time, the received electrical power of the ICXT and the performance of the adaptive DD-OFDM MCF link are experimentally monitored quasi-simultaneously over a 210 hour period. Experimental results show that the time evolution of the error vector magnitude due to the ICXT can be suitably estimated from the normalized power of the detected crosstalk. The detected crosstalk results from the beating between the carrier in the test core and ICXT originating from the carrier and modulated signal from the interfering core. The results show that the operation of DD-OFDM systems employing fixed modulation can be severely impaired by the presence of ICXT that may vary unpredictably in both power and frequency. The system may suffer from the deleterious impact of moderate ICXT levels over a time duration of several hours or from peak ICXT levels occurring over a number of minutes. Such power fluctuations can lead to large variations in bit error ratio (BER) for static modulation schemes. Here, we show that BER fluctuations may be minimized by the use of adaptive modulation techniques and that, in particular, adaptive OFDM is a viable solution to guarantee link quality in MCF-based systems. An experimental model of an adaptive DD-OFDM MCF link shows an average throughput of 12 Gb/s that represents a reduction of only 9% compared to the maximum throughput measured without ICXT and an improvement of 23% relative to throughput obtained with static modulation.
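The two percentages quoted at the end of this abstract imply reference throughputs that the abstract does not state. The short calculation below backs them out. It assumes that both percentages are taken relative to the reference value (the maximum without ICXT, and the static-modulation throughput, respectively); that reading is an assumption, and the derived figures are not reported by the authors.

```python
# Figures quoted in the abstract.
adaptive_gbps = 12.0     # average throughput with adaptive DD-OFDM
loss_vs_no_icxt = 0.09   # "reduction of only 9%" vs. maximum without ICXT
gain_vs_static = 0.23    # "improvement of 23%" vs. static modulation

# Assumed reading: adaptive = max * (1 - 0.09) and adaptive = static * (1 + 0.23).
max_without_icxt_gbps = adaptive_gbps / (1.0 - loss_vs_no_icxt)
static_gbps = adaptive_gbps / (1.0 + gain_vs_static)

print(f"implied maximum without ICXT: {max_without_icxt_gbps:.2f} Gb/s")
print(f"implied static-modulation throughput: {static_gbps:.2f} Gb/s")
```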
Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

NASA Astrophysics Data System (ADS)

Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

2013-03-01

The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
The new landscape of parallel computer architecture

NASA Astrophysics Data System (ADS)

Shalf, John

2007-07-01

The past few years have seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Aliaga, José I.; Alonso, Pedro; Badía, José M.

We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousand degrees of freedom, and when the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.
Variation tolerant SoC design

NASA Astrophysics Data System (ADS)

Kozhikkottu, Vivek J.

The scaling of integrated circuits into the nanometer regime has led to variations emerging as a primary concern for designers of integrated circuits. Variations are an inevitable consequence of the semiconductor manufacturing process, and also arise due to the side-effects of operation of integrated circuits (voltage, temperature, and aging). Conventional design approaches, which are based on design corners or worst-case scenarios, leave designers with an undesirable choice between the considerable overheads associated with over-design and significantly reduced manufacturing yield. Techniques for variation-tolerant design at the logic, circuit and layout levels of the design process have been developed and are in commercial use. However, with the incessant increase in variations due to technology scaling and design trends such as near-threshold computing, these techniques are no longer sufficient to contain the effects of variations, and there is a need to address variations at all stages of design. This thesis addresses the problem of variation-tolerant design at the earliest stages of the design process, where the system-level design decisions that are made can have a very significant impact. There are two key aspects to making system-level design variation-aware. First, analysis techniques must be developed to project the impact of variations on system-level metrics such as application performance and energy. Second, variation-tolerant design techniques need to be developed to absorb the residual impact of variations (that cannot be contained through lower-level techniques). In this thesis, we address both these facets by developing robust and scalable variation-aware analysis and variation mitigation techniques at the system level. The first contribution of this thesis is a variation-aware system-level performance analysis framework. 
We address the key challenge of translating the per-component clock frequency distributions into a system-level application performance distribution. This task is particularly complex and challenging due to the inter-dependencies between components' execution, indirect effects of shared resources, and interactions between multiple system-level "execution paths". We argue that accurate variation-aware performance analysis requires Monte-Carlo based repeated system execution. Our proposed analysis framework leverages emulation to significantly speed up performance analysis without sacrificing the generality and accuracy achieved by Monte-Carlo based simulations. Our experiments show performance improvements of around 60x compared to state-of-the-art hardware-software co-simulation tools and also underscore the framework's potential to enable variation-aware design and exploration at the system level. Our second contribution addresses the problem of designing variation-tolerant SoCs using recovery based design, a popular circuit design paradigm that addresses variations by eliminating guard-bands and operating circuits at close to "zero margins" while detecting and recovering from timing errors. While previous efforts have demonstrated the potential benefits of recovery based design, we identify several challenges that need to be addressed in order to apply this technique to SoCs. We present a systematic design framework to apply recovery based design at the system level. We propose to partition SoCs into "recovery islands", wherein each recovery island consists of one or more SoC components that can recover independently of the rest of the SoC. We present a variation-aware design methodology that partitions a given SoC into recovery islands and computes the optimal operating points for each island, taking into account the various trade-offs involved.
Our experiments demonstrate that the proposed design framework achieves an average of 32% energy savings over conventional worst-case designs, with negligible losses in performance. The third contribution of this thesis introduces disproportionate allocation of shared system resources as a means to combat the adverse impact of within-die variations on multi-core platforms. For multi-threaded programs executing on variation-impacted multi-core platforms, we make the key observation that thread performance is not only a function of the frequency of the core on which it is executing, but also depends upon the amount of shared system resources allocated to it. We utilize this insight to design a variation-aware runtime scheme which allocates the ways of a last-level shared L2 cache amongst the different cores/threads of a multi-core platform, taking into account both application characteristics and chip-specific variation profiles. Our experiments on 100 quad-core chips, each with a distinct variation profile, show an average 15% performance improvement for a suite of multi-threaded benchmarks. Our final contribution investigates the variation-tolerant design of domain-specific accelerators and demonstrates how the unique architectural properties of these accelerators can be leveraged to create highly effective variation tolerance mechanisms. We explore this concept through the variation-tolerant design of a vector processor that efficiently executes applications from the domains of recognition, mining and synthesis (RMS). We develop a novel design approach for variation tolerance, which leverages the unique nature of the vector reduction operations performed by this processor to effectively predict and preempt the occurrence of timing errors under variations and subsequently restore the correct output at the end of each vector reduction operation.
We implement the above predict, preempt and restore operations by suitably enhancing the processor hardware and the application software and demonstrate considerable energy benefits (32% on average) across six applications from the domains of RMS. In conclusion, our work provides system designers with powerful tools and mechanisms in their efforts to combat variations, resulting in improved designer productivity and variation-tolerant systems.
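The first contribution of this thesis, turning per-component clock frequency distributions into a system-level performance distribution by Monte Carlo sampling, can be sketched in a few lines. This is a deliberately simplified illustration, not the thesis framework: it assumes independent Gaussian frequency variation per component and takes system time to be the time of the slowest component, whereas the thesis relies on emulation of full system execution to capture shared-resource and execution-path interactions. All function and parameter names are hypothetical.

```python
import random
import statistics


def sample_system_time(rng, nominal_ghz, sigma_frac, work_gcycles):
    """One Monte Carlo trial: draw a clock frequency for every component and
    return the completion time of the slowest one (assumed system model)."""
    times = []
    for nominal, work in zip(nominal_ghz, work_gcycles):
        # Clamp to a small positive value so a rare extreme draw cannot
        # produce a zero or negative frequency.
        freq = max(0.1, rng.gauss(nominal, sigma_frac * nominal))
        times.append(work / freq)
    return max(times)


def performance_distribution(nominal_ghz, sigma_frac, work_gcycles,
                             trials=10_000, seed=0):
    """Summarize the sampled distribution of system completion times."""
    rng = random.Random(seed)
    samples = sorted(
        sample_system_time(rng, nominal_ghz, sigma_frac, work_gcycles)
        for _ in range(trials)
    )
    return {
        "mean": statistics.fmean(samples),
        "p95": samples[int(0.95 * (trials - 1))],
        "worst": samples[-1],
    }


if __name__ == "__main__":
    # Three components, each nominally 1 GHz with 1 Gcycle of work.
    print(performance_distribution([1.0, 1.0, 1.0], 0.05, [1.0, 1.0, 1.0]))
```

Even in this toy model the mean completion time exceeds the nominal one, because the system waits for whichever component happened to be slowest in each trial.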
Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gorentla Venkata, Manjunath; Shamis, Pavel; Graham, Richard L

2013-01-01

Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for various communication mechanisms in the system, 2) providing the ability to configure the depth of hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of this hierarchy. Using this design, we implement MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, which is a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform the production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating its efficiency, flexibility and portability. On InfiniBand systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieve a speedup of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations also show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x, compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms the Cray MPI by 145%.
The evaluation with an application kernel, Conjugate Gradient solver, shows that the Cheetah reductions speed up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
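The idea of a hierarchical reduction can be shown with a small single-process simulation. The sketch below is not Cheetah and does not use MPI; it assumes a fixed two-level hierarchy (ranks grouped into nodes, one leader per node), and the function name `hierarchical_allreduce` and its parameters are hypothetical. It only illustrates the data flow: reduce within each node, reduce across node leaders, then hand the result back to every rank.

```python
from functools import reduce
import operator


def hierarchical_allreduce(values, ranks_per_node, op=operator.add):
    """Simulate a two-level allreduce over a non-empty list of per-rank values.

    values[i] is the contribution of rank i; consecutive groups of
    ranks_per_node ranks are assumed to share a node.
    """
    # Level 1: each node combines the values of its local ranks
    # (in a real system this step would use shared memory).
    nodes = [values[i:i + ranks_per_node]
             for i in range(0, len(values), ranks_per_node)]
    node_partials = [reduce(op, local) for local in nodes]

    # Level 2: node leaders combine the partial results
    # (in a real system this step would cross the network).
    total = reduce(op, node_partials)

    # Final step: every rank receives the combined result.
    return [total] * len(values)


if __name__ == "__main__":
    print(hierarchical_allreduce([1, 2, 3, 4, 5, 6], ranks_per_node=2))
    print(hierarchical_allreduce([3, 9, 4, 1, 7], ranks_per_node=2, op=max))
```

The result is only independent of the grouping when the operation is associative, which holds for integer sum, max and min; floating-point sums can differ slightly between groupings.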
Incentive Compatible Online Scheduling of Malleable Parallel Jobs with Individual Deadlines

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carroll, Thomas E.; Grosu, Daniel

2010-09-13

We consider the online scheduling of malleable jobs on parallel systems, such as clusters, symmetric multiprocessing computers, and multi-core processor computers. Malleable jobs are a model of parallel processing in which jobs adapt to the number of processors assigned to them. This model permits the scheduler and resource manager to make more efficient use of the available resources. Each malleable job is characterized by arrival time, deadline, and value. If the job completes by its deadline, the user earns the payoff indicated by the value; otherwise, she earns a payoff of zero. The scheduling objective is to maximize the sum of the values of the jobs that complete by their associated deadlines. Complicating the matter is that users in the real world are rational and they will attempt to manipulate the scheduler by misreporting their jobs’ parameters if it benefits them to do so. To mitigate this behavior, we design an incentive compatible online scheduling mechanism. Incentive compatibility assures us that the users will obtain the maximum payoff only if they truthfully report their jobs’ parameters to the scheduler. Finally, we simulate and study the mechanism to show the effects of misreports on the cheaters and on the system.
Diderot: a Domain-Specific Language for Portable Parallel Scientific Visualization and Image Analysis.

PubMed

Kindlmann, Gordon; Chiw, Charisee; Seltzer, Nicholas; Samuels, Lamont; Reppy, John

2016-01-01

Many algorithms for scientific visualization and image analysis are rooted in the world of continuous scalar, vector, and tensor fields, but are programmed in low-level languages and libraries that obscure their mathematical foundations. Diderot is a parallel domain-specific language that is designed to bridge this semantic gap by providing the programmer with a high-level, mathematical programming notation that allows direct expression of mathematical concepts in code. Furthermore, Diderot provides parallel performance that takes advantage of modern multicore processors and GPUs. The high-level notation allows a concise and natural expression of the algorithms and the parallelism allows efficient execution on real-world datasets.
Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.

PubMed

Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard

2015-02-01

Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
Fault Tolerance Middleware for a Multi-Core System

NASA Technical Reports Server (NTRS)

Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

2012-01-01

Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. 
This provides additional information to the introspection module that it can use in generating its next response. For example, if the responder triggers an application rollback and errors are still occurring, the introspection module may decide to recommend an application restart.
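The feedback loop between the responder and the introspection module can be sketched as follows. This is a toy model, not the FTM code: the 0 to 4 severity scale, the ordering of responses from least to most disruptive, the one-step escalation rule, and the class and method names are all assumptions made for illustration. Only the list of possible responses is taken from the abstract.

```python
from dataclasses import dataclass, field

# Responses named in the abstract, ordered here (by assumption) from least
# to most disruptive.
RESPONSES = ["ignore", "log", "rollback", "swap_node", "restart"]


@dataclass
class Introspection:
    """Toy stand-in for the introspection fault-tolerance strategy."""
    # Maps a node id to the index of the last response invoked for it.
    history: dict = field(default_factory=dict)

    def recommend(self, node, severity):
        """Recommend a response for a new error on the given node."""
        # Baseline: map an assumed 0..4 severity level straight to a response.
        level = min(max(severity, 0), len(RESPONSES) - 1)
        last = self.history.get(node)
        # If an equal or stronger response was already tried and errors are
        # still arriving, escalate one step beyond it.
        if last is not None and last >= level:
            level = min(last + 1, len(RESPONSES) - 1)
        return RESPONSES[level]

    def notify(self, node, response):
        """Called by the responder once it has invoked a response."""
        self.history[node] = RESPONSES.index(response)


if __name__ == "__main__":
    intro = Introspection()
    for _ in range(4):
        action = intro.recommend("core-3", severity=2)
        print(action)
        intro.notify("core-3", action)
```

With this ordering, repeated errors after a rollback lead first to a node swap and then to a restart; the abstract's own example goes from rollback directly to a restart recommendation, which a different escalation rule could reproduce.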
Exploiting MIC architectures for the simulation of channeling of charged particles in crystals

NASA Astrophysics Data System (ADS)

Bagli, Enrico; Karpusenko, Vadim

2016-08-01

Coherent effects of ultra-relativistic particles in crystals are an area of science under development. DYNECHARM++ is a toolkit for the simulation of coherent interactions between high-energy charged particles and complex crystal structures. The particle trajectory in a crystal is computed through numerical integration of the equation of motion. The code was revised and improved in order to exploit parallelization on multi-cores and vectorization of single instructions on multiple data. An Intel Xeon Phi card was adopted for the performance measurements. The computation time was shown to scale linearly as a function of the number of physical and virtual cores. By enabling the auto-vectorization flag of the compiler a threefold speedup was obtained. The performance of the card was compared to that of a dual Xeon system.

Improving the energy efficiency of sparse linear system solvers on multicore and manycore systems.

PubMed

Anzt, H; Quintana-Ortí, E S

2014-06-28

While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draft and energy spent for this purpose is increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draft as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyse the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS

NASA Astrophysics Data System (ADS)

Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.

2018-03-01

A simulation of breaking waves based on the Navier-Stokes equations, solved with the moving particle semi-implicit (MPS) method over a closed domain, is presented. The results show that parallel computing on a multicore architecture using OpenMP can reduce the computational time to almost half of the serial time. Two computer architectures (AMD and Intel) are compared. The Intel architecture performs better than the AMD architecture in CPU time. However, in efficiency, the AMD architecture scores slightly higher than the Intel. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30, respectively. Moreover, for a similar number of particles, AMD attains an efficiency of 50.09 % and Intel up to 49.42 %.
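The speedup and efficiency figures quoted above can be related with the usual definitions, sketched here in Python. The abstract does not state how efficiency was computed, nor the serial times or thread counts, so the formula (speedup divided by thread count) and the sample timings are assumptions for illustration only. The two CPU times are the ones reported in the abstract.

```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_threads):
    """Conventional parallel efficiency: speedup per thread (assumed definition)."""
    return speedup(t_serial, t_parallel) / n_threads


# Hypothetical timings (not from the abstract): a run that halves the serial
# time on 4 threads has a speedup of 2 and an efficiency of 50 %.
example_speedup = speedup(100.0, 50.0)
example_efficiency = efficiency(100.0, 50.0, 4)

# CPU times reported in the abstract for 1512 particles (units not stated).
cpu_time_intel = 12662.47
cpu_time_amd = 28282.30
amd_to_intel_ratio = cpu_time_amd / cpu_time_intel

print(f"speedup={example_speedup:.2f}, efficiency={example_efficiency:.0%}")
print(f"AMD/Intel CPU-time ratio: {amd_to_intel_ratio:.2f}")
```

Under these definitions, a platform can be faster in absolute CPU time and still show slightly lower efficiency, which matches the pattern the abstract reports.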
Composition and Realization of Source-to-Sink High-Performance Flows: File Systems, Storage, Hosts, LAN and WAN

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu, Chase Qishi

A number of Department of Energy (DOE) science applications, involving exascale computing systems and large experimental facilities, are expected to generate large volumes of data, in the range of petabytes to exabytes, which will be transported over wide-area networks for the purpose of storage, visualization, and analysis. To support such capabilities, significant progress has been made in various components including the deployment of 100 Gbps networks with future 1 Tbps bandwidth, increases in end-host capabilities with multiple cores and buses, capacity improvements in large disk arrays, and deployment of parallel file systems such as Lustre and GPFS. High-performance source-to-sink data flows must be composed of these component systems, which requires significant optimizations of the storage-to-host data and execution paths to match the edge and long-haul network connections. In particular, end systems are currently supported by 10-40 Gbps Network Interface Cards (NIC) and 8-32 Gbps storage Host Channel Adapters (HCAs), which carry the individual flows that collectively must reach network speeds of 100 Gbps and higher. Indeed, such data flows must be synthesized using multicore, multibus hosts connected to high-performance storage systems on one side and to the network on the other side. Current experimental results show that the constituent flows must be optimally composed and preserved from storage systems, across the hosts and the networks with minimal interference. Furthermore, such a capability must be made available transparently to the science users without placing undue demands on them to account for the details of underlying systems and networks. And, this task is expected to become even more complex in the future due to the increasing sophistication of hosts, storage systems, and networks that constitute the high-performance flows.
The objectives of this proposal are to (1) develop and test the component technologies and their synthesis methods to achieve source-to-sink high-performance flows, and (2) develop tools that provide these capabilities through simple interfaces to users and applications. In terms of the former, we propose to develop (1) optimization methods that align and transition multiple storage flows to multiple network flows on multicore, multibus hosts; and (2) edge and long-haul network path realization and maintenance using advanced provisioning methods including OSCARS and OpenFlow. We also propose synthesis methods that combine these individual technologies to compose high-performance flows using a collection of constituent storage-network flows, and realize them across the storage and local network connections as well as long-haul connections. We propose to develop automated user tools that profile the hosts, storage systems, and network connections; compose the source-to-sink complex flows; and set up and maintain the needed network connections. These solutions will be tested using (1) 100 Gbps connection(s) between Oak Ridge National Laboratory (ORNL) and Argonne National Laboratory (ANL) with storage systems supported by Lustre and GPFS file systems with an asymmetric connection to University of Memphis (UM); (2) ORNL testbed with multicore and multibus hosts, switches with OpenFlow capabilities, and network emulators; and (3) 100 Gbps connections from ESnet and their OpenFlow testbed, and other experimental connections. This proposal brings together the expertise and facilities of the two national laboratories, ORNL and ANL, and UM. It also represents a collaboration between DOE and the Department of Defense (DOD) projects at ORNL by sharing technical expertise and personnel costs, and leveraging the existing DOD Extreme Scale Systems Center (ESSC) facilities at ORNL.
24 CFR 203.605 - Loss mitigation performance.

Code of Federal Regulations, 2010 CFR

2010-04-01

... performance. (1) HUD will measure and advise mortgagees of their loss mitigation performance through the Tier... mitigation attempts, defaults, and claims. Based on the ratios, HUD will group mortgagees in four tiers (Tiers 1, 2, 3, and 4), with Tier 1 representing the highest or best ranking mortgagees and Tier 4...
Active Job Monitoring in Pilots

NASA Astrophysics Data System (ADS)

Kuehn, Eileen; Fischer, Max; Giffels, Manuel; Jung, Christopher; Petzold, Andreas

2015-12-01

Recent developments in high energy physics (HEP) including multi-core jobs and multi-core pilots require data centres to gain a deep understanding of the system to monitor, design, and upgrade computing clusters. Networking is a critical component. In particular, the increased usage of data federations, for example in diskless computing centres or as a fallback solution, relies on WAN connectivity and availability. The specific demands of different experiments and communities, as well as the need to identify misbehaving batch jobs, require active monitoring. Existing monitoring tools are not capable of measuring fine-grained information at batch job level. This complicates network-aware scheduling and optimisations. In addition, pilots add another layer of abstraction. They behave like batch systems themselves by managing and executing payloads of jobs internally. The number of real jobs being executed is unknown, as the original batch system has no access to internal information about the scheduling process inside the pilots. Therefore, the comparability of jobs and pilots for predicting run-time behaviour or network performance cannot be ensured. Hence, identifying the actual payload is important. At the GridKa Tier 1 centre a specific tool is in use that allows the monitoring of network traffic information at batch job level. This contribution presents the current monitoring approach and discusses recent efforts to identify pilots and their substructures inside the batch system, and why this identification is important. It will also show how to determine monitoring data of specific jobs from identified pilots. Finally, the approach is evaluated.
Ultrasound phase rotation beamforming on multi-core DSP.

PubMed

Ma, Jieming; Karadayi, Kerem; Ali, Murtaza; Kim, Yongmin

2014-01-01

Phase rotation beamforming (PRBF) is a commonly-used digital receive beamforming technique. However, due to its high computational requirement, it has traditionally been supported by hardwired architectures, e.g., application-specific integrated circuits (ASICs) or more recently field-programmable gate arrays (FPGAs). In this study, we investigated the feasibility of supporting software-based PRBF on a multi-core DSP. To alleviate the high computing requirement, the analog front-end (AFE) chips integrating quadrature demodulation in addition to analog-to-digital conversion were defined and used. With these new AFE chips, only delay alignment and phase rotation need to be performed by DSP, substantially reducing the computational load. We implemented the delay alignment and phase rotation modules on a Texas Instruments C6678 DSP with 8 cores. We found it takes 200 μs to beamform 2048 samples from 64 channels using 2 cores. With 4 cores, 20 million samples can be beamformed in one second. Therefore, ADC frequencies up to 40 MHz with 2:1 decimation in AFE chips or up to 20 MHz with no decimation can be supported as long as the ADC-to-DSP I/O requirement can be met. The remaining 4 cores can work on back-end processing tasks and applications, e.g., color Doppler or ultrasound elastography. One DSP being able to handle both beamforming and back-end processing could lead to low-power and low-cost ultrasound machines, benefiting ultrasound imaging in general, particularly portable ultrasound machines. Copyright © 2013 Elsevier B.V. All rights reserved.
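The throughput figures in this abstract can be cross-checked with a few lines of arithmetic, sketched below in Python. The measured inputs (2048 samples in 200 μs on 2 cores) come from the abstract. Linear scaling from 2 to 4 cores is an assumption made here, and the helper name `max_adc_rate_hz` is hypothetical.

```python
# Measured figures quoted in the abstract.
samples_per_block = 2048      # beamformed samples per block
block_time_s = 200e-6         # time to beamform one block on 2 cores
cores_measured = 2

# Throughput on 2 cores, in beamformed samples per second.
rate_2_cores = samples_per_block / block_time_s

# Assumption: throughput scales linearly with the number of cores.
rate_per_core = rate_2_cores / cores_measured
rate_4_cores = 4 * rate_per_core


def max_adc_rate_hz(beamform_rate, decimation):
    """ADC sampling rate sustainable when the AFE decimates by `decimation`."""
    return beamform_rate * decimation


print(f"2 cores: {rate_2_cores / 1e6:.2f} Msamples/s")
print(f"4 cores: {rate_4_cores / 1e6:.2f} Msamples/s")
print(f"ADC limit, no decimation:  {max_adc_rate_hz(20e6, 1) / 1e6:.0f} MHz")
print(f"ADC limit, 2:1 decimation: {max_adc_rate_hz(20e6, 2) / 1e6:.0f} MHz")
```

Under the linear-scaling assumption, 4 cores sustain a little over 20 million samples per second, which is consistent with the 20 MHz (no decimation) and 40 MHz (2:1 decimation) limits stated in the abstract.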
Multicore-shell nanofiber architecture of polyimide/polyvinylidene fluoride blend for thermal and long-term stability of lithium ion battery separator.

PubMed

Park, Sejoon; Son, Chung Woo; Lee, Sungho; Kim, Dong Young; Park, Cheolmin; Eom, Kwang Sup; Fuller, Thomas F; Joh, Han-Ik; Jo, Seong Mu

2016-11-11

Li-ion battery, separator, multicoreshell structure, thermal stability, long-term stability. A nanofibrous membrane with multiple cores of polyimide (PI) in the shell of polyvinylidene fluoride (PVdF) was prepared using a facile one-pot electrospinning technique with a single nozzle. Unique multicore-shell (MCS) structure of the electrospun composite fibers was obtained, which resulted from electrospinning a phase-separated polymer composite solution. Multiple PI core fibrils with high molecular orientation were well-embedded across the cross-section and contributed remarkable thermal stabilities to the MCS membrane. Thus, no outbreaks were found in its dimension and ionic resistance up to 200 and 250 °C, respectively. Moreover, the MCS membrane (at ~200 °C), as a lithium ion battery (LIB) separator, showed superior thermal and electrochemical stabilities compared with a widely used commercial separator (~120 °C). The average capacity decay rate of LIB for 500 cycles was calculated to be approximately 0.030 mAh/g/cycle. This value demonstrated exceptional long-term stability compared with commercial LIBs and with two other types (single core-shell and co-electrospun separators incorporating with functionalized TiO2) of PI/PVdF composite separators. The proper architecture and synergy effects of multiple PI nanofibrils as a thermally stable polymer in the PVdF shell as electrolyte compatible polymers are responsible for the superior thermal performance and long-term stability of the LIB.
Multicore-shell nanofiber architecture of polyimide/polyvinylidene fluoride blend for thermal and long-term stability of lithium ion battery separator

PubMed Central

Park, Sejoon; Son, Chung Woo; Lee, Sungho; Kim, Dong Young; Park, Cheolmin; Eom, Kwang Sup; Fuller, Thomas F.; Joh, Han-Ik; Jo, Seong Mu

2016-01-01

Li-ion battery, separator, multicoreshell structure, thermal stability, long-term stability. A nanofibrous membrane with multiple cores of polyimide (PI) in the shell of polyvinylidene fluoride (PVdF) was prepared using a facile one-pot electrospinning technique with a single nozzle. Unique multicore-shell (MCS) structure of the electrospun composite fibers was obtained, which resulted from electrospinning a phase-separated polymer composite solution. Multiple PI core fibrils with high molecular orientation were well-embedded across the cross-section and contributed remarkable thermal stabilities to the MCS membrane. Thus, no outbreaks were found in its dimension and ionic resistance up to 200 and 250 °C, respectively. Moreover, the MCS membrane (at ~200 °C), as a lithium ion battery (LIB) separator, showed superior thermal and electrochemical stabilities compared with a widely used commercial separator (~120 °C). The average capacity decay rate of LIB for 500 cycles was calculated to be approximately 0.030 mAh/g/cycle. This value demonstrated exceptional long-term stability compared with commercial LIBs and with two other types (single core-shell and co-electrospun separators incorporating with functionalized TiO2) of PI/PVdF composite separators. The proper architecture and synergy effects of multiple PI nanofibrils as a thermally stable polymer in the PVdF shell as electrolyte compatible polymers are responsible for the superior thermal performance and long-term stability of the LIB. PMID:27833132
Comparing an FPGA to a Cell for an Image Processing Application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ngo, Hau; Broussard, Randy P.; Ives, Robert W.

2010-12-01

Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to discover the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, Iris Recognition Systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.
An auxiliary graph based dynamic traffic grooming algorithm in spatial division multiplexing enabled elastic optical networks with multi-core fibers

NASA Astrophysics Data System (ADS)

Zhao, Yongli; Tian, Rui; Yu, Xiaosong; Zhang, Jiawei; Zhang, Jie

2017-03-01

A proper traffic grooming strategy in dynamic optical networks can improve the utilization of bandwidth resources. An auxiliary graph (AG) is designed to solve the traffic grooming problem under a dynamic traffic scenario in spatial division multiplexing enabled elastic optical networks (SDM-EON) with multi-core fibers. Five traffic grooming policies achieved by adjusting the edge weights of an AG are proposed and evaluated through simulation: maximal electrical grooming (MEG), maximal optical grooming (MOG), maximal SDM grooming (MSG), minimize virtual hops (MVH), and minimize physical hops (MPH). Numeric results show that each traffic grooming policy has its own features. Among different traffic grooming policies, an MPH policy can achieve the lowest bandwidth blocking ratio, MEG can save the most transponders, and MSG can obtain the fewest cores for each request.
Architected Lattices with High Stiffness and Toughness via Multicore-Shell 3D Printing.

PubMed

Mueller, Jochen; Raney, Jordan R; Shea, Kristina; Lewis, Jennifer A

2018-03-01

The ability to create architected materials that possess both high stiffness and toughness remains an elusive goal, since these properties are often mutually exclusive. Natural materials, such as bone, overcome such limitations by combining different toughening mechanisms across multiple length scales. Here, a new method for creating architected lattices composed of core-shell struts that are both stiff and tough is reported. Specifically, these lattices contain orthotropic struts with flexible epoxy core-brittle epoxy shell motifs in the absence and presence of an elastomeric silicone interfacial layer, which are fabricated by a multicore-shell, 3D printing technique. It is found that architected lattices produced with a flexible core-elastomeric interface-brittle shell motif exhibit both high stiffness and toughness. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
muBLASTP: database-indexed protein sequence search on multicore CPUs.

PubMed

Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

2016-11-04

The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing pose different challenges and have different characteristics, the existing techniques for query-indexed search cannot be applied to database-indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers hits identical to those returned by NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedup for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for the protein database and associated optimizations in the BLASTP algorithm, we refactored the BLASTP algorithm for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.
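The difference between query indexing and database indexing can be illustrated with a toy word index, sketched below in Python. This is not muBLASTP's index structure, which the abstract does not describe. It uses exact k-letter word matches only, whereas BLASTP seeding also considers similar (neighbouring) words. The function names, the word length, and the sample sequences are all hypothetical.

```python
from collections import defaultdict


def build_database_index(sequences, k=3):
    """Map every k-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seed_hits(index, query, k=3):
    """Look up each query word in the prebuilt index instead of scanning the database."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, seq_id, dpos))
    return hits


# Toy protein "database"; the index is built once and reused for every query.
database = ["MKVLAAGIV", "GIVKQLMKV"]
index = build_database_index(database, k=3)
hits = find_seed_hits(index, "AGIVK", k=3)
print(hits)
```

With a database index, the per-query cost depends on the query length and the number of matching words rather than on a full pass over the database, at the price of keeping the index in memory.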
Considerations for Future Climate Data Stewardship

NASA Astrophysics Data System (ADS)

Halem, M.; Nguyen, P. T.; Chapman, D. R.

2009-12-01

In this talk, we will describe the lessons learned based on processing and generating a decade of gridded AIRS and MODIS IR sounding data. We describe the challenges faced in accessing and sharing very large data sets, maintaining data provenance under evolving technologies, obtaining access to legacy calibration data and the permanent preservation of Earth science data records for on-demand services. These lessons suggest that a new approach to data stewardship will be required for the next decade of hyperspectral instruments combined with cloud-resolving models. It will not be sufficient for stewards of future data centers to just provide the public with access to archived data. Our experience indicates that data need to reside close to computers with ultra-large disc farms and tens of thousands of processors to deliver complex services on demand over very high speed networks, much like the offerings of search engines today. Over the first decade of the 21st century, petabyte data records were acquired from the AIRS instrument on Aqua and the MODIS instrument on Aqua and Terra. NOAA data centers also maintain petabytes of operational IR sounder data collected over the past four decades. The UMBC Multicore Computational Center (MC2) developed a Service Oriented Atmospheric Radiance gridding system (SOAR) to allow users to select IR sounding instruments from multiple archives and choose space-time-spectral periods of Level 1B data to download, grid, visualize and analyze on demand. Providing this service requires high data rate bandwidth access to the online disks at Goddard. After 10 years, cost-effective disk storage technology finally caught up with the MODIS data volume, making it possible for Level 1B MODIS data to be available online. However, 10GbE fiber optic networks to access large volumes of data are still not available from GSFC to serve the broader community. Data transfer rates are well below 10 MB/s, limiting their usefulness for climate studies.
During this decade, processor performance hit a power wall, leading computer vendors to design multicore processor chips. High performance computer systems obtained petaflop performance by clustering tens of thousands of multicore processor chips. Thus, power consumption and autonomic recovery from processor and disc failures have become major cost and technical considerations for future data archives. To address these new architecture requirements, a transparent parallel programming paradigm, the Hadoop MapReduce cloud computing system, became available as an open-source software system. In addition, the Hadoop File System manages the distribution of data to these processors and backs up the processing in the event of any processor or disc failure. However, to employ this paradigm, the data need to be stored on the computer system. We conclude this talk with a climate data preservation approach that addresses the scalability crisis to exabyte data requirements for the next decade based on projections of processor, disc data density and bandwidth doubling rates.
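The MapReduce paradigm mentioned above can be sketched for a gridding task in a few lines of plain Python. This is an in-process illustration of the map, shuffle, and reduce phases only: it does not use the Hadoop API and is not the SOAR implementation. The observation tuples, the 1-degree cell size, and all function names are assumptions.

```python
import math
from collections import defaultdict


def map_observation(obs, cell_deg=1.0):
    """Map phase: emit (grid cell, radiance) for one (lat, lon, radiance) sample."""
    lat, lon, radiance = obs
    key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
    return key, radiance


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their grid-cell key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_cell(values):
    """Reduce phase: average the radiances that fell into one grid cell."""
    return sum(values) / len(values)


# Hypothetical observations: (latitude, longitude, radiance).
observations = [
    (10.2, 20.7, 100.0),
    (10.8, 20.1, 110.0),
    (11.5, 20.3, 90.0),
]

grouped = shuffle(map(map_observation, observations))
gridded = {cell: reduce_cell(values) for cell, values in grouped.items()}
print(gridded)
```

In a distributed framework, the map and reduce functions run on the nodes that hold the data blocks, which is why the abstract stresses that the data must already reside on the computing system.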
Science Notes.

ERIC Educational Resources Information Center

School Science Review, 1990

1990-01-01

Presented are 25 science activities on colorations of prey, evolution, blood, physiology, nutrition, enzyme kinetics, leaf pigments, analytical chemistry, milk, proteins, fermentation, surface effects of liquids, magnetism, drug synthesis, solvents, wintergreen synthesis, chemical reactions, multicore cables, diffraction, air resistance,…
Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.

In this paper, we study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver on performance. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread count. We also developed a bounds-and-bottlenecks performance model of the solver which we used to guide us through the optimization effort, and also carried out performance tuning in the solver’s large parameter space. Finally, as a result, significant speedups were obtained on both machines.
Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schmalz, Mark S

2011-07-24

Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation G̲ for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G → G̲, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy.
Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.
DKIST Adaptive Optics System: Simulation Results

NASA Astrophysics Data System (ADS)

Marino, Jose; Schmidt, Dirk

2016-05-01

The 4 m class Daniel K. Inouye Solar Telescope (DKIST), currently under construction, will be equipped with an ultra high order solar adaptive optics (AO) system. The requirements and capabilities of such a solar AO system are beyond those of any other solar AO system currently in operation. We must rely on solar AO simulations to estimate and quantify its performance. We present performance estimation results of the DKIST AO system obtained with a new solar AO simulation tool. This simulation tool is a flexible and fast end-to-end solar AO simulator which produces accurate solar AO simulations while taking advantage of current multi-core computer technology. It relies on full imaging simulations of the extended field Shack-Hartmann wavefront sensor (WFS), which directly includes important secondary effects such as field dependent distortions and varying contrast of the WFS sub-aperture images.
Parallelization of combinatorial search when solving knapsack optimization problem on computing systems based on multicore processors

NASA Astrophysics Data System (ADS)

Rahman, P. A.

2018-05-01

This paper deals with a model of the knapsack optimization problem and a method for solving it based on directed combinatorial search in the boolean space. The author's specialized mathematical model for decomposing the search-zone into separate search-spheres, and the algorithm for distributing the search-spheres among the cores of a multi-core processor, are also discussed. The paper provides an example of decomposing the search-zone into several search-spheres and distributing them among the cores of a quad-core processor. Finally, the author gives a formula for estimating the theoretical maximum computational acceleration that can be achieved by parallelizing the search over the search-spheres on an unlimited number of processor cores.
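The general idea of splitting a boolean search space into independent pieces and handing each piece to a worker can be sketched as follows in Python. This is not the author's search-sphere model or directed search, neither of which the abstract specifies. It uses plain exhaustive search, splits the space by fixing the first few decision bits, and all names and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def best_in_subspace(prefix, weights, values, capacity):
    """Exhaustively search every 0/1 vector that starts with `prefix`."""
    free = len(weights) - len(prefix)
    best_value, best_choice = -1, None
    for tail in product((0, 1), repeat=free):
        choice = prefix + tail
        weight = sum(w for w, x in zip(weights, choice) if x)
        if weight <= capacity:
            value = sum(v for v, x in zip(values, choice) if x)
            if value > best_value:
                best_value, best_choice = value, choice
    return best_value, best_choice


def solve_knapsack(weights, values, capacity, prefix_bits=2, workers=4):
    """Split the space into 2**prefix_bits subspaces and search them concurrently."""
    prefixes = list(product((0, 1), repeat=prefix_bits))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda p: best_in_subspace(p, weights, values, capacity), prefixes
        )
        # Combine the per-subspace optima into the global optimum.
        return max(results, key=lambda r: r[0])


# Hypothetical instance: four items, capacity 5.
result = solve_knapsack([2, 3, 4, 5], [3, 4, 5, 6], capacity=5)
print(result)
```

Threads are used here only to keep the sketch short and portable. In CPython they do not run CPU-bound work in parallel, so a process pool (for example `ProcessPoolExecutor` with a module-level worker function) would be the natural substitute on a real multicore machine.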
Passing in Command Line Arguments and Parallel Cluster/Multicore Batching in R with batch.

PubMed

Hoffmann, Thomas J

2011-03-01

It is often useful to rerun a command line R script with some slight change in the parameters used to run it - a new set of parameters for a simulation, a different dataset to process, etc. The R package batch provides a means to pass in multiple command line options, including vectors of values in the usual R format, easily into R. The same script can be set up to run things in parallel via different command line arguments. The R package batch also provides a means to simplify this parallel batching by allowing one to use R and an R-like syntax for arguments to spread a script across a cluster or local multicore/multiprocessor computer, with automated syntax for several popular cluster types. Finally, it provides a means to aggregate the results of multiple processes run on a cluster.
Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

PubMed

Wu, Xin; Koslowski, Axel; Thiel, Walter

2012-07-10

In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.

Virtual optical network mapping and core allocation in elastic optical networks using multi-core fibers

NASA Astrophysics Data System (ADS)

Xuan, Hejun; Wang, Yuping; Xu, Zhanqi; Hao, Shanshan; Wang, Xiaoli

2017-11-01

Virtualization technology can greatly improve the efficiency of networks by allowing virtual optical networks to share the resources of the physical networks. However, it faces some challenges, such as finding efficient strategies for virtual node mapping, virtual link mapping and spectrum assignment. The problem is even more complex and challenging when the physical elastic optical networks use multi-core fibers. To tackle these challenges, we establish a constrained optimization model to determine the optimal schemes of optical network mapping, core allocation and spectrum assignment. To solve the model efficiently, a tailor-made encoding scheme and crossover and mutation operators are designed. Based on these, an efficient genetic algorithm is proposed to obtain the optimal schemes of virtual node mapping, virtual link mapping and core allocation. The simulation experiments are conducted on three widely used networks, and the experimental results show the effectiveness of the proposed model and algorithm.
Adaptive multiphoton endomicroscopy through a dynamically deformed multicore optical fiber using proximal detection.

PubMed

Warren, Sean C; Kim, Youngchan; Stone, James M; Mitchell, Claire; Knight, Jonathan C; Neil, Mark A A; Paterson, Carl; French, Paul M W; Dunsby, Chris

2016-09-19

This paper demonstrates multiphoton excited fluorescence imaging through a polarisation maintaining multicore fiber (PM-MCF) while the fiber is dynamically deformed using all-proximal detection. Single-shot proximal measurement of the relative optical path lengths of all the cores of the PM-MCF in double pass is achieved using a Mach-Zehnder interferometer read out by a scientific CMOS camera operating at 416 Hz. A non-linear least squares fitting procedure is then employed to determine the deformation-induced lateral shift of the excitation spot at the distal tip of the PM-MCF. An experimental validation of this approach is presented that compares the proximally measured deformation-induced lateral shift in focal spot position to an independent distally measured ground truth. The proximal measurement of deformation-induced shift in focal spot position is applied to correct for deformation-induced shifts in focal spot position during raster-scanning multiphoton excited fluorescence imaging.
All-fiber orbital angular momentum mode generation and transmission system

NASA Astrophysics Data System (ADS)

Heng, Xiaobo; Gan, Jiulin; Zhang, Zhishen; Qian, Qi; Xu, Shanhui; Yang, Zhongmin

2017-11-01

We proposed and demonstrated an all-fiber system for generating and transmitting orbital angular momentum (OAM) mode light. A specially designed multi-core fiber (MCF) was used to impart different phase changes to the guided modes, and two tapered transition regions were used to provide low-loss interfaces between different fiber structures. By arranging the refractive index distribution among the cores and controlling the length of the MCF, which essentially changes the phase difference between neighboring cores, OAM modes with different topological charge l can be generated selectively. Through the two tapered transition regions, the non-OAM mode light can be effectively injected into the MCF and the generated OAM mode light can be easily launched into an OAM-mode-supporting fiber for long-distance, high-purity transmission. Such an all-fiber OAM mode generation and transmission system has the merits of flexibility, compactness and portability, and would have practical application value in OAM optical fiber communication systems.
On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences

PubMed Central

Thiyagalingam, Jeyarajan; Goodman, Daniel; Schnabel, Julia A.; Trefethen, Anne; Grau, Vicente

2011-01-01

Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in the resolution and dimensionality of images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. To this end, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate the mapping of an enhanced motion estimation algorithm to novel GPU-specific architectures, and the resulting challenges and benefits. Using a database of three-dimensional image sequences, we show that the mapping leads to substantial performance gains, up to a factor of 60, and can provide a near-real-time experience. We also show how architectural peculiarities of these devices can best be exploited to the benefit of algorithms, most specifically for addressing the challenges related to their access patterns and different memory configurations. Finally, we evaluate the performance of the algorithm on three different GPU architectures and perform a comprehensive analysis of the results. PMID:21869880
Real-time 3D adaptive filtering for portable imaging systems

NASA Astrophysics Data System (ADS)

Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark

2015-03-01

Portable imaging devices have proven valuable for emergency medical services both in the field and hospital environments and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but is computationally very demanding and hence often not able to run with sufficient performance on a portable platform. In recent years, advanced multicore DSPs have been introduced that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms like 3D adaptive filtering, improving the image quality of portable medical imaging devices. In this study, the performance of a 3D adaptive filtering algorithm on a digital signal processor (DSP) is investigated. The performance is assessed by filtering a volume of size 512x256x128 voxels sampled at a pace of 10 MVoxels/sec.
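The two figures quoted at the end of this abstract imply a per-volume time budget. The short sketch below works through that arithmetic; it assumes the 10 MVoxels/s figure is the sustained rate the filter must keep up with, which the abstract does not state explicitly, and the function names are illustrative only.

```python
# Worked arithmetic for the figures quoted in the abstract.
# Assumption: 10 MVoxels/s is the sustained rate the filter must match.

VOLUME_SHAPE = (512, 256, 128)     # voxels per axis, from the abstract
THROUGHPUT_VOXELS_PER_S = 10e6     # 10 MVoxels/s, from the abstract


def voxels_in_volume(shape):
    """Return the total number of voxels in a volume of the given shape."""
    total = 1
    for extent in shape:
        total *= extent
    return total


def seconds_per_volume(shape, throughput):
    """Return the time available to process one volume at the given rate."""
    return voxels_in_volume(shape) / throughput


if __name__ == "__main__":
    print("voxels per volume:", voxels_in_volume(VOLUME_SHAPE))
    print("seconds per volume: %.3f"
          % seconds_per_volume(VOLUME_SHAPE, THROUGHPUT_VOXELS_PER_S))
```

Under that reading, one 512x256x128 volume holds about 16.8 million voxels and must be filtered in roughly 1.7 seconds to keep pace with acquisition.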
High performance in silico virtual drug screening on many-core processors.

PubMed

McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-05-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
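The sustained rate and the efficiency quoted above together imply a peak figure for the GPU. The sketch below only restates that arithmetic; the helper name is hypothetical, and the result is what the abstract's two numbers imply rather than a vendor specification.

```python
# Implied peak performance from a sustained rate and a fraction of peak.
# Both inputs come from the abstract; the helper name is illustrative.

SUSTAINED_TFLOPS = 1.43    # sustained rate reported for BUDE on one GTX 680
FRACTION_OF_PEAK = 0.46    # "46% of peak performance"


def implied_peak(sustained, fraction):
    """Return the peak rate implied by a sustained rate at a given efficiency."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return sustained / fraction


if __name__ == "__main__":
    peak = implied_peak(SUSTAINED_TFLOPS, FRACTION_OF_PEAK)
    print("implied peak: %.2f TFLOP/s" % peak)
```

The two figures are consistent with a single-precision peak of roughly 3.1 TFLOP/s.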
High performance in silico virtual drug screening on many-core processors

PubMed Central

Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-01-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727
A hybrid algorithm for parallel molecular dynamics simulations

NASA Astrophysics Data System (ADS)

Mangiardi, Chris M.; Meyer, R.

2017-10-01

This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Tablet—next generation sequence assembly visualization

PubMed Central

Milne, Iain; Bayer, Micha; Cardle, Linda; Shaw, Paul; Stephen, Gordon; Wright, Frank; Marshall, David

2010-01-01

Summary: Tablet is a lightweight, high-performance graphical viewer for next-generation sequence assemblies and alignments. Supporting a range of input assembly formats, Tablet provides high-quality visualizations showing data in packed or stacked views, allowing instant access and navigation to any region of interest, and whole contig overviews and data summaries. Tablet is both multi-core aware and memory efficient, allowing it to handle assemblies containing millions of reads, even on a 32-bit desktop machine. Availability: Tablet is freely available for Microsoft Windows, Apple Mac OS X, Linux and Solaris. Fully bundled installers can be downloaded from http://bioinf.scri.ac.uk/tablet in 32- and 64-bit versions. Contact: tablet@scri.ac.uk PMID:19965881
A fast ultrasonic simulation tool based on massively parallel implementations

NASA Astrophysics Data System (ADS)

Lambert, Jason; Rougeron, Gilles; Lacassagne, Lionel; Chatillon, Sylvain

2014-02-01

This paper presents a CIVA optimized ultrasonic inspection simulation tool, which takes advantage of the power of massively parallel architectures: graphics processing units (GPUs) and multi-core general purpose processors (GPPs). This tool is based on the classical approach used in CIVA: the interaction model is based on the Kirchhoff approximation, and the ultrasonic field around the defect is computed by the pencil method. The model has been adapted and parallelized for both architectures. At this stage, the configurations addressed by the tool are: multi- and mono-element probes, planar specimens made of simple isotropic materials, and planar rectangular defects or side-drilled holes of small diameter. Validations of the model accuracy and performance measurements are presented.
Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST.

PubMed

Baele, Guy; Lemey, Philippe; Rambaut, Andrew; Suchard, Marc A

2017-06-15

Advances in sequencing technology continue to deliver increasingly large molecular sequence datasets that are often heavily partitioned in order to accurately model the underlying evolutionary processes. In phylogenetic analyses, partitioning strategies involve estimating conditionally independent models of molecular evolution for different genes and different positions within those genes, requiring a large number of evolutionary parameters that have to be estimated, leading to an increased computational burden for such analyses. The past two decades have also seen the rise of multi-core processors, in both the central processing unit (CPU) and graphics processing unit (GPU) markets, enabling massively parallel computations that are not yet fully exploited by many software packages for multipartite analyses. We here propose a Markov chain Monte Carlo (MCMC) approach using an adaptive multivariate transition kernel to estimate in parallel a large number of parameters, split across partitioned data, by exploiting multi-core processing. Across several real-world examples, we demonstrate that our approach enables the estimation of these multipartite parameters more efficiently than standard approaches that typically use a mixture of univariate transition kernels. In one case, when estimating the relative rate parameter of the non-coding partition in a heterochronous dataset, MCMC integration efficiency improves by > 14-fold. Our implementation is part of the BEAST code base, a widely used open source software package to perform Bayesian phylogenetic inference. Contact: guy.baele@kuleuven.be. Supplementary data are available at Bioinformatics online.
Hierarchical Parallelization of Gene Differential Association Analysis

PubMed Central

2011-01-01

Background: Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results: Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. Conclusions: The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels. PMID:21936916
Hierarchical parallelization of gene differential association analysis.

PubMed

Needham, Mark; Hu, Rui; Dwarkadas, Sandhya; Qiu, Xing

2011-09-21

Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.
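The selection principle stated at the end of this abstract can be sketched as a small search. This is only an illustration under simplifying assumptions that are not from the paper: all cores of a node share one cache, every core is used, each MPI process has a fixed working-set size that does not grow with its thread count, and the function name `threads_per_process` is hypothetical.

```python
# Sketch: pick the fewest threads per MPI process such that the working
# sets of all processes on one node fit in the shared cache.
# Assumptions (not from the paper): one shared cache per node, all cores
# used, and a per-process working set independent of the thread count.

MIB = 1024 * 1024


def threads_per_process(cores, cache_bytes, working_set_bytes):
    """Return the smallest thread count per process that fits, or None."""
    for threads in range(1, cores + 1):
        if cores % threads != 0:
            continue                      # keep every core busy
        processes = cores // threads      # MPI processes on this node
        if processes * working_set_bytes <= cache_bytes:
            return threads
    return None                           # even one process does not fit


if __name__ == "__main__":
    # Hypothetical node: 8 cores, 8 MiB shared cache, 3 MiB per process.
    print(threads_per_process(8, 8 * MIB, 3 * MIB))
```

With these made-up numbers, eight or four single- or dual-threaded processes would overflow the cache, so the search settles on two processes of four threads each. Real working-set sizes would have to be measured for the application at hand.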
A parallel and sensitive software tool for methylation analysis on multicore platforms.

PubMed

Tárraga, Joaquín; Pérez, Mariano; Orduña, Juan M; Duato, José; Medina, Ignacio; Dopazo, Joaquín

2015-10-01

DNA methylation analysis suffers from very long processing times, as the advent of Next-Generation Sequencers has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that performs the analysis of these samples. The existing software for methylation analysis does not seem to scale efficiently with either the size of the dataset or the length of the reads to be analyzed. As it is expected that the sequencers will provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. We present a new software tool, called HPG-Methyl, which efficiently maps bisulphite sequencing reads on DNA, analyzing DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is exclusively employed to deal with the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms state-of-the-art software such as Bismark, BS-Seeker or BSMAP in both execution time and sensitivity, particularly for long bisulphite reads. Availability: software in the form of C libraries and functions, together with instructions to compile and execute this software, is available by sftp to anonymous@clariano.uv.es (password 'anonymous'). Contact: juan.orduna@uv.es or jdopazo@cipf.es.
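For readers unfamiliar with the Smith-Waterman algorithm mentioned above, the following is a minimal sketch of its local-alignment score with a linear gap penalty. It is a textbook version, not HPG-Methyl's implementation: the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, and no bisulphite-specific handling is included.

```python
# Minimal Smith-Waterman local alignment score (linear gap penalty).
# Scoring parameters are illustrative assumptions, not HPG-Methyl's.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)    # scores for the previous row
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTGG", "ACGT"))
```

The dynamic program costs time proportional to the product of the two sequence lengths, which is why the abstract describes reserving it for the reads that the faster Burrows-Wheeler mapping leaves ambiguous.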
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

PubMed Central

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks are one valuable type of biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphics Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and that it outperforms the multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
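The correlation step in the pipeline described above can be sketched on the CPU as follows. This is a plain-Python illustration, not FastGCN's GPU code: the gene names and expression values are made up, and storing only the unique gene pairs is a choice made here for brevity, not a description of how the tool compresses its coefficient matrix.

```python
# Sketch of pairwise Pearson correlation between gene expression profiles.
# Plain-Python illustration with made-up data; not FastGCN's GPU code.
import math


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        return 0.0    # constant profile: undefined, reported as 0 here
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def coexpression_pairs(expression):
    """Return {(gene_a, gene_b): r} for every unordered pair of genes."""
    genes = list(expression)
    return {
        (g, h): pearson(expression[g], expression[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical expression values for three genes across four samples.
    data = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    for pair, r in coexpression_pairs(data).items():
        print(pair, round(r, 3))
```

The number of pairs grows quadratically with the number of genes (about 128 million for 16,000 genes), and each pair is independent of the others, which is what makes this step a natural fit for GPU parallelism.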
FastGCN: a GPU accelerated tool for fast gene co-expression networks.

PubMed

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks are one valuable type of biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphics Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and that it outperforms the multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
Wetland Mitigation Monitoring at the Fernald Preserve - 13200

DOE Office of Scientific and Technical Information (OSTI.GOV)

Powell, Jane; Bien, Stephanie; Decker, Ashlee

The U.S. Department of Energy is responsible for 7.2 hectares (17.8 acres) of mitigation wetland at the Fernald Preserve, Ohio. Remedial activities affected the wetlands, and mitigation plans were incorporated into site-wide ecological restoration planning. In 2008, the Fernald Natural Resource Trustees developed a comprehensive wetland mitigation monitoring approach to evaluate whether compensatory mitigation requirements have been met. The Fernald Preserve Wetland Mitigation Monitoring Plan provided a guideline for wetland evaluations. The Ohio Environmental Protection Agency (Ohio EPA) wetland mitigation monitoring protocols were adopted as the means for compensatory wetland evaluation. Design, hydrologic regime, vegetation, wildlife, and biogeochemistry were evaluated from 2009 to 2011. Evaluations showed mixed results when compared to the Ohio EPA performance standards. Results of vegetation monitoring varied, with the best results occurring in wetlands adjacent to forested areas. Amphibians, particularly ambystomatid salamanders, were observed in two areas adjacent to forested areas. Not all wetlands met vegetation performance standards and amphibian biodiversity metrics. However, Fernald mitigation wetlands showed substantially higher ratings compared to other mitigated wetlands in Ohio. Also, soil sampling results remain consistent with other Ohio mitigated wetlands. The performance standards are not intended to be 'pass/fail' criteria; rather, they are reference points for use in making decisions regarding future monitoring and maintenance. The Trustees approved the Fernald Preserve Wetland Mitigation Monitoring Report with the provision that long-term monitoring of the wetlands continues at the Fernald Preserve. (authors)
Wireless Interconnects for Intra-chip & Inter-chip Transmission

NASA Astrophysics Data System (ADS)

Narde, Rounak Singh

With the emergence of the Internet of Things and the information revolution, the demand for high performance computing systems is increasing. The copper interconnects inside computing chips have evolved into a sophisticated network of interconnects known as a Network on Chip (NoC), comprising routers, switches and repeaters, just like computer networks. When a network on chip is implemented on a large scale, as in Multicore Multichip (MCMC) systems for High Performance Computing (HPC), the length of the interconnects increases, and so do problems such as power dissipation, interconnect delays, clock synchronization and electrical noise. In this thesis, wireless interconnects are chosen as the substitute for wired copper interconnects. Wireless interconnects offer easy integration with CMOS fabrication and chip packaging. Using wireless interconnects working in the unlicensed mm-wave band (57-64 GHz), high data rates in the Gbps range can be achieved. This thesis presents a study of transmission between zigzag antennas as wireless interconnects for MCMC systems and 3D ICs. For MCMC systems, a four-chip, 16-core model is analyzed with only four wireless interconnects in three configurations with different antenna orientations and locations. Return loss and transmission coefficients are simulated in ANSYS HFSS. Moreover, wireless interconnects are designed, fabricated and tested on a 6-inch silicon wafer with a resistivity of 55 Ω-cm using a basic standard CMOS process. The wireless interconnects are designed to work at 30 GHz using ANSYS HFSS. The fabricated antennas resonate around 20 GHz with a return loss of less than -10 dB. The transmission coefficient between an antenna pair within a 20 mm x 20 mm silicon die is found to vary between -45 dB and -55 dB. Furthermore, the wireless interconnect approach is extended to 3D ICs, where the wireless interconnects are implemented as zigzag antennas. This thesis extends the work of analyzing wireless interconnects in 3D ICs with different configurations of antenna orientations and coolants. The return loss and transmission coefficients are simulated using ANSYS HFSS.
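The decibel figures in this abstract are easier to compare as linear power ratios. The sketch below performs that conversion, assuming the quoted values are S-parameter magnitudes in dB (so the power ratio is 10 raised to dB/10); the function name is illustrative.

```python
# Convert the dB figures quoted in the abstract to linear power ratios.
# Assumption: values are S-parameter magnitudes in dB, so the power
# ratio is 10 ** (dB / 10).

def db_to_power_ratio(db):
    """Return the linear power ratio corresponding to a value in dB."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    # Reflection of -10 dB: fraction of incident power reflected back.
    print("reflected fraction at -10 dB: %.3f" % db_to_power_ratio(-10))
    # Transmission between the antenna pair, as quoted in the abstract.
    for level in (-45, -55):
        print("received fraction at %d dB: %.2e"
              % (level, db_to_power_ratio(level)))
```

Under that assumption, a reflection below -10 dB means at most a tenth of the incident power is reflected, while a transmission coefficient of -45 dB to -55 dB corresponds to roughly 3 parts in 100,000 down to 3 parts in a million of the transmitted power reaching the receiving antenna.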
Is wetland mitigation successful in Southern California?

NASA Astrophysics Data System (ADS)

Cummings, D. L.; Rademacher, L. K.

2004-12-01

Wetlands perform many vital functions within their landscape position; they provide unique habitats for a variety of flora and fauna and they act as treatment systems for upstream natural and anthropogenic waste. California has lost an estimated 91% of its wetlands. Despite the 1989 "No Net Loss" policy and mitigation requirements by the regulatory agencies, the implemented mitigation may not be offsetting wetlands losses. The "No Net Loss" policy is likely failing for numerous reasons related to processes in the wetlands themselves and the policies governing their recovery. Of particular interest is whether these mitigation sites are performing essential wetlands functions. Specific questions include: 1) Are hydric soil conditions forming in mitigation sites; and, 2) are the water quality-related chemical transformations that occur in natural wetlands observed in mitigation sites. This study focuses on success (or lack of success) in wetlands mitigation sites in Southern California. Soil and water quality investigations were conducted in wetland mitigation sites deemed to be successful by vegetation standards. Observations of the Standard National Resource Conservation Service field indicators of reducing conditions were made to determine whether hydric soil conditions have developed in the five or more years since the implementation of mitigation plans. In addition, water quality measurements were performed at the inlet and outlet of these mitigation sites to determine whether these sites perform similar water quality transformations to natural wetlands within the same ecosystem. Water quality measurements included nutrient, trace metal, and carbon species measurements. A wetland location with minimal anthropogenic changes and similar hydrologic and vegetative features was used as a control site. All sites selected for study are within a similar ecosystem, in the interior San Diego and western Riverside Counties, in Southern California.
Mitigating Multipath Bias Using a Dual-Polarization Antenna: Theoretical Performance, Algorithm Design, and Simulation

PubMed Central

Xie, Lin; Cui, Xiaowei; Zhao, Sihao; Lu, Mingquan

2017-01-01

It is well known that the multipath effect remains a dominant error source that affects the positioning accuracy of Global Navigation Satellite System (GNSS) receivers. Significant efforts have been made by researchers and receiver manufacturers to mitigate multipath error in the past decades. Recently, multipath mitigation using dual-polarization antennas has become a research hotspot because it provides another degree of freedom to distinguish the line-of-sight (LOS) signal from the LOS and multipath composite signal without extensively increasing the complexity of the receiver. A number of multipath mitigation techniques using dual-polarization antennas have been proposed, and all of them report performance improvement over the single-polarization methods. However, due to the unpredictability of multipath, multipath mitigation techniques based on dual-polarization are not always effective, and few studies discuss the conditions under which multipath mitigation using a dual-polarization antenna can outperform that using a single-polarization antenna, which is a fundamental question for dual-polarization multipath mitigation (DPMM) and the design of multipath mitigation algorithms. In this paper we analyze the characteristics of the signal received by a dual-polarization antenna and use maximum likelihood estimation (MLE) to assess the theoretical performance of DPMM in different received signal cases. Based on the assessment we answer this fundamental question and establish the dual-polarization antenna's capability in mitigating short-delay multipath, the most challenging type of multipath for the majority of multipath mitigation techniques. Considering these effective conditions, we propose a dual-polarization sequential iterative maximum likelihood estimation (DP-SIMLE) algorithm for DPMM. The simulation results verify our theory and show superior performance of the proposed DP-SIMLE algorithm over the traditional one using only an RHCP antenna. PMID:28208832

DPM — efficient storage in diverse environments

NASA Astrophysics Data System (ADS)

Hellmich, Martin; Furano, Fabrizio; Smith, David; Brito da Rocha, Ricardo; Álvarez Ayllón, Alejandro; Manzi, Andrea; Keeble, Oliver; Calvet, Ivan; Regala, Miguel Antonio

2014-06-01

Recent developments, including low power devices, cluster file systems and cloud storage, represent an explosion in the possibilities for deploying and managing grid storage. In this paper we present how different technologies can be leveraged to build a storage service with differing cost, power, performance, scalability and reliability profiles, using the popular storage solution Disk Pool Manager (DPM/dmlite) as the enabling technology. The storage manager DPM is designed for these new environments, allowing users to scale up and down as they need, and optimizing their computing centers' energy efficiency and costs. DPM runs on high-performance machines, profiting from multi-core and multi-CPU setups. It supports separating the database from the metadata server, the head node, largely reducing its hard disk requirements. Since version 1.8.6, DPM is released in EPEL and Fedora, simplifying distribution and maintenance, but also supporting the ARM architecture besides i386 and x86_64, allowing it to run on the smallest low-power machines such as the Raspberry Pi or the CuBox. This usage is facilitated by the possibility to scale horizontally using a main database and a distributed memcached-powered namespace cache. Additionally, DPM supports a variety of storage pools in the backend, most importantly HDFS, S3-enabled storage, and cluster file systems, allowing users to fit their DPM installation exactly to their needs. In this paper, we investigate the power-efficiency and total cost of ownership of various DPM configurations. We develop metrics to evaluate the expected performance of a setup both in terms of namespace and disk access, considering the overall cost including equipment, power consumption, or data/storage fees. The setups tested range from the lowest scale using Raspberry Pis with only 700 MHz single cores and 100 Mbps network connections, over conventional multi-core servers to typical virtual machine instances in cloud settings. We evaluate the combinations of different name server setups, for example load-balanced clusters, with different storage setups, from using a classic local configuration to private and public clouds.
Benchmarking the ATLAS software through the Kit Validation engine

NASA Astrophysics Data System (ADS)

De Salvo, Alessandro; Brasolin, Franco

2010-04-01

Measuring the performance of the experiment software is very important for choosing the most effective resources to be used and for discovering the bottlenecks of the code implementation. In this work we present the benchmark techniques used to measure the ATLAS software performance through the ATLAS offline testing engine Kit Validation and the online portal Global Kit Validation. The performance measurements, the data collection, and the online analysis and display of the results will be presented. The results of the measurements on different platforms and architectures will be shown, giving a full report on the CPU power and memory consumption of the Monte Carlo generation, simulation, digitization and reconstruction of the most CPU-intensive channels. The impact of multi-core computing on the ATLAS software performance will also be presented, comparing the behavior of different architectures when increasing the number of concurrent processes. The benchmark techniques described in this paper have been used in the HEPiX group since the beginning of 2008 to help define the performance metrics for High Energy Physics applications, based on the real experiment software.
Automatic detection and classification of obstacles with applications in autonomous mobile robots

NASA Astrophysics Data System (ADS)

Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.

2016-04-01

A hardware implementation of automatic detection and classification of objects that can represent an obstacle for an autonomous mobile robot, using stereo vision algorithms, is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. This method is divided into two parts. The first is the object detection step, based on the distance from the objects to the camera and a BLOB analysis. The second is the classification step, based on visual primitives and an SVM classifier. The proposed method is run on a GPU in order to reduce the processing time. The implementation uses hardware based on multi-core processors and a GPU platform, with an NVIDIA® GeForce® GT640 graphics card and Matlab on a PC running Windows 10.
Parallel, distributed and GPU computing technologies in single-particle electron microscopy

PubMed Central

Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

2009-01-01

Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686
Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems

PubMed Central

Wang, Kaibo; Huai, Yin; Lee, Rubao; Wang, Fusheng; Zhang, Xiaodong; Saltz, Joel H.

2012-01-01

As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge amount of segmented micro-anatomic objects demands extremely data- and compute-intensive operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory since their implementations of spatial operations cannot fully utilize the power of modern parallel hardware. In this paper, we provide a customized software solution that exploits GPUs and multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. Our solution consists of an efficient GPU algorithm and a pipelined system framework with task migration support. Extensive experiments with real-world data sets demonstrate the effectiveness of our solution, which improves the performance of spatial cross-comparison by over 18 times compared with a parallelized spatial database approach. PMID:23355955
Experiments with a Parallel Multi-Objective Evolutionary Algorithm for Scheduling

NASA Technical Reports Server (NTRS)

Brown, Matthew; Johnston, Mark D.

2013-01-01

Evolutionary multi-objective algorithms have great potential for scheduling in those situations where tradeoffs among competing objectives represent a key requirement. One challenge, however, is runtime performance, as a consequence of evolving not just a single schedule, but an entire population, while attempting to sample the Pareto frontier as accurately and uniformly as possible. The growing availability of multi-core processors in end user workstations, and even laptops, has raised the question of the extent to which such hardware can be used to speed up evolutionary algorithms. In this paper we report on early experiments in parallelizing a Generalized Differential Evolution (GDE) algorithm for scheduling long-range activities on NASA's Deep Space Network. Initial results show that significant speedups can be achieved, but that performance does not necessarily improve as more cores are utilized. We describe our preliminary results and some initial suggestions from parallelizing the GDE algorithm. Directions for future work are outlined.
Vectorization, threading, and cache-blocking considerations for hydrocodes on emerging architectures

DOE PAGES

Fung, J.; Aulwes, R. T.; Bement, M. T.; ...

2015-07-14

This work reports on considerations for improving computational performance in preparation for current and expected changes to computer architecture. The algorithms studied will include increasingly complex prototypes for radiation hydrodynamics codes, such as gradient routines and diffusion matrix assembly (e.g., in [1-6]). The meshes considered for the algorithms are structured or unstructured meshes. The considerations applied for performance improvements are meant to be general in terms of architecture (not specifically graphics processing units (GPUs) or multi-core machines, for example) and include techniques for vectorization, threading, tiling, and cache blocking. Out of a survey of optimization techniques on applications such as diffusion and hydrodynamics, we make general recommendations with a view toward making these techniques conceptually accessible to the applications code developer. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.
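As an illustration of the tiling and cache-blocking idea named in this abstract, the following is a minimal sketch of a blocked matrix transpose. It is not taken from the paper: the function name `transpose_blocked` and the `block` parameter are hypothetical, and plain Python lists will not show the cache benefit that the same loop structure gives in a compiled language. The point is only the loop nest, which visits the matrix one small tile at a time so that the rows being read and the rows being written stay resident in cache.

```python
def transpose_blocked(a, block=32):
    """Transpose a rectangular list-of-lists matrix tile by tile.

    `block` is the (hypothetical) tile edge length; in a compiled code it
    would be tuned so that one source tile and one destination tile fit
    in cache together.
    """
    n_rows = len(a)
    n_cols = len(a[0]) if n_rows else 0
    out = [[None] * n_rows for _ in range(n_cols)]
    # Outer loops step over tiles; inner loops stay inside one tile.
    for i0 in range(0, n_rows, block):
        for j0 in range(0, n_cols, block):
            for i in range(i0, min(i0 + block, n_rows)):
                row = a[i]
                for j in range(j0, min(j0 + block, n_cols)):
                    out[j][i] = row[j]
    return out


if __name__ == "__main__":
    m = [[r * 5 + c for c in range(5)] for r in range(3)]
    for line in transpose_blocked(m, block=2):
        print(line)
```

The `min(...)` bounds handle matrices whose dimensions are not multiples of the tile size, which is the usual edge case in blocked loops.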
Parallel, distributed and GPU computing technologies in single-particle electron microscopy.

PubMed

Schmeisser, Martin; Heisen, Burkhard C; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

2009-07-01

Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined.
2nd Generation QUATARA Flight Computer Project

NASA Technical Reports Server (NTRS)

Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven

2015-01-01

Single core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators and communicating with a ground station. In addition, this flight computer will be designed to be fault tolerant, both by creating a robust physical hardware connection and by using a software voting scheme to determine the processor's performance. This voting scheme will build on the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.
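The abstract does not describe how the software voting scheme works. As a hedged illustration only, a simple strict-majority voter over the outputs of redundant processors could look like the sketch below. The function name `majority_vote` and its return convention are assumptions made for this example, not the QUATARA or SLS design.

```python
from collections import Counter


def majority_vote(outputs):
    """Pick the value reported by a strict majority of processors.

    `outputs` is a list with one hashable result per processor.
    Returns (winner, dissenters), where `dissenters` lists the indices of
    processors that disagreed with the winner. If no value has a strict
    majority, returns (None, all indices) so the caller can treat the
    whole cycle as suspect.
    """
    if not outputs:
        return None, []
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, v in enumerate(outputs) if v != value]
    return value, dissenters


if __name__ == "__main__":
    # Four processors, one of which returns a corrupted result.
    print(majority_vote([7, 7, 9, 7]))
    # A two-against-two split has no strict majority.
    print(majority_vote([1, 1, 2, 2]))
```

With four processors, a single faulty unit is outvoted and identified by index, while an even split is reported as unresolved instead of being decided arbitrarily.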
Scrambled coherent superposition for enhanced optical fiber communication in the nonlinear transmission regime.

PubMed

Liu, Xiang; Chandrasekhar, S; Winzer, P J; Chraplyvy, A R; Tkach, R W; Zhu, B; Taunay, T F; Fishteyn, M; DiGiovanni, D J

2012-08-13

Coherent superposition of light waves has long been used in various fields of science, and recent advances in digital coherent detection and space-division multiplexing have enabled the coherent superposition of information-carrying optical signals to achieve better communication fidelity on amplified-spontaneous-noise limited communication links. However, fiber nonlinearity introduces highly correlated distortions on identical signals and diminishes the benefit of coherent superposition in nonlinear transmission regime. Here we experimentally demonstrate that through coordinated scrambling of signal constellations at the transmitter, together with appropriate unscrambling at the receiver, the full benefit of coherent superposition is retained in the nonlinear transmission regime of a space-diversity fiber link based on an innovatively engineered multi-core fiber. This scrambled coherent superposition may provide the flexibility of trading communication capacity for performance in future optical fiber networks, and may open new possibilities in high-performance and secure optical communications.
Runtime Performance and Virtual Network Control Alternatives in VM-Based High-Fidelity Network Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoginath, Srikanth B; Perumalla, Kalyan S; Henz, Brian J

2012-01-01

In prior work (Yoginath and Perumalla, 2011; Yoginath, Perumalla and Henz, 2012), the motivation, challenges and issues were articulated in favor of virtual time ordering of Virtual Machines (VMs) in network simulations hosted on multi-core machines. Two major components in the overall virtualization challenge are (1) virtual timeline establishment and scheduling of VMs, and (2) virtualization of inter-VM communication. Here, we extend prior work by presenting scaling results for the first component, with experiment results on up to 128 VMs scheduled in virtual time order on a single 12-core host. We also explore the solution space of design alternatives for the second component, and present performance results from a multi-threaded, multi-queue implementation of inter-VM network control for synchronized execution with VM scheduling, incorporated in our NetWarp simulation system.
Crystal MD: The massively parallel molecular dynamics software for metal with BCC structure

NASA Astrophysics Data System (ADS)

Hu, Changjun; Bai, He; He, Xinfu; Zhang, Boyao; Nie, Ningming; Wang, Xianmeng; Ren, Yingwen

2017-02-01

Understanding material irradiation effects is one of the keys to the use of nuclear power. However, the lack of high-throughput irradiation facilities and of knowledge about the evolution process has led to little understanding of these issues. With the help of high-performance computing, we can gain a better understanding of materials at the micro level. In this paper, a new data structure is proposed for the massively parallel simulation of the evolution of metal materials under an irradiation environment. Based on the proposed data structure, we developed new molecular dynamics software named Crystal MD. Simulations with Crystal MD achieved over 90% parallel efficiency in test cases, and it uses more than 25% less memory on multi-core clusters than LAMMPS and IMD, two popular molecular dynamics simulation packages. Using Crystal MD, a two-trillion-particle simulation has been performed on the Tianhe-2 cluster.
Multicore Education through Simulation

ERIC Educational Resources Information Center

Ozturk, O.

2011-01-01

A project-oriented course for advanced undergraduate and graduate students is described for simulating multiple processor cores. Simics, a free simulator for academia, was utilized to enable students to explore computer architecture, operating systems, and hardware/software cosimulation. Motivation for including this course in the curriculum is…
Rigorous study of low-complexity adaptive space-time block-coded MIMO receivers in high-speed mode multiplexed fiber-optic transmission links using few-mode fibers

NASA Astrophysics Data System (ADS)

Weng, Yi; He, Xuan; Wang, Junyi; Pan, Zhongqi

2017-01-01

Spatial-division multiplexing (SDM) techniques have been proposed to increase the capacity of optical fiber transmission links by utilizing multicore fibers or few-mode fibers (FMF). The most challenging impairments of SDM-based long-haul optical links mainly include modal dispersion and mode-dependent loss (MDL), where MDL arises from inline component imperfections and breaks modal orthogonality, thus degrading the capacity of multiple-input multiple-output (MIMO) receivers. To reduce MDL, optical approaches include mode scramblers and specialty fiber designs, but these methods are costly and cannot completely remove the accumulated MDL in the link. Besides, space-time trellis codes (STTC) were proposed to lessen MDL, but suffered from high complexity. In this work, we investigated the performance of a space-time block-coding (STBC) scheme to mitigate MDL in SDM-based optical communication by exploiting space and delay diversity, where weight matrices of frequency-domain equalization (FDE) were updated heuristically using a decision-directed recursive-least-squares (RLS) algorithm for convergence and channel estimation. The STBC was evaluated in a six-mode multiplexed system over 30-km FMF via 6×6 MIMO FDE, with modal gain offset 3 dB, core refractive index 1.49, numerical aperture 0.5. Results show that optical-signal-to-noise ratio (OSNR) tolerance can be improved via STBC by approximately 3.1, 4.9, 7.8 dB for QPSK, 16- and 64-QAM with respective bit-error-rates (BER) and minimum-mean-square-error (MMSE). Besides, we also evaluate the complexity optimization of the STBC decoding scheme with a zero-forcing decision feedback (ZFDF) equalizer by shortening the coding slot length, which is robust to frequency-selective fading channels, and can be scaled up for SDM systems with more dynamic channels.
High Performance Data Transfer for Distributed Data Intensive Sciences

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fang, Chin; Cottrell, R 'Les' A.; Hanushevsky, Andrew B.

We report on the development of ZX software providing high performance data transfer and encryption. The design scales in: computation power, network interfaces, and IOPS while carefully balancing the available resources. Two U.S. patent-pending algorithms help tackle data sets containing lots of small files and very large files, and provide insensitivity to network latency. It has a cluster-oriented architecture, using peer-to-peer technologies to ease deployment, operation, usage, and resource discovery. Its unique optimizations enable effective use of flash memory. Using a pair of existing data transfer nodes at SLAC and NERSC, we compared its performance to that of bbcp and GridFTP and determined that they were comparable. With a proof of concept created using two four-node clusters with multiple distributed multi-core CPUs, network interfaces and flash memory, we achieved 155Gbps memory-to-memory over a 2x100Gbps link aggregated channel and 70Gbps file-to-file with encryption over a 5000 mile 100Gbps link.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.

PubMed

Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu

2015-07-01

Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize the CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multicore implementation using OpenMP on an 8-core processor provides up to 7.7× speed-up. The GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. The developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
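To make the sliding-window idea in this abstract concrete, here is a minimal serial sketch that computes one correlation value per window for a pair of time-courses. It is an assumption-laden illustration, not the authors' algorithm: Pearson correlation is assumed as the connectivity measure, and the names `pearson` and `sliding_window_correlation` and the `window` and `step` parameters are hypothetical. Each window is computed independently of the others, which is the property that lets such a loop be distributed over OpenMP threads or CUDA thread blocks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def sliding_window_correlation(x, y, window, step=1):
    """Return one correlation per window position along two time-courses."""
    if len(x) != len(y):
        raise ValueError("time-courses must have the same length")
    if window < 2 or window > len(x):
        raise ValueError("window must be between 2 and the series length")
    # Every iteration reads only its own slice, so iterations are independent.
    return [
        pearson(x[start:start + window], y[start:start + window])
        for start in range(0, len(x) - window + 1, step)
    ]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    print(sliding_window_correlation(a, b, window=3))
```

For two series of length T and a window of length W with step 1, this yields T - W + 1 values per region pair, so the total work grows with both the number of region pairs and the number of windows.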
Effect of luting agents on the tensile bond strength of glass fiber posts: An in vitro study.

PubMed

Aleisa, Khalil; Al-Dwairi, Ziad N; Alghabban, Rawda; Goodacre, Charles J

2013-09-01

Fiber posts can fail because of loss of retention, and it is unknown which luting agent provides the highest bond strength. The purpose of this study was to investigate the tensile bond strength of glass fiber posts luted to premolar teeth with 6 resin composite luting agents. Ninety-six single-rooted extracted human mandibular premolars were sectioned 2 mm coronal to the most incisal point of the cementoenamel junction. Root canals were instrumented and obturated with laterally condensed gutta percha and root canal sealer (AH26). Gutta percha was removed from the canals to a depth of 8 mm and post spaces with a 1.5 mm diameter were prepared. The specimens were divided into the following 6 groups according to the luting agent used (n=16): Group V, Variolink II; Group A, RelyX ARC; Group N, Multilink N; Group U, RelyX Unicem; Group P, ParaCore; Group F, MultiCore Flow. Each specimen was secured in a universal testing machine and a separating load was applied at a rate of 0.5 mm/min. The forces required to dislodge the posts were recorded. A 1-way analysis of variance (ANOVA) was applied to the mean retentive strengths of various cement materials (α=.05). Significant differences were recorded among the 6 cement types (P<.001). Three materials provided statistically equivalent mean bond strengths (RelyX Unicem, ParaCore, and MultiCore Flow) that were significantly greater than for the other 3 materials. Fiber posts luted with RelyX Unicem, ParaCore, and MultiCore Flow demonstrated significantly higher bond strengths. Copyright © 2013 The Editorial Council of the Journal of Prosthetic Dentistry. Published by Mosby, Inc. All rights reserved.
Large Spatial Scale Ground Displacement Mapping through the P-SBAS Processing of Sentinel-1 Data on a Cloud Computing Environment

NASA Astrophysics Data System (ADS)

Casu, F.; Bonano, M.; de Luca, C.; Lanari, R.; Manunta, M.; Manzo, M.; Zinno, I.

2017-12-01

Since its launch in 2014, the Sentinel-1 (S1) constellation has played a key role in SAR data availability and dissemination all over the world. Indeed, the free and open access data policy adopted by the European Copernicus program, together with the global coverage acquisition strategy, makes the Sentinel constellation a game changer in the Earth Observation scenario. As SAR data become ubiquitous, the technological and scientific challenge is focused on maximizing the exploitation of such a huge data flow. In this direction, the use of innovative processing algorithms and distributed computing infrastructures, such as Cloud Computing platforms, can play a crucial role. In this work we present a Cloud Computing solution for the advanced interferometric (DInSAR) processing chain based on the Parallel SBAS (P-SBAS) approach, aimed at processing S1 Interferometric Wide Swath (IWS) data for the generation of large spatial scale deformation time series in an efficient, automatic and systematic way. Such a DInSAR chain ingests Sentinel-1 SLC images and carries out several processing steps, to finally compute deformation time series and mean deformation velocity maps. Different parallel strategies have been designed ad hoc for each processing step of the P-SBAS S1 chain, encompassing both multi-core and multi-node programming techniques, in order to maximize the computational efficiency achieved within a Cloud Computing environment and cut down the relevant processing times. The presented P-SBAS S1 processing chain has been implemented on the Amazon Web Services platform and a thorough analysis of the attained parallel performance has been carried out to identify and overcome the major bottlenecks to scalability. The presented approach is used to perform national-scale DInSAR analyses over Italy, involving the processing of more than 3000 S1 IWS images acquired from both ascending and descending orbits. Such an experiment confirms the big advantage of exploiting the large computational and storage resources of Cloud Computing platforms for large scale DInSAR analysis. The presented Cloud Computing P-SBAS processing chain can be a precious tool in the perspective of developing operational services available to the EO scientific community related to hazard monitoring and risk prevention and mitigation.
48 CFR 3.1104 - Mitigation or waiver.

Code of Federal Regulations, 2014 CFR

2014-10-01

... 48 Federal Acquisition Regulations System 1 2014-10-01 2014-10-01 false Mitigation or waiver. 3... for Contractor Employees Performing Acquisition Functions 3.1104 Mitigation or waiver. (a) In... impose conditions that provide mitigation of a personal conflict of interest or grant a waiver. (c) This...
48 CFR 3.1104 - Mitigation or waiver.

Code of Federal Regulations, 2012 CFR

2012-10-01

... 48 Federal Acquisition Regulations System 1 2012-10-01 2012-10-01 false Mitigation or waiver. 3... for Contractor Employees Performing Acquisition Functions 3.1104 Mitigation or waiver. (a) In... impose conditions that provide mitigation of a personal conflict of interest or grant a waiver. (c) This...

48 CFR 3.1104 - Mitigation or waiver.

Code of Federal Regulations, 2013 CFR

2013-10-01

... 48 Federal Acquisition Regulations System 1 2013-10-01 2013-10-01 false Mitigation or waiver. 3... for Contractor Employees Performing Acquisition Functions 3.1104 Mitigation or waiver. (a) In... impose conditions that provide mitigation of a personal conflict of interest or grant a waiver. (c) This...
Integrated cladding-pumped multicore few-mode erbium-doped fibre amplifier for space-division-multiplexed communications

NASA Astrophysics Data System (ADS)

Chen, H.; Jin, C.; Huang, B.; Fontaine, N. K.; Ryf, R.; Shang, K.; Grégoire, N.; Morency, S.; Essiambre, R.-J.; Li, G.; Messaddeq, Y.; Larochelle, S.

2016-08-01

Space-division multiplexing (SDM), whereby multiple spatial channels in multimode and multicore optical fibres are used to increase the total transmission capacity per fibre, is being investigated to avert a data capacity crunch and reduce the cost per transmitted bit. With the number of channels employed in SDM transmission experiments continuing to rise, there is a requirement for integrated SDM components that are scalable. Here, we demonstrate a cladding-pumped SDM erbium-doped fibre amplifier (EDFA) that consists of six uncoupled multimode erbium-doped cores. Each core supports three spatial modes, which enables the EDFA to amplify a total of 18 spatial channels (six cores × three modes) simultaneously with a single pump diode and a complexity similar to a single-mode EDFA. The amplifier delivers >20 dBm total output power per core and <7 dB noise figure over the C-band. This cladding-pumped EDFA enables combined space-division and wavelength-division multiplexed transmission over multiple multimode fibre spans.
IGA-ADS: Isogeometric analysis FEM using ADS solver

NASA Astrophysics Data System (ADS)

Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

2017-08-01

In this paper we present a fast explicit solver for the solution of non-stationary problems using L2 projections with the isogeometric finite element method. The solver has been implemented within the GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, the implementation of three exemplary PDEs, and the execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on a Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
Magnetophoretic separation ICP-MS immunoassay using Cs-doped multicore magnetic nanoparticles for the determination of salmonella typhimurium.

PubMed

Jeong, Arong; Lim, H B

2018-02-01

In this work, a magnetophoretic separation ICP-MS immunoassay using newly synthesized multicore magnetic nanoparticles (MMNPs) was developed for the determination of salmonella typhimurium (typhi). The uniqueness of this method was the use of MMNPs doped with Cs for both separation and detection, which enabled us to achieve fast analysis, high sensitivity, and good reliability. For demonstration, heat-killed typhi in a phosphate buffer solution was determined by ICP-MS after the MMNP-typhi reaction product was separated from unreacted MMNPs in a micropipette tip filled with 25% polyethylene glycol through magnetophoretic separation. The calibration curve obtained by plotting ¹³³Cs intensity vs. the number of synthetic standard showed a coefficient of determination (R²) of 0.94 with a limit of detection (LOD) of 102 cells/mL without cell culturing. Excellent recoveries, between 98-100%, were obtained from four replicates and compared with a sandwich-type ICP-MS immunoassay for further confirmation. Copyright © 2017 Elsevier B.V. All rights reserved.
Contamination of arctic Fjord sediments by Pb-Zn mining at Maarmorilik in central West Greenland.

PubMed

Perner, K; Leipe, Th; Dellwig, O; Kuijpers, A; Mikkelsen, N; Andersen, T J; Harff, J

2010-07-01

This study focuses on heavy metal contamination of arctic sediments from a small Fjord system adjacent to the Pb-Zn "Black Angel" mine (West Greenland) to investigate the temporal and spatial development of contamination and to provide baseline levels before the mine's re-opening in January 2009. For this purpose we collected multi-cores along a transect from Affarlikassaa Fjord, which received high amounts of tailings from 1973 to 1990, to the mouth of Qaumarujuk Fjord. Along with radiochemical dating by ²¹⁰Pb and ¹³⁷Cs, geochemical analyses of heavy metals (e.g. As, Cd, Hg, Pb, and Zn) were carried out. Maximum contents were found at 12 cm depth in Affarlikassaa. Seventeen years after the mine last closed, specific local hydrographic conditions continue to disperse heavy metal enriched material derived from the Affarlikassaa into Qaumarujuk. Total Hg profiles from multi-cores along the transect clearly illustrate this transport and spatial distribution pattern of the contaminated material. Copyright 2010 Elsevier Ltd. All rights reserved.
Shape Sensing Using a Multi-Core Optical Fiber Having an Arbitrary Initial Shape in the Presence of Extrinsic Forces

NASA Technical Reports Server (NTRS)

Rogge, Matthew D. (Inventor); Moore, Jason P. (Inventor)

2014-01-01

Shape of a multi-core optical fiber is determined by positioning the fiber in an arbitrary initial shape and measuring strain over the fiber's length using strain sensors. A three-coordinate p-vector is defined for each core as a function of the distance of the corresponding cores from a center point of the fiber and a bending angle of the cores. The method includes calculating, via a controller, an applied strain value of the fiber using the p-vector and the measured strain for each core, and calculating strain due to bending as a function of the measured and the applied strain values. Additionally, an apparent local curvature vector is defined for each core as a function of the calculated strain due to bending. Curvature and bend direction are calculated using the apparent local curvature vector, and fiber shape is determined via the controller using the calculated curvature and bend direction.
Can magneto-plasmonic nanohybrids efficiently combine photothermia with magnetic hyperthermia?

NASA Astrophysics Data System (ADS)

Espinosa, Ana; Bugnet, Mathieu; Radtke, Guillaume; Neveu, Sophie; Botton, Gianluigi A.; Wilhelm, Claire; Abou-Hassan, Ali

2015-11-01

Multifunctional hybrid-design nanomaterials appear to be a promising route to meet the current therapeutics needs required for efficient cancer treatment. Herein, two efficient heat nano-generators were combined into a multifunctional single nanohybrid (a multi-core iron oxide nanoparticle optimized for magnetic hyperthermia, and a gold branched shell with tunable plasmonic properties in the NIR region, for photothermal therapy) which impressively enhanced heat generation, in suspension or in vivo in tumours, opening up exciting new therapeutic perspectives. Electronic supplementary information (ESI) available. See DOI: 10.1039/c5nr06168g
Methodology, Methods, and Metrics for Testing and Evaluating Augmented Cognition Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Greitzer, Frank L.

The augmented cognition research community seeks cognitive neuroscience-based solutions to improve warfighter performance by applying and managing mitigation strategies to reduce workload and improve the throughput and quality of decisions. The focus of augmented cognition mitigation research is to define, demonstrate, and exploit neuroscience and behavioral measures that support inferences about the warfighter’s cognitive state that prescribe the nature and timing of mitigation. A research challenge is to develop valid evaluation methodologies, metrics and measures to assess the impact of augmented cognition mitigations. Two considerations are external validity, which is the extent to which the results apply to operational contexts; and internal validity, which reflects the reliability of performance measures and the conclusions based on analysis of results. The scientific rigor of the research methodology employed in conducting empirical investigations largely affects the validity of the findings. External validity requirements also compel us to demonstrate operational significance of mitigations. Thus it is important to demonstrate effectiveness of mitigations under specific conditions. This chapter reviews some cognitive science and methodological considerations in designing augmented cognition research studies and associated human performance metrics and analysis methods to assess the impact of augmented cognition mitigations.
Software for Brain Network Simulations: A Comparative Study

PubMed Central

Tikidji-Hamburyan, Ruben A.; Narayana, Vikram; Bozkus, Zeki; El-Ghazawi, Tarek A.

2017-01-01

Numerical simulations of brain networks are a critical part of our efforts in understanding brain functions under pathological and normal conditions. For several decades, the community has developed many software packages and simulators to accelerate research in computational neuroscience. In this article, we select the three most popular simulators, as determined by the number of models in the ModelDB database, such as NEURON, GENESIS, and BRIAN, and perform an independent evaluation of these simulators. In addition, we study NEST, one of the lead simulators of the Human Brain Project. First, we study them based on one of the most important characteristics, the range of supported models. Our investigation reveals that brain network simulators may be biased toward supporting a specific set of models. However, all simulators tend to expand the supported range of models by providing a universal environment for the computational study of individual neurons and brain networks. Next, our investigations on the characteristics of computational architecture and efficiency indicate that all simulators compile the most computationally intensive procedures into binary code, with the aim of maximizing their computational performance. However, not all simulators provide the simplest method for module development and/or guarantee efficient binary code. Third, a study of their amenability for high-performance computing reveals that NEST can almost transparently map an existing model on a cluster or multicore computer, while NEURON requires code modification if the model developed for a single computer has to be mapped on a computational cluster. Interestingly, parallelization is the weakest characteristic of BRIAN, which provides no support for cluster computations and limited support for multicore computers. Fourth, we identify the level of user support and frequency of usage for all simulators. 
Finally, we carry out an evaluation using two case studies: a large network with simplified neural and synaptic models and a small network with detailed models. These two case studies allow us to avoid any bias toward a particular software package. The results indicate that BRIAN provides the most concise language for both cases considered. Furthermore, as expected, NEST mostly favors large network models, while NEURON is better suited for detailed models. Overall, the case studies reinforce our general observation that simulators have a bias in the computational performance toward specific types of the brain network models. PMID:28775687
A 60 GOPS/W, -1.8 V to 0.9 V body bias ULP cluster in 28 nm UTBB FD-SOI technology

NASA Astrophysics Data System (ADS)

Rossi, Davide; Pullini, Antonio; Loi, Igor; Gautschi, Michael; Gürkaynak, Frank K.; Bartolini, Andrea; Flatresse, Philippe; Benini, Luca

2016-03-01

Ultra-low power operation and extreme energy efficiency are strong requirements for a number of high-growth application areas, such as E-health, Internet of Things, and wearable Human-Computer Interfaces. A promising approach to achieve up to one order of magnitude of improvement in energy efficiency over current generation of integrated circuits is near-threshold computing. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all performance-constrained applications. Thread-level parallelism over multiple cores can be used to overcome the performance degradation at low voltage. Moreover, enabling the processors to operate on-demand and over a wide supply voltage and body bias ranges allows to achieve the best possible energy efficiency while satisfying a large spectrum of computational demands. In this work we present the first ever implementation of a 4-core cluster fabricated using conventional-well 28 nm UTBB FD-SOI technology. The multi-core architecture we present in this work is able to operate on a wide range of supply voltages starting from 0.44 V to 1.2 V. In addition, the architecture allows a wide range of body bias to be applied from -1.8 V to 0.9 V. The peak energy efficiency 60 GOPS/W is achieved at 0.5 V supply voltage and 0.5 V forward body bias. Thanks to the extended body bias range of conventional-well FD-SOI technology, high energy efficiency can be guaranteed for a wide range of process and environmental conditions. We demonstrate the ability to compensate for up to 99.7% of chips for process variation with only ±0.2 V of body biasing, and compensate temperature variation in the range -40 °C to 120 °C exploiting -1.1 V to 0.8 V body biasing. When compared to leading-edge near-threshold RISC processors optimized for extremely low power applications, the multi-core architecture we propose has 144× more performance at comparable energy efficiency levels. 
Even when compared to other low-power processors with comparable performance, including those implemented in 28 nm technology, our platform provides 1.4× to 3.7× better energy efficiency.
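The efficiency figure quoted above can be turned into an energy-per-operation number. The short sketch below is only a unit conversion, not something taken from the paper: it relies on the fact that 1 GOPS/W equals 10^9 operations per joule, and the function name `picojoules_per_op` is a hypothetical helper.

```python
def picojoules_per_op(gops_per_watt):
    """Convert an efficiency in GOPS/W to energy per operation in picojoules.

    GOPS/W is (1e9 operations per second) per (joule per second),
    so it is the same as 1e9 operations per joule.
    """
    ops_per_joule = gops_per_watt * 1e9
    return 1e12 / ops_per_joule  # 1 J = 1e12 pJ


# Peak efficiency reported in the abstract: 60 GOPS/W
peak_pj = picojoules_per_op(60.0)
print(f"{peak_pj:.2f} pJ per operation")
```

At the reported peak this works out to roughly 16.7 pJ per operation.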
Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

NASA Astrophysics Data System (ADS)

Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

2018-05-01

Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. 
The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
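The divergence effect described in this abstract can be illustrated with a toy model. The sketch below is not the authors' code: it assumes that the ODEs in one vector group advance in lockstep until the slowest lane has finished, so lanes that converge early sit idle. The function name `lane_efficiency` and the step counts are hypothetical.

```python
def lane_efficiency(steps_per_ode, width):
    """Fraction of issued lane-steps that do useful work.

    ODEs are packed into groups of `width` lanes; each group runs for as
    many steps as its slowest member, and idle or empty lanes are wasted.
    """
    useful = 0
    issued = 0
    for start in range(0, len(steps_per_ode), width):
        group = steps_per_ode[start:start + width]
        useful += sum(group)
        issued += max(group) * width
    return useful / issued


# Hypothetical adaptive step counts: one stiff cell needs many more steps.
steps = [10, 12, 11, 40, 10, 10, 13, 11]
narrow = lane_efficiency(steps, 4)  # narrower vector unit
wide = lane_efficiency(steps, 8)    # wider vector unit
print(f"width 4: {narrow:.2f}, width 8: {wide:.2f}")
```

In this toy case the single slow ODE stalls four lanes at width 4 but all eight lanes at width 8, which is the qualitative reason a wider vector unit can lose more work to divergence.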
Compensatory Mitigation Rule Final Environmental Assessment

EPA Pesticide Factsheets

EA performed to determine the costs resulting from implementation of the Compensatory Mitigation Rule and the extent to which the rule changes aggregate mitigation costs borne by permittees and Corps administrative burdens and associated costs.
Roofline model toolkit: A practical tool for architectural and program analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian

We present preliminary results of the Roofline Toolkit for multicore, many core, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
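For readers unfamiliar with the Roofline model mentioned in this abstract, its basic bound is that attainable performance is the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The sketch below shows only that textbook formula; it is not the toolkit's code, and the machine numbers are hypothetical rather than Blue Gene/Q specifications.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity.

    intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine: 200 GFLOP/s peak, 40 GB/s memory bandwidth.
PEAK = 200.0
BANDWIDTH = 40.0

memory_bound = roofline(PEAK, BANDWIDTH, 0.25)  # low-intensity, stream-like kernel
compute_bound = roofline(PEAK, BANDWIDTH, 8.0)  # high-intensity, dense kernel
ridge_point = PEAK / BANDWIDTH                  # intensity where the two limits meet

print(memory_bound, compute_bound, ridge_point)
```

Kernels to the left of the ridge point are limited by bandwidth, and kernels to the right of it are limited by peak compute.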
Study of Solid State Drives performance in PROOF distributed analysis system

NASA Astrophysics Data System (ADS)

Panitkin, S. Y.; Ernst, M.; Petkus, R.; Rind, O.; Wenaus, T.

2010-04-01

Solid State Drives (SSDs) are a promising storage technology for High Energy Physics parallel analysis farms. Their combination of low random access time and relatively high read speed is very well suited for situations where multiple jobs concurrently access data located on the same drive. SSDs also have lower energy consumption and higher vibration tolerance than Hard Disk Drives (HDDs), which makes them an attractive choice in many applications ranging from personal laptops to large analysis farms. The Parallel ROOT Facility (PROOF) is a distributed analysis system that exploits the inherent event-level parallelism of high energy physics data. PROOF is especially efficient together with distributed local storage systems like Xrootd, when data are distributed over computing nodes. In such an architecture the local disk subsystem I/O performance becomes a critical factor, especially when computing nodes use multi-core CPUs. We will discuss our experience with SSDs in the PROOF environment. We will compare the performance of HDDs with SSDs in I/O intensive analysis scenarios. In particular we will discuss how PROOF system performance scales with the number of simultaneously running analysis jobs.
Experimental evaluation of the impact of packet capturing tools for web services.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Choe, Yung Ryn; Mohapatra, Prasant; Chuah, Chen-Nee

Network measurement is a discipline that provides the techniques to collect data that are fundamental to many branches of computer science. While many capturing tools and comparisons have been made available in the literature and elsewhere, the impact of these packet capturing tools on existing processes has not been thoroughly studied. While not a concern for collection methods in which dedicated servers are used, many usage scenarios of packet capturing now require the packet capturing tool to run concurrently with operational processes. In this work we perform experimental evaluations of the performance impact that packet capturing processes have on web-based services; in particular, we observe the impact on web servers. We find that packet capturing processes indeed impact the performance of web servers, but on a multi-core system the impact varies depending on whether the packet capturing and web hosting processes are co-located or not. In addition, the architecture and behavior of the web server and process scheduling are coupled with the behavior of the packet capturing process, which in turn also affects the web server's performance.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

NASA Astrophysics Data System (ADS)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.

2018-01-01

We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
Encapsulating model complexity and landscape-scale analyses of state-and-transition simulation models: an application of ecoinformatics and juniper encroachment in sagebrush steppe ecosystems

USGS Publications Warehouse

O'Donnell, Michael

2015-01-01

State-and-transition simulation modeling relies on knowledge of vegetation composition and structure (states) that describe community conditions, mechanistic feedbacks such as fire that can affect vegetation establishment, and ecological processes that drive community conditions as well as the transitions between these states. However, as the need for modeling larger and more complex landscapes increases, a more advanced awareness of computing resources becomes essential. The objectives of this study include identifying challenges of executing state-and-transition simulation models, identifying common bottlenecks of computing resources, developing a workflow and software that enable parallel processing of Monte Carlo simulations, and identifying the advantages and disadvantages of different computing resources. To address these objectives, this study used the ApexRMS® SyncroSim software and embarrassingly parallel tasks of Monte Carlo simulations on a single multicore computer and on distributed computing systems. The results demonstrated that state-and-transition simulation models scale best in distributed computing environments, such as high-throughput and high-performance computing, because these environments disseminate the workloads across many compute nodes, thereby supporting analysis of larger landscapes, higher spatial resolution vegetation products, and more complex models. Using a case study and five different computing environments, the top result (high-throughput computing versus serial computations) indicated an approximate 96.6% decrease of computing time. With a single, multicore compute node (bottom result), the computing time indicated an 81.8% decrease relative to using serial computations. These results provide insight into the tradeoffs of using different computing resources when research necessitates advanced integration of ecoinformatics incorporating large and complicated data inputs and models.
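The percentage decreases reported in this abstract can be restated as speedup factors, which are easier to compare with other records in this list. The sketch below is plain arithmetic on the two quoted figures and assumes both are measured against the same serial baseline; `speedup_from_decrease` is a hypothetical helper name.

```python
def speedup_from_decrease(percent_decrease):
    """Speedup factor implied by a percentage decrease in run time.

    If run time drops by p percent, the remaining time is (1 - p/100)
    of the baseline, so the speedup is the reciprocal of that fraction.
    """
    remaining = 1.0 - percent_decrease / 100.0
    return 1.0 / remaining


htc_speedup = speedup_from_decrease(96.6)        # high-throughput computing
multicore_speedup = speedup_from_decrease(81.8)  # single multicore node

print(f"{htc_speedup:.1f}x, {multicore_speedup:.1f}x")
```

Under that assumption, the 96.6% decrease corresponds to a speedup of roughly 29x and the 81.8% decrease to roughly 5.5x.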
Multicore fibre technology: the road to multimode photonics

NASA Astrophysics Data System (ADS)

Bland-Hawthorn, J.; Min, Seong-Sik; Lindley, Emma; Leon-Saval, Sergio; Ellis, Simon; Lawrence, Jon; Beyrand, Nicolas; Roth, Martin; Löhmannsröben, Hans-Gerd; Veilleux, Sylvain

2016-07-01

For the past forty years, optical fibres have found widespread use in ground-based and space-based instruments. In most applications, these fibres are used in conjunction with conventional optics to transport light. But photonics offers a huge range of optical manipulations beyond light transport that were rarely exploited before 2001. The fundamental obstacle to the broader use of photonics is the difficulty of achieving photonic action in a multimode fibre. The first step towards a general solution was the invention of the photonic lantern [1] in 2004 and the delivery of high-efficiency devices (< 1 dB loss) five years on [2]. Multicore fibres (MCF), used in conjunction with lanterns, are now enabling an even bigger leap towards multimode photonics. Until recently, the single-moded cores in MCFs were not sufficiently uniform to achieve telecom (SMF-28) performance. Now that high-quality MCFs have been realized, we turn our attention to printing complex functions (e.g. Bragg gratings for OH suppression) into their N cores. Our first work in this direction used a Mach-Zehnder interferometer (near-field phase mask) but this approach was only adequate for N=7 MCFs as measured by the grating uniformity [3]. We have now built a Sagnac interferometer that gives a three-fold increase in the depth of field sufficient to print across N >= 127 cores. We achieved first light this year with our 500 mW Sabre FRED laser. These are sophisticated and complex interferometers. We report on our progress to date and summarize our first-year goals which include multimode OH suppression fibres for the Anglo-Australian Telescope/PRAXIS instrument and the Discovery Channel Telescope/MOHSIS instrument under development at the University of Maryland.
Quantum key distribution in multicore fibre for secure radio access networks

NASA Astrophysics Data System (ADS)

Llorente, Roberto; Provot, Antoine; Morant, Maria

2018-01-01

Broadband access in the optical domain usually focuses on providing pervasive, cost-effective, high-bitrate communication in a given area. Nowadays, it is also of utmost interest to be able to provide secure communication to the customers in the area. Wireless access networks rely on the optical domain for both fronthaul and backhaul of the radio access network (C-RAN). Multicore fiber (MCF) has been proposed as a promising candidate for the optical media of choice in next-generation wireless. The capacity demand of next-generation 5G networks makes high-capacity optical solutions, such as space-division multiplexing of different signals over MCF media, attractive. This work addresses secure MCF communication supporting C-RAN architectures. The paper proposes the use of one core in the MCF to transport securely an optical quantum key encoding together with an end-to-end wireless signal transmitted in the remaining cores in radio-over-fiber (RoF). The RoF wireless signals are suitable for radio access fronthaul and backhaul. The theoretical principle and simulation analysis of quantum key distribution (QKD) are presented in this paper. The potential impact of optical RoF transmission crosstalk impairments is assessed experimentally considering different cellular signals on the remaining optical cores in the MCF. The experimental results report fronthaul performance over a four-core optical fiber with RoF transmission of full-standard CDMA signals providing 3.5G services in one core, HSPA+ signals providing 3.9G services in the second core and 3GPP LTE-Advanced signals providing 4G services in the third core, considering that the QKD signal is allocated in the fourth core.
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)

NASA Astrophysics Data System (ADS)

Vincenti, Henri

2016-03-01

The advent of exascale computers will enable 3D simulations of new laser-plasma interaction regimes that were previously out of reach of current petascale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potential of these new computing architectures. Indeed, achieving exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments directly impacting our way of implementing PIC codes. As data movement (from die to network) is by far the most energy-consuming part of an algorithm, future computers will tend to increase memory locality at the hardware level and reduce energy consumption related to data movement by using more and more cores on each compute node ("fat nodes") that will have a reduced clock speed to allow for efficient cooling. To compensate for the frequency decrease, CPU machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPUs also have a reduced clock speed per core and can process Multiple Instructions on Multiple Data (MIMD). At the software level, Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for Multicore/Manycore CPU) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high performance skeleton PIC code PICSAR to achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the pseudo-spectral quasi-cylindrical code FBPIC on GPUs using the Numba Python compiler.

Effects of equivalent series resistance on the noise mitigation performance of piezoelectric shunt damping

NASA Astrophysics Data System (ADS)

Lai, Szu Cheng; Sharifzadeh Mirshekarloo, Meysam; Yao, Kui

2017-05-01

Piezoelectric shunt damping (PSD) utilizes an electrically-shunted piezoelectric damper attached on a panel structure to suppress the transmission of acoustic noise. The paper develops an understanding on the effects of equivalent series resistance (ESR) of the piezoelectric damper in a PSD system on noise mitigation performance, and demonstrates that an increased ESR leads to a significant rise in the noise transmissibility due to reduction in the system’s mechanical damping. It is further demonstrated with experimental results that ESR effects can be compensated in the shunt circuit to significantly improve the noise mitigation performance. A theoretical electrical equivalent model of the PSD incorporating the ESR is established for quantitative analysis of ESR effects on noise mitigation.
Optimizing Irregular Applications for Energy and Performance on the Tilera Many-core Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chavarría-Miranda, Daniel; Panyala, Ajay R.; Halappanavar, Mahantesh

Optimizing applications simultaneously for energy and performance is a complex problem. High performance, parallel, irregular applications are notoriously hard to optimize due to their data-dependent memory accesses, lack of structured locality and complex data structures and code patterns. Irregular kernels are growing in importance in applications such as machine learning, graph analytics and combinatorial scientific computing. Performance- and energy-efficient implementation of these kernels on modern, energy efficient, multicore and many-core platforms is therefore an important and challenging problem. We present results from optimizing two irregular applications, the Louvain method for community detection (Grappolo) and high-performance conjugate gradient (HPCCG), on the Tilera many-core system. We have significantly extended MIT's OpenTuner auto-tuning framework to conduct a detailed study of platform-independent and platform-specific optimizations to improve performance as well as reduce total energy consumption. We explore the optimization design space along three dimensions: memory layout schemes, compiler-based code transformations, and optimization of parallel loop schedules. Using auto-tuning, we demonstrate whole node energy savings of up to 41% relative to a baseline instantiation, and up to 31% relative to manually optimized variants.
High performance 3D adaptive filtering for DSP based portable medical imaging systems

NASA Astrophysics Data System (ADS)

Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark

2015-03-01

Portable medical imaging devices have proven valuable for emergency medical services both in the field and hospital environments and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. Despite their constraints on power, size and cost, portable imaging devices must still deliver high quality images. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but is computationally very demanding and hence often cannot be run with sufficient performance on a portable platform. In recent years, advanced multicore digital signal processors (DSP) have been developed that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms on a portable platform. In this study, the performance of a 3D adaptive filtering algorithm on a DSP is investigated. The performance is assessed by filtering a volume of size 512x256x128 voxels sampled at a pace of 10 MVoxels/sec with an Ultrasound 3D probe. Relative performance and power is addressed between a reference PC (Quad Core CPU) and a TMS320C6678 DSP from Texas Instruments.
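The test configuration in this abstract implies a real-time budget per volume, which can be worked out directly from the quoted numbers. The sketch below is only that arithmetic; it assumes "10 MVoxels/sec" means 10 × 10^6 voxels per second, and `seconds_per_volume` is a hypothetical helper rather than anything from the study.

```python
def seconds_per_volume(dims, voxels_per_second):
    """Time taken to acquire one volume at the given sampling rate."""
    nx, ny, nz = dims
    return (nx * ny * nz) / voxels_per_second


DIMS = (512, 256, 128)  # volume size from the abstract, in voxels
RATE = 10e6             # 10 MVoxels/sec, assumed to mean 1e7 voxels per second

voxels = DIMS[0] * DIMS[1] * DIMS[2]
budget = seconds_per_volume(DIMS, RATE)

print(f"{voxels} voxels, {budget:.3f} s per volume")
```

The volume holds 16,777,216 voxels, so a filter keeping pace with acquisition would need to finish each volume in about 1.68 s.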
Voluntary climate change mitigation actions of young adults: a classification of mitigators through latent class analysis.

PubMed

Korkala, Essi A E; Hugg, Timo T; Jaakkola, Jouni J K

2014-01-01

Encouraging individuals to take action is important for the overall success of climate change mitigation. Campaigns promoting climate change mitigation could address particular groups of the population on the basis of what kind of mitigation actions the group is already taking. To increase the knowledge of such groups performing similar mitigation actions we conducted a population-based cross-sectional study in Finland. The study population comprised 1623 young adults who returned a self-administered questionnaire (response rate 64%). Our aims were to identify groups of people engaged in similar climate change mitigation actions and to study the gender differences in the grouping. We also determined if socio-demographic characteristics can predict group membership. We performed latent class analysis using 14 mitigation actions as manifest variables. Three classes were identified among men: the Inactive (26%), the Semi-active (63%) and the Active (11%) and two classes among women: the Semi-active (72%) and the Active (28%). The Active among both genders were likely to have mitigated climate change through several actions, such as recycling, using environmentally friendly products, preferring public transport, and conserving energy. The Semi-Active had most probably recycled and preferred public transport because of climate change. The Inactive, a class identified among men only, had very probably done nothing to mitigate climate change. Among males, being single or divorced predicted little involvement in climate change mitigation. Among females, those without tertiary degree and those with annual income €≥16801 were less involved in climate change mitigation. Our results illustrate to what extent young adults are engaged in climate change mitigation, which factors predict little involvement in mitigation and give insight to which segments of the public could be the audiences of targeted mitigation campaigns.
Voluntary Climate Change Mitigation Actions of Young Adults: A Classification of Mitigators through Latent Class Analysis

PubMed Central

Korkala, Essi A. E.; Hugg, Timo T.; Jaakkola, Jouni J. K.

2014-01-01

Encouraging individuals to take action is important for the overall success of climate change mitigation. Campaigns promoting climate change mitigation could address particular groups of the population on the basis of what kind of mitigation actions the group is already taking. To increase the knowledge of such groups performing similar mitigation actions we conducted a population-based cross-sectional study in Finland. The study population comprised 1623 young adults who returned a self-administered questionnaire (response rate 64%). Our aims were to identify groups of people engaged in similar climate change mitigation actions and to study the gender differences in the grouping. We also determined if socio-demographic characteristics can predict group membership. We performed latent class analysis using 14 mitigation actions as manifest variables. Three classes were identified among men: the Inactive (26%), the Semi-active (63%) and the Active (11%) and two classes among women: the Semi-active (72%) and the Active (28%). The Active among both genders were likely to have mitigated climate change through several actions, such as recycling, using environmentally friendly products, preferring public transport, and conserving energy. The Semi-Active had most probably recycled and preferred public transport because of climate change. The Inactive, a class identified among men only, had very probably done nothing to mitigate climate change. Among males, being single or divorced predicted little involvement in climate change mitigation. Among females, those without tertiary degree and those with annual income €≥16801 were less involved in climate change mitigation. Our results illustrate to what extent young adults are engaged in climate change mitigation, which factors predict little involvement in mitigation and give insight to which segments of the public could be the audiences of targeted mitigation campaigns. PMID:25054549
40 CFR 230.95 - Ecological performance standards.

Code of Federal Regulations, 2012 CFR

2012-07-01

... mitigation plan must contain performance standards that will be used to assess whether the project is... mitigation project, so that the project can be objectively evaluated to determine if it is developing into... verifiable. Ecological performance standards must be based on the best available science that can be measured...
40 CFR 230.95 - Ecological performance standards.

Code of Federal Regulations, 2014 CFR

2014-07-01

... mitigation plan must contain performance standards that will be used to assess whether the project is... mitigation project, so that the project can be objectively evaluated to determine if it is developing into... verifiable. Ecological performance standards must be based on the best available science that can be measured...
40 CFR 230.95 - Ecological performance standards.

Code of Federal Regulations, 2013 CFR

2013-07-01

... mitigation plan must contain performance standards that will be used to assess whether the project is... mitigation project, so that the project can be objectively evaluated to determine if it is developing into... verifiable. Ecological performance standards must be based on the best available science that can be measured...
ASSESSMENT PROTOCOLS - DURABILITY OF PERFORMANCE OF A HOME RADON REDUCTION SYSTEM FOR SUB-SLAB DEPRESSURIZATION SYSTEMS

EPA Science Inventory

This handbook contains protocols that compare the immediate performance of subslab depressurization (SSD) mitigation system with performance months or years later. These protocols provide a methodology to test SSD radon mitigation systems in situ to determine long-term performanc...
40 CFR 230.95 - Ecological performance standards.

Code of Federal Regulations, 2011 CFR

2011-07-01

... mitigation plan must contain performance standards that will be used to assess whether the project is... mitigation project, so that the project can be objectively evaluated to determine if it is developing into... verifiable. Ecological performance standards must be based on the best available science that can be measured...
STAMPS: Software Tool for Automated MRI Post-processing on a supercomputer.

PubMed

Bigler, Don C; Aksu, Yaman; Miller, David J; Yang, Qing X

2009-08-01

This paper describes a Software Tool for Automated MRI Post-processing (STAMP) of multiple types of brain MRIs on a workstation and for parallel processing on a supercomputer (STAMPS). This software tool enables the automation of nonlinear registration for a large image set and for multiple MR image types. The tool uses standard brain MRI post-processing tools (such as SPM, FSL, and HAMMER) for multiple MR image types in a pipeline fashion. It also contains novel MRI post-processing features. The STAMP image outputs can be used to perform brain analysis using Statistical Parametric Mapping (SPM) or single-/multi-image modality brain analysis using Support Vector Machines (SVMs). Since STAMPS is PBS-based, the supercomputer may be a multi-node computer cluster or one of the latest multi-core computers.
High-performance dynamic quantum clustering on graphics processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wittek, Peter, E-mail: peterwittek@acm.org

2013-01-15

Clustering methods in machine learning may benefit from borrowing metaphors from physics. Dynamic quantum clustering associates a Gaussian wave packet with the multidimensional data points and regards them as eigenfunctions of the Schroedinger equation. The clustering structure emerges by letting the system evolve and the visual nature of the algorithm has been shown to be useful in a range of applications. Furthermore, the method only uses matrix operations, which readily lend themselves to parallelization. In this paper, we develop an implementation on graphics hardware and investigate how this approach can accelerate the computations. We achieve a speedup of up to two magnitudes over a multicore CPU implementation, which proves that quantum-like methods and acceleration by graphics processing units have a great relevance to machine learning.
Greenhouse Gas Mitigation Options Database (GMOD) and Tool

EPA Science Inventory

Greenhouse Gas Mitigation Options Database (GMOD) is a decision support database and tool that provides cost and performance information for GHG mitigation options for the power, cement, refinery, landfill and pulp and paper sectors. The GMOD includes approximately 450 studies fo...
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS

PubMed Central

Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.

2014-01-01

Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
Symbolic Analysis of Concurrent Programs with Polymorphism

NASA Technical Reports Server (NTRS)

Rungta, Neha Shyam

2010-01-01

The current trend of multi-core and multi-processor computing is causing a paradigm shift from inherently sequential to highly concurrent and parallel applications. Certain thread interleavings, data input values, or combinations of both often cause errors in the system. Systematic verification techniques such as explicit state model checking and symbolic execution are extensively used to detect errors in such systems [7, 9]. Explicit state model checking enumerates possible thread schedules and input data values of a program in order to check for errors [3, 9]. To partially mitigate the state space explosion from data input values, symbolic execution techniques substitute data input values with symbolic values [5, 7, 6]. Explicit state model checking and symbolic execution techniques used in conjunction with exhaustive search techniques such as depth-first search are unable to detect errors in medium to large-sized concurrent programs because the number of behaviors caused by data and thread non-determinism is extremely large. We present an overview of abstraction-guided symbolic execution for concurrent programs that detects errors manifested by a combination of thread schedules and data values [8]. The technique generates a set of key program locations relevant in testing the reachability of the target locations. The symbolic execution is then guided along these locations in an attempt to generate a feasible execution path to the error state. This allows the execution to focus on parts of the behavior space more likely to contain an error.
Evaluation of low impact development approach for mitigating flood inundation at a watershed scale in China.

PubMed

Hu, Maochuan; Sayama, Takahiro; Zhang, Xingqi; Tanaka, Kenji; Takara, Kaoru; Yang, Hong

2017-05-15

Low impact development (LID) has attracted growing attention as an important approach for urban flood mitigation. Most studies evaluating LID performance for mitigating floods focus on the changes of peak flow and runoff volume. This paper assessed the performance of LID practices for mitigating flood inundation hazards as retrofitting technologies in an urbanized watershed in Nanjing, China. The findings indicate that LID practices are effective for flood inundation mitigation at the watershed scale, and especially for reducing inundated areas with a high flood hazard risk. Various scenarios of LID implementation levels can reduce total inundated areas by 2%-17% and areas with a high flood hazard level by 6%-80%. Permeable pavement shows better performance than rainwater harvesting in mitigating urban waterlogging. The most efficient scenario combines rainwater harvesting on rooftops, with a cistern capacity of 78.5 mm, and permeable pavement installed on 75% of non-busy roads and other impervious surfaces. Inundation modeling is an effective approach to obtaining the information necessary to guide decision-making for designing LID practices at watershed scales. Copyright © 2017 Elsevier Ltd. All rights reserved.
Manycore Performance-Portability: Kokkos Multidimensional Array Library

DOE PAGES

Edwards, H. Carter; Sunderland, Daniel; Porter, Vicki; ...

2012-01-01

Large, complex scientific and engineering application codes have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel kernels and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can be different for different manycore devices – potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
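The idea of separating the data access pattern from the kernel can be illustrated with a small sketch. This is not the Kokkos API (Kokkos is a C++ library that selects the mapping at compile time); the class `Array2D`, the layout names "right" and "left", and the kernel `scale_rows` are hypothetical names chosen for illustration. The kernel is written once against logical indices, and only the index-to-offset mapping changes.

```python
class Array2D:
    """Hypothetical 2D array whose memory layout is chosen at construction."""

    def __init__(self, rows, cols, layout="right"):
        if layout not in ("right", "left"):
            raise ValueError("layout must be 'right' or 'left'")
        self.rows, self.cols, self.layout = rows, cols, layout
        self.data = [0.0] * (rows * cols)  # flat storage

    def _offset(self, i, j):
        # "right": last index is contiguous (row-major).
        # "left": first index is contiguous (column-major).
        if self.layout == "right":
            return i * self.cols + j
        return j * self.rows + i

    def __getitem__(self, ij):
        return self.data[self._offset(*ij)]

    def __setitem__(self, ij, value):
        self.data[self._offset(*ij)] = value


def scale_rows(a, factors):
    """Kernel written against logical indices only; it never sees the layout."""
    for i in range(a.rows):
        for j in range(a.cols):
            a[i, j] = a[i, j] * factors[i]


def filled(layout):
    a = Array2D(2, 3, layout)
    for i in range(2):
        for j in range(3):
            a[i, j] = 10.0 * i + j
    return a


right, left = filled("right"), filled("left")
scale_rows(right, [1.0, 2.0])
scale_rows(left, [1.0, 2.0])
```

Both arrays hold the same logical values after the kernel runs, while their flat storage order differs, which is the property a device-specific mapping relies on.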
Efficient Aho-Corasick String Matching on Emerging Multicore Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Villa, Oreste; Secchi, Simone

String matching algorithms are critical to several scientific fields. Besides text processing and databases, emerging applications such as DNA protein sequence analysis, data mining, information security software, antivirus, machine learning, all exploit string matching algorithms [3]. All these applications usually process large quantities of textual data, require high performance and/or predictable execution times. Among all the string matching algorithms, one of the most studied, especially for text processing and security applications, is the Aho-Corasick algorithm. Aho-Corasick is an exact, multi-pattern string matching algorithm which performs the search in a time linearly proportional to the length of the input text independently from pattern set size. However, depending on the implementation, when the number of patterns increases, the memory occupation may rise drastically. In turn, this can lead to significant variability in the performance, due to the memory access times and the caching effects. This is a significant concern for many mission critical applications and modern high performance architectures. For example, security applications such as Network Intrusion Detection Systems (NIDS), must be able to scan network traffic against very large dictionaries in real time. Modern Ethernet links reach up to 10 Gbps, and malicious threats are already well over 1 million, and exponentially growing [28]. When performing the search, a NIDS should not slow down the network, or let network packets pass unchecked. Nevertheless, on the current state-of-the-art cache based processors, there may be a large performance variability when dealing with big dictionaries and inputs that have different frequencies of matching patterns. In particular, when few patterns are matched and they are all in the cache, the procedure is fast. Instead, when they are not in the cache, often because many patterns are matched and the caches are continuously thrashed, they should be retrieved from the system memory and the procedure is slowed down by the increased latency. Efficient implementations of string matching algorithms have been the focus of several works, targeting Field Programmable Gate Arrays [4, 25, 15, 5], highly multi-threaded solutions like the Cray XMT [34], multicore processors [19] or heterogeneous processors like the Cell Broadband Engine [35, 22]. Recently, several researchers have also started to investigate the use of Graphics Processing Units (GPUs) for string matching algorithms in security applications [20, 10, 32, 33]. Most of these approaches mainly focus on reaching high peak performance, or try to optimize the memory occupation, rather than looking at performance stability. However, hardware solutions support only small dictionary sizes due to lack of memory and are difficult to customize, while platforms such as the Cell/B.E. are very complex to program.
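For readers unfamiliar with the algorithm, a minimal serial sketch of Aho-Corasick follows. It is a textbook-style illustration, not the implementation described in the record; the function names `build_automaton` and `search` are assumptions. It builds a trie of the patterns, adds failure links by breadth-first traversal, and then scans the text once.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal: a state's failure link points to a shallower
    # state, so it is always final before the state's children are handled.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every match in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The dictionary-per-state representation used here is compact but pointer-heavy; it is the kind of structure whose memory footprint and cache behaviour grow with the pattern set, which is the variability the abstract is concerned with.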
Multi-core fiber amplifier arrays for intra-satellite links

NASA Astrophysics Data System (ADS)

Kechagias, Marios; Crabb, Jonathan; Stampoulidis, Leontios; Farzana, Jihan; Kehayas, Efstratios; Filipowicz, Marta; Napierala, Marek; Murawski, Michal; Nasilowski, Tomasz; Barbero, Juan

2017-09-01

In this paper we present erbium doped fibre (EDF) aimed at signal amplification within satellite photonic payload systems operating in C telecommunication band. In such volume-hungry applications, the use of advanced optical transmission techniques such as space division multiplexing (SDM) can be advantageous to reduce the component and cable count.
Exploration and Evaluation of Nanometer Low-power Multi-core VLSI Computer Architectures

DTIC Science & Technology

2015-03-01

ICC, the Milkyway database was created using the command: milkyway -galaxy -nogui -tcl -log memory.log one.tcl As stated previously, it is...EDA tools. Typically, Synopsys® tools use Milkyway databases, whereas Cadence Design Systems® tools use Library Exchange Format (LEF) files. To help

Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

DTIC Science & Technology

2013-02-01

with the bias node (gray) denoted as ww and the weights associated with the remaining first layer nodes (black) denoted as W. In forming the overall...Implementation of RBF network on GPU Platform 3.5.1 The Cholesky decomposition algorithm We need to invert the matrix multiplication G^T G to
A fast CT reconstruction scheme for a general multi-core PC.

PubMed

Zeng, Kai; Bai, Erwei; Wang, Ge

2007-01-01

Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved severalfold using the latest quad-core processors.
Biodegradation mechanisms of iron oxide monocrystalline nanoflowers and tunable shield effect of gold coating.

PubMed

Javed, Yasir; Lartigue, Lénaic; Hugounenq, Pierre; Vuong, Quoc Lam; Gossuin, Yves; Bazzi, Rana; Wilhelm, Claire; Ricolleau, Christian; Gazeau, Florence; Alloyeau, Damien

2014-08-27

Understanding the relation between the structure and the reactivity of nanomaterials in the organism is a crucial step towards efficient and safe biomedical applications. The multi-scale approach reported here allows us to follow the magnetic and structural transformations of multicore maghemite nanoflowers in a medium mimicking the intracellular lysosomal environment. By confronting atomic-scale and macroscopic information on the biodegradation of these complex nanostructures, we can unravel the mechanisms involved in the critical alterations of their hyperthermic power and their Magnetic Resonance imaging T1 and T2 contrast effect. This transformation of multicore nanoparticles with outstanding magnetic properties into poorly magnetic single core clusters highlights the harmful influence of cellular medium on the therapeutic and diagnosis effectiveness of iron oxide-based nanomaterials. As biodegradation occurs through surface reactivity mechanism, we demonstrate that the inert activity of gold nanoshells can be exploited to protect iron oxide nanostructures. Such inorganic nanoshields could be a relevant strategy to modulate the degradability and ultimately the long term fate of nanomaterials in the organism. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Dual-core optical fiber based strain sensor for remote sensing in hard-to-reach areas

NASA Astrophysics Data System (ADS)

Mąkowska, Anna; Szostkiewicz, Łukasz; Kołakowska, Agnieszka; Budnicki, Dawid; Bieńkowska, Beata; Ostrowski, Łukasz; Murawski, Michał; Napierała, Marek; Mergo, Paweł; Nasiłowski, Tomasz

2017-10-01

We present research on optical fiber sensors based on microstructured multi-core fiber. The developed sensor can be used to advantage in hard-to-reach areas because optical fibers can serve both as sensing elements and as the signal delivery path. Using the sensor, it is possible to increase the level of safety in areas at risk of explosion, e.g. in mine-like objects. As a base for the remote strain sensor we use dual-core fibers. Multi-core fibers possess a characteristic parameter called crosstalk, which is a measure of the amount of signal that can pass to the adjacent core. The strain-sensitive area is made by creating a tapered section, in which the level of crosstalk is changed. On this basis, we present a broadened concept of fiber optic sensor design. Strain measurement is possible because the power distribution between the cores of a dual-core fiber changes depending on the applied strain. The principle of operation allows measurements in both the wavelength and power domains.
A Fast CT Reconstruction Scheme for a General Multi-Core PC

PubMed Central

Zeng, Kai; Bai, Erwei; Wang, Ge

2007-01-01

Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved severalfold using the latest quad-core processors. PMID:18256731
High-throughput Bayesian Network Learning using Heterogeneous Multicore Computers

PubMed Central

Linderman, Michael D.; Athalye, Vivek; Meng, Teresa H.; Asadi, Narges Bani; Bruggner, Robert; Nolan, Garry P.

2017-01-01

Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian Networks (BNs), and computationally learned from experimental data. However, learning the structure of Bayesian Networks (BNs) is an NP-hard problem that, even with fast heuristics, is too time consuming for large, clinically important networks (20–50 nodes). In this paper, we present a novel graphics processing unit (GPU)-accelerated implementation of a Monte Carlo Markov Chain-based algorithm for learning BNs that is up to 7.5-fold faster than current general-purpose processor (GPP)-based implementations. The GPU-based implementation is just one of several implementations within the larger application, each optimized for a different input or machine configuration. We describe the methodology we use to build an extensible application, assembled from these variants, that can target a broad range of heterogeneous systems, e.g., GPUs, multicore GPPs. Specifically we show how we use the Merge programming model to efficiently integrate, test and intelligently select among the different potential implementations. PMID:28819655
Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dongarra, Jack J.; Tomov, Stanimire

2014-03-24

The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid Multicore+GPU-based systems, using all the processing power that future high-end systems can make available within given energy constraints. Our efforts at the University of Tennessee achieved the goals set in all of the five areas identified in the proposal: 1. Communication optimal algorithms; 2. Autotuning for GPU and hybrid processors; 3. Scheduling and memory management techniques for heterogeneity and scale; 4. Fault tolerance and robustness for large scale systems; 5. Building energy efficiency into software foundations. The University of Tennessee’s main contributions, as proposed, were the research and software development of new algorithms for hybrid multi/many-core CPUs and GPUs, as related to two-sided factorizations and complete eigenproblem solvers, hybrid BLAS, and energy efficiency for dense, as well as sparse, operations. Furthermore, as proposed, we investigated and experimented with various techniques targeting the five main areas outlined.
Blocked inverted indices for exact clustering of large chemical spaces.

PubMed

Thiel, Philipp; Sach-Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver

2014-09-22

The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
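The basic inverted-index idea behind all-pairs fingerprint similarity can be sketched as follows. This is a plain, unblocked illustration and not the optimized algorithm of the record; the function name `all_pairs_tanimoto`, the `threshold` parameter, and the representation of each fingerprint as a set of on-bit positions are assumptions. The Tanimoto coefficient of two fingerprints A and B is |A ∩ B| / (|A| + |B| - |A ∩ B|).

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_tanimoto(fingerprints, threshold=0.0):
    """fingerprints: list of sets of on-bit positions.

    Returns {(i, j): similarity} for i < j. Pairs that share no bit never
    appear in any posting list and are omitted (their similarity is 0).
    """
    # Inverted index: bit position -> ids of compounds that set this bit.
    index = defaultdict(list)
    for cid, bits in enumerate(fingerprints):
        for b in bits:
            index[b].append(cid)

    # Each posting list contributes one shared bit to every pair it contains.
    common = defaultdict(int)
    for posting in index.values():
        for i, j in combinations(posting, 2):  # ids are ascending, so i < j
            common[(i, j)] += 1

    sims = {}
    for (i, j), c in common.items():
        t = c / (len(fingerprints[i]) + len(fingerprints[j]) - c)
        if t >= threshold:
            sims[(i, j)] = t
    return sims


fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(all_pairs_tanimoto(fps))  # {(0, 1): 0.5}
```

Because the posting lists are independent of one another, the counting step can be split across cores, with each worker accumulating its own partial counts.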
Post-inscription tuning of multicore fiber Bragg gratings

NASA Astrophysics Data System (ADS)

Lindley, Emma Y.; Min, Seong-sik; Leon-Saval, Sergio G.; Bland-Hawthorn, Joss

2016-07-01

Fiber Bragg gratings are used in astronomy for their ability to suppress narrow atmospheric emission lines of temporally varying brightness before the light is dispersed. These gratings can only operate in a single-mode fiber as the suppressed wavelength depends on mode velocity in the core. Recent experiments with fibers containing multiple single-moded cores have demonstrated the potential for inscribing identical gratings across all cores in a single pass. We have already improved the uniformity of gratings in 7-core fibers via modifications to the writing process; further progress can be achieved by tuning the gratings of the outer and inner cores relative to one another. Our eventual goal is to make the entire fiber suppress one wavelength to a depth of 30 dB or greater. By coating the fiber in a heat-conductive material with a high expansion coefficient, we can examine the effects of temperature and strain on the spectral response of each core. In this paper we present methods and results from experiments concerning the post-write tuning of gratings in multicore fibers.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

NASA Astrophysics Data System (ADS)

Lyakh, Dmitry I.

2015-04-01

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
Multicore runup simulation by under water avalanche using two-layer 1D shallow water equations

NASA Astrophysics Data System (ADS)

Bagustara, B. A. R. H.; Simanjuntak, C. A.; Gunawan, P. H.

2018-03-01

Increasing the number of layers in the shallow water equations (SWE) produces a more dynamic model than the one-layer SWE model. The two-layer 1D SWE model has a different density for each layer. This makes the model more dynamic and natural; for instance, in the ocean the density of water decreases from the bottom to the surface. Here, the source-centered hydrostatic reconstruction (SCHR) numerical scheme is used to approximate the solution of the two-layer 1D SWE model, since this scheme has been proved to satisfy the mathematical properties of the shallow water equations. Additionally, in this paper the SCHR algorithm is adapted to the multicore architecture. The simulation of runup caused by an underwater avalanche is elaborated here. The results show that the runup depends on the ratio of the densities of the layers. Moreover, with a grid size of Nx = 8000, the speedup and efficiency obtained with 2 threads are 1.74779 times and 87.3896 %, respectively. With 4 threads and the same grid size of Nx = 8000, the speedup and efficiency are 2.93132 times and 73.2830 %, respectively.
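The reported efficiencies are consistent with the usual definition of parallel efficiency as speedup divided by thread count. A short check, assuming that definition (the function name `parallel_efficiency` is illustrative):

```python
def parallel_efficiency(speedup, threads):
    """Parallel efficiency as a fraction: speedup divided by thread count."""
    return speedup / threads


# Speedups as quoted in the abstract for Nx = 8000.
for threads, speedup in [(2, 1.74779), (4, 2.93132)]:
    eff = parallel_efficiency(speedup, threads)
    print(f"{threads} threads: efficiency = {100 * eff:.4f} %")
```

The 4-thread value reproduces the quoted 73.2830 %. The 2-thread value differs from the quoted 87.3896 % only in the last printed digit, presumably because the quoted speedup is itself rounded.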
permGPU: Using graphics processing units in RNA microarray association studies.

PubMed

Shterev, Ivo D; Jung, Sin-Ho; George, Stephen L; Owzar, Kouros

2010-06-16

Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, that are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
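To make the workload concrete, here is a minimal serial sketch of a two-sample permutation test for a single feature. It is not permGPU code; the function name `permutation_p_value`, the choice of the absolute difference in means as the test statistic, and the parameters `n_perm` and `seed` are assumptions. Every permutation is evaluated independently of the others, which is why the problem is embarrassingly parallel.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the samples
        stat = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if stat >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


group_a = [10, 11, 12, 13, 14, 15]
group_b = [0, 1, 2, 3, 4, 5]
print(permutation_p_value(group_a, group_b))
```

In a microarray study this loop would run once per probe, so the total work is the number of probes times the number of permutations.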
Rapid Calculation of Max-Min Fair Rates for Multi-Commodity Flows in Fat-Tree Networks

DOE PAGES

Mollah, Md Atiqul; Yuan, Xin; Pakin, Scott; ...

2017-08-29

Max-min fairness is often used in the performance modeling of interconnection networks. Existing methods to compute max-min fair rates for multi-commodity flows have high complexity and are computationally infeasible for large networks. In this paper, we show that by considering topological features, this problem can be solved efficiently for the fat-tree topology that is widely used in data centers and high performance compute clusters. Several efficient new algorithms are developed for this problem, including a parallel algorithm that can take advantage of multi-core and shared-memory architectures. Using these algorithms, we demonstrate that it is possible to find the max-min fair rate allocation for multi-commodity flows in fat-tree networks that support tens of thousands of nodes. We evaluate the run-time performance of the proposed algorithms and show improvement in orders of magnitude over the previously best known method. Finally, we further demonstrate a new application of max-min fair rate allocation that is only computationally feasible using our new algorithms.
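As background, max-min fair rates for flows with fixed single paths can be computed with the classic progressive-filling procedure, sketched below. This is the generic textbook method, not the fat-tree-specific algorithms of the record, and it does not cover the multi-path case; the function name `max_min_fair_rates` and the dictionary-based inputs are assumptions. All unfrozen flows are raised at the same pace until some link saturates, and the flows crossing that link are then frozen.

```python
def max_min_fair_rates(capacities, flows):
    """capacities: {link: capacity}; flows: {flow: [links on its path]}.

    Assumes every flow crosses at least one link.
    """
    rates = {f: 0.0 for f in flows}
    remaining = dict(capacities)
    active = set(flows)
    while active:
        # Number of still-active flows on each link.
        load = {}
        for f in active:
            for link in flows[f]:
                load[link] = load.get(link, 0) + 1
        # Largest equal increment that no link's remaining capacity forbids.
        inc = min(remaining[link] / n for link, n in load.items())
        for f in active:
            rates[f] += inc
        for link, n in load.items():
            remaining[link] -= inc * n
        # Freeze flows that cross a saturated link (relative tolerance).
        saturated = {link for link in load
                     if remaining[link] <= 1e-9 * capacities[link]}
        active = {f for f in active
                  if not any(link in saturated for link in flows[f])}
    return rates


caps = {"A": 10.0, "B": 4.0}
paths = {"f1": ["A"], "f2": ["A", "B"], "f3": ["B"]}
print(max_min_fair_rates(caps, paths))
```

In this example link B is the first bottleneck, so f2 and f3 are frozen at 2 each and f1 then takes the remaining 8 units on link A.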
Rapid Calculation of Max-Min Fair Rates for Multi-Commodity Flows in Fat-Tree Networks

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mollah, Md Atiqul; Yuan, Xin; Pakin, Scott

Max-min fairness is often used in the performance modeling of interconnection networks. Existing methods to compute max-min fair rates for multi-commodity flows have high complexity and are computationally infeasible for large networks. In this paper, we show that by considering topological features, this problem can be solved efficiently for the fat-tree topology that is widely used in data centers and high performance compute clusters. Several efficient new algorithms are developed for this problem, including a parallel algorithm that can take advantage of multi-core and shared-memory architectures. Using these algorithms, we demonstrate that it is possible to find the max-min fair rate allocation for multi-commodity flows in fat-tree networks that support tens of thousands of nodes. We evaluate the run-time performance of the proposed algorithms and show improvement in orders of magnitude over the previously best known method. Finally, we further demonstrate a new application of max-min fair rate allocation that is only computationally feasible using our new algorithms.
Multicore Architecture-aware Scientific Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Srinivasa, Avinash

Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations, or at the system level, like having strategies in place to override some of the default operating system policies, the main objective being to improve computational performance of the application. The current work proposes two such capabilities with respect to multi-threaded scientific applications, in particular a large scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to the changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilities, when included, were found to have a significant impact on the application performance, resulting in average speedups of as much as two to four times.
High-performance computing for airborne applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Quinn, Heather M; Manuzzato, Andrea; Fairbanks, Tom

2010-06-28

Recently, there have been attempts to move common satellite tasks to unmanned aerial vehicles (UAVs). UAVs are significantly cheaper to buy than satellites and easier to deploy on an as-needed basis. The more benign radiation environment also allows for an aggressive adoption of state-of-the-art commercial computational devices, which increases the amount of data that can be collected. There are a number of commercial computing devices currently available that are well-suited to high-performance computing. These devices range from specialized computational devices, such as field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), to traditional computing platforms, such as microprocessors. Even though the radiation environment is relatively benign, these devices could be susceptible to single-event effects. In this paper, we will present radiation data for high-performance computing devices in an accelerated neutron environment. These devices include a multi-core digital signal processor, two field-programmable gate arrays, and a microprocessor. From these results, we found that all of these devices are suitable for many airplane environments without reliability problems.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Sancho Pitarch, Jose Carlos; Kerbyson, Darren; Lang, Mike

Increasing the core-count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements can be achieved in this scheme, up to 28%, when using two memory controllers, each with one channel, compared with one controller with two memory channels.
FOLLOW-UP RADON MEASUREMENTS IN 14 MITIGATED SCHOOLS

EPA Science Inventory

The report gives results of a determination of the long-term performance of radon mitigation systems installed in U. S. EPA research schools: radon measurements were conducted in 14 schools that had been mitigated between 1988 and 1991. The measurements were made between Februar...
A uniform approach for programming distributed heterogeneous computing systems

PubMed Central

Grasso, Ivan; Pellegrini, Simone; Cosenza, Biagio; Fahringer, Thomas

2014-01-01

Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are getting increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms making application development very challenging. In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture, offering advanced features such as inter-context and inter-node device synchronization. It provides a runtime system which tracks dependency information enforced by event synchronization to dynamically build a DAG of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal. We assess libWater’s performance in three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations. PMID:25844015
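The idea of turning event dependencies into a DAG of commands can be sketched generically. The code below is not libWater's API or runtime; it is a plain topological ordering (Kahn's algorithm) over hypothetical command names, where each command lists the commands whose completion events it waits on.

```python
from collections import deque


def build_schedule(wait_list):
    """Order commands so each runs after the commands it waits on.

    wait_list: dict mapping command name -> list of command names whose
               completion events it depends on (all names must be keys).
    Illustrative only; the command names used below are hypothetical.
    """
    dependents = {cmd: [] for cmd in wait_list}
    pending = {cmd: len(deps) for cmd, deps in wait_list.items()}
    for cmd, deps in wait_list.items():
        for dep in deps:
            dependents[dep].append(cmd)   # edge dep -> cmd in the DAG

    ready = deque(cmd for cmd, count in pending.items() if count == 0)
    order = []
    while ready:
        cmd = ready.popleft()
        order.append(cmd)
        for nxt in dependents[cmd]:
            pending[nxt] -= 1
            if pending[nxt] == 0:         # all awaited events have fired
                ready.append(nxt)
    if len(order) != len(wait_list):
        raise ValueError("dependency cycle: not a DAG")
    return order


# Hypothetical command stream: upload, run a kernel, read back the result.
order = build_schedule({
    "write_a": [],
    "kernel": ["write_a"],
    "read_b": ["kernel"],
    "write_c": [],
})
```

Once the graph is explicit, a runtime can inspect it for patterns before issuing commands, which is the kind of opportunity the abstract's two optimizations rely on.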
Optimizing a mobile robot control system using GPU acceleration

NASA Astrophysics Data System (ADS)

Tuck, Nat; McGuinness, Michael; Martin, Fred

2012-01-01

This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.

Traditional Tracking with Kalman Filter on Parallel Architectures

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; MacNeill, Ian; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2015-05-01

Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The most common track finding techniques in use today are however those based on the Kalman Filter. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. We report the results of our investigations into the potential and limitations of these algorithms on the new parallel hardware.
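As background for the Kalman Filter mentioned above, the predict/update cycle can be shown in its simplest scalar form. This is a minimal sketch for a single constant quantity, not the track-fitting code studied by the authors; the function name, parameter names, and noise values are assumptions chosen for illustration.

```python
def kalman_1d(measurements, process_var, meas_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a quantity assumed constant between steps.

    Returns the state estimate after each measurement. Illustrative only.
    """
    x, p = x0, p0                      # state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the state model is "no change", so only uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical usage: twenty identical readings of 5.0, starting from 0.0.
estimates = kalman_1d([5.0] * 20, process_var=0.01, meas_var=1.0)
```

In track fitting the state is a vector of track parameters and the scalar operations above become small matrix operations, which is why the vector units mentioned in the abstract are relevant.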
Fast access to the CMS detector condition data employing HTML5 technologies

NASA Astrophysics Data System (ADS)

Pierro, Giuseppe Antonio; Cavallari, Francesca; Di Guida, Salvatore; Innocente, Vincenzo

2011-12-01

This paper focuses on using HTML version 5 (HTML5) for accessing condition data for the CMS experiment, evaluating the benefits and risks posed by the use of this technology. According to the authors of HTML5, this technology attempts to solve issues found in previous iterations of HTML and addresses the needs of web applications, an area previously not adequately covered by HTML. We demonstrate that employing HTML5 brings important benefits in terms of access performance to the CMS condition data. The combined use of web storage and web sockets allows increasing the performance and reducing the costs in terms of computation power, memory usage and network bandwidth for client and server. Above all, web workers allow creating different scripts that can be executed in multi-threaded mode, exploiting multi-core microprocessors. Web workers have been employed in order to substantially decrease the web page rendering time to display the condition data stored in the CMS condition database.
Energy and time determine scaling in biological and computer designs

PubMed Central

Bezerra, George; Edwards, Benjamin; Brown, James; Forrest, Stephanie

2016-01-01

Metabolic rate in animals and power consumption in computers are analogous quantities that scale similarly with size. We analyse vascular systems of mammals and on-chip networks of microprocessors, where natural selection and human engineering, respectively, have produced systems that minimize both energy dissipation and delivery times. Using a simple network model that simultaneously minimizes energy and time, our analysis explains empirically observed trends in the scaling of metabolic rate in mammals and power consumption and performance in microprocessors across several orders of magnitude in size. Just as the evolutionary transitions from unicellular to multicellular animals in biology are associated with shifts in metabolic scaling, our model suggests that the scaling of power and performance will change as computer designs transition to decentralized multi-core and distributed cyber-physical systems. More generally, a single energy–time minimization principle may govern the design of many complex systems that process energy, materials and information. This article is part of the themed issue ‘The major synthetic evolutionary transitions’. PMID:27431524
Energy and time determine scaling in biological and computer designs.

PubMed

Moses, Melanie; Bezerra, George; Edwards, Benjamin; Brown, James; Forrest, Stephanie

2016-08-19

Metabolic rate in animals and power consumption in computers are analogous quantities that scale similarly with size. We analyse vascular systems of mammals and on-chip networks of microprocessors, where natural selection and human engineering, respectively, have produced systems that minimize both energy dissipation and delivery times. Using a simple network model that simultaneously minimizes energy and time, our analysis explains empirically observed trends in the scaling of metabolic rate in mammals and power consumption and performance in microprocessors across several orders of magnitude in size. Just as the evolutionary transitions from unicellular to multicellular animals in biology are associated with shifts in metabolic scaling, our model suggests that the scaling of power and performance will change as computer designs transition to decentralized multi-core and distributed cyber-physical systems. More generally, a single energy-time minimization principle may govern the design of many complex systems that process energy, materials and information. This article is part of the themed issue 'The major synthetic evolutionary transitions'. © 2016 The Author(s).
A uniform approach for programming distributed heterogeneous computing systems.

PubMed

Grasso, Ivan; Pellegrini, Simone; Cosenza, Biagio; Fahringer, Thomas

2014-12-01

Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are getting increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms making application development very challenging. In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture, offering advanced features such as inter-context and inter-node device synchronization. It provides a runtime system which tracks dependency information enforced by event synchronization to dynamically build a DAG of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal. We assess libWater's performance in three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations.
Low-power, transparent optical network interface for high bandwidth off-chip interconnects.

PubMed

Liboiron-Ladouceur, Odile; Wang, Howard; Garg, Ajay S; Bergman, Keren

2009-04-13

The recent emergence of multicore architectures and chip multiprocessors (CMPs) has accelerated the bandwidth requirements in high-performance processors for both on-chip and off-chip interconnects. For next generation computing clusters, the delivery of scalable power efficient off-chip communications to each compute node has emerged as a key bottleneck to realizing the full computational performance of these systems. The power dissipation is dominated by the off-chip interface and the necessity to drive high-speed signals over long distances. We present a scalable photonic network interface approach that fully exploits the bandwidth capacity offered by optical interconnects while offering significant power savings over traditional E/O and O/E approaches. The power-efficient interface optically aggregates electronic serial data streams into a multiple WDM channel packet structure at time-of-flight latencies. We demonstrate a scalable optical network interface with 70% improvement in power efficiency for a complete end-to-end PCI Express data transfer.
ANNarchy: a code generation approach to neural simulations on parallel hardware

PubMed Central

Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.

2015-01-01

Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which makes it easy to define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into efficient C++ code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
Assessment of Efficiency and Performance in Tsunami Numerical Modeling with GPU

NASA Astrophysics Data System (ADS)

Yalciner, Bora; Zaytsev, Andrey

2017-04-01

Non-linear shallow water equations (NSWE) are used to solve the propagation and coastal amplification of long waves and tsunamis. The leap-frog finite difference scheme is one of the satisfactory numerical methods widely used for these problems. Tsunami numerical models are necessary not only for academic but also for operational purposes, which need fast and accurate solutions. Recent developments in information technology provide considerably faster numerical solutions in this respect and are becoming one of the crucial requirements. The tsunami numerical code NAMI DANCE uses the finite difference method to solve linear and non-linear forms of the shallow water equations for long wave problems, specifically for tsunamis. In this study, the new code is structured for the Graphical Processing Unit (GPU) using the CUDA API. The new code is applied to different (analytical, experimental and field) tsunami benchmark problems for testing. One of those applications is the 2011 Great East Japan tsunami, which was instrumentally recorded on various types of gauges including tide and wave gauges, offshore GPS buoys, cabled Ocean Bottom Pressure (OBP) gauges and DART buoys. The results are compared with the measurements and fairly good agreement is obtained. The efficiency and performance of the code are also compared with the version using a multi-core Central Processing Unit (CPU). The dependence of GPU simulation speed on linear or non-linear solutions is also investigated. One of the results is that the simulation speed is increased up to 75 times compared with the process time on a computer using a single 4/8 thread multi-core CPU. The results are presented with comparisons and discussions. Furthermore, how multi-dimensional finite difference problems fit the GPU architecture is also discussed.
The research leading to this study has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement No: 603839 (Project ASTARTE-Assessment, Strategy and Risk Reduction for Tsunamis in Europe). PARI, Japan and NOAA, USA are acknowledged for the data of the measurements. Prof. Ahmet C. Yalciner is also acknowledged for his long term and sustained support to the authors.
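The leap-frog finite difference idea can be sketched for the simplest case, the one-dimensional linear shallow water equations on a staggered grid with a flat bottom and reflecting walls. This is an illustrative sketch only and is not taken from NAMI DANCE; the function name, grid layout, boundary treatment, and the example parameters are all assumptions.

```python
import math


def leapfrog_linear_swe(eta0, depth, dx, dt, steps, g=9.81):
    """Staggered leap-frog update for the 1D linear shallow water equations.

    eta0:  initial free-surface elevation at cell centres
    depth: constant still-water depth (flat bottom assumed)
    Discharge flux lives on cell faces and is kept at zero on both end
    walls. Stability requires dt <= dx / sqrt(g * depth).
    """
    n = len(eta0)
    eta = list(eta0)
    flux = [0.0] * (n + 1)
    for _ in range(steps):
        # Momentum equation: update interior face fluxes from the surface slope.
        for i in range(1, n):
            flux[i] -= g * depth * dt / dx * (eta[i] - eta[i - 1])
        # Continuity equation: update elevations from the flux divergence.
        for i in range(n):
            eta[i] -= dt / dx * (flux[i + 1] - flux[i])
    return eta


# Hypothetical setup: a Gaussian hump in 100 m of water, 1 km cells.
initial = [math.exp(-((i - 50) / 5.0) ** 2) for i in range(100)]
final = leapfrog_linear_swe(initial, depth=100.0, dx=1000.0, dt=10.0, steps=200)
```

Each grid point is updated from its immediate neighbours only, so the two inner loops are independent across `i`; that independence is what makes this family of schemes map naturally onto many GPU threads.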
Batched matrix computations on hardware accelerators based on GPUs

DOE PAGES

Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; ...

2015-02-09

Scientific applications require solvers that work on many small problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications' context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library. Finally, the tested system featured two sockets of Intel Sandy Bridge CPUs; compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU.
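To make the notion of a batched factorization concrete, the sketch below factorizes a list of small symmetric positive-definite matrices one by one with a plain Cholesky routine. It only illustrates the problem shape (many independent small factorizations behind one call); it is not the GPU implementation described above and uses no BLAS, MAGMA, or CUBLAS routines. The function names and the example matrices are assumptions.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T = a (a must be SPD)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of earlier columns.
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(diag)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            off = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = off / low[j][j]
    return low


def batched_cholesky(batch):
    """Factorize every matrix in the batch; each problem is independent."""
    return [cholesky(a) for a in batch]


# Hypothetical batch of two 2x2 SPD matrices.
factors = batched_cholesky([
    [[4.0, 2.0], [2.0, 3.0]],
    [[9.0, 3.0], [3.0, 5.0]],
])
```

Because no matrix in the batch depends on another, the loop in `batched_cholesky` could be spread across threads or GPU thread blocks, which is the parallelism a batched design exploits.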
GREENHOUSE GAS (GHG) MITIGATION AND MONITORING TECHNOLOGY PERFORMANCE: ACTIVITIES OF THE GHG TECHNOLOGY VERIFICATION CENTER

EPA Science Inventory

The paper discusses greenhouse gas (GHG) mitigation and monitoring technology performance activities of the GHG Technology Verification Center. The Center is a public/private partnership between Southern Research Institute and the U.S. EPA's Office of Research and Development. It...
75 FR 61161 - Agency Information Collection Activities: Proposed Collection; Comment Request, OMB No. 1660-0072...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-10-04

... program/project performance for Flood Mitigation Assistance program, Severe Repetitive Loss, Repetitive Flood Claim, and Pre-Disaster Mitigation activities. DATES: Comments must be submitted on or before... INFORMATION: This collection of information is necessary to implement grants for the Flood Mitigation...
Dynamic characterization of frequency response of shock mitigation of a polymethylene diisocyanate (PMDI) based rigid polyurethane foam

DOE PAGES

Song, Bo; Nelson, Kevin

2015-09-01

Kolsky compression bar experiments were conducted to characterize the shock mitigation response of a polymethylene diisocyanate (PMDI) based rigid polyurethane foam, abbreviated as PMDI foam in this study. The Kolsky bar experimental data were analyzed in the frequency domain with respect to impact energy dissipation and acceleration attenuation to perform a shock mitigation assessment on the foam material. The PMDI foam material exhibits excellent performance in both energy dissipation and acceleration attenuation, particularly for the impact frequency content over 1.5 kHz. This frequency (1.5 kHz) was observed to be independent of specimen thickness and impact speed, which may represent the characteristic shock mitigation frequency of the PMDI foam material under investigation. The shock mitigation characteristics of the PMDI foam material were insignificantly influenced by the specimen thickness. Impact speed, however, did have some effect.
Dynamic characterization of frequency response of shock mitigation of a polymethylene diisocyanate (PMDI) based rigid polyurethane foam

DOE Office of Scientific and Technical Information (OSTI.GOV)

Song, Bo; Nelson, Kevin

Kolsky compression bar experiments were conducted to characterize the shock mitigation response of a polymethylene diisocyanate (PMDI) based rigid polyurethane foam, abbreviated as PMDI foam in this study. The Kolsky bar experimental data were analyzed in the frequency domain with respect to impact energy dissipation and acceleration attenuation to perform a shock mitigation assessment on the foam material. The PMDI foam material exhibits excellent performance in both energy dissipation and acceleration attenuation, particularly for the impact frequency content over 1.5 kHz. This frequency (1.5 kHz) was observed to be independent of specimen thickness and impact speed, which may represent the characteristic shock mitigation frequency of the PMDI foam material under investigation. The shock mitigation characteristics of the PMDI foam material were insignificantly influenced by the specimen thickness. Impact speed, however, did have some effect.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE PAGES

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; ...

2017-09-14

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shao, Meiyue; Aktulga, H. Metin; Yang, Chao

In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
Aeroacoustic Codes for Rotor Harmonic and BVI Noise. CAMRAD.Mod1/HIRES: Methodology and Users' Manual

NASA Technical Reports Server (NTRS)

Boyd, D. Douglas, Jr.; Brooks, Thomas F.; Burley, Casey L.; Jolly, J. Ralph, Jr.

1998-01-01

This document details the methodology and use of the CAMRAD.Mod1/HIRES codes, which were developed at NASA Langley Research Center for the prediction of helicopter harmonic and Blade-Vortex Interaction (BVI) noise. CAMRAD.Mod1 is a substantially modified version of the performance/trim/wake code CAMRAD. High resolution blade loading is determined in post-processing by HIRES and an associated indicial aerodynamics code. Extensive capabilities of importance to noise prediction accuracy are documented, including a new multi-core tip vortex roll-up wake model, higher harmonic and individual blade control, tunnel and fuselage correction input, diagnostic blade motion input, and interfaces for acoustic and CFD aerodynamics codes. Modifications and new code capabilities are documented with examples. A users' job preparation guide and listings of variables and namelists are given.
Parallel scalability and efficiency of vortex particle method for aeroelasticity analysis of bluff bodies

NASA Astrophysics Data System (ADS)

Tolba, Khaled Ibrahim; Morgenthal, Guido

2018-01-01

This paper presents an analysis of the scalability and efficiency of a simulation framework based on the vortex particle method. The code is applied to the numerical aerodynamic analysis of line-like structures. The numerical code runs on multicore CPU and GPU architectures using the OpenCL framework. The focus of this paper is the analysis of the parallel efficiency and scalability of the method when applied to an engineering test case, specifically the aeroelastic response of a long-span bridge girder at the construction stage. The target is to assess the optimal configuration and the required computer architecture, such that it becomes feasible to efficiently utilise the method within the computational resources available to a regular engineering office. The simulations and the scalability analysis are performed on a regular gaming-type computer.
A pluggable framework for parallel pairwise sequence search.

PubMed

Archuleta, Jeremy; Feng, Wu-chun; Tilevich, Eli

2007-01-01

The current and near future of the computing industry is one of multi-core and multi-processor technology. Most existing sequence-search tools have been designed with a focus on single-core, single-processor systems. This discrepancy between software design and hardware architecture substantially hinders sequence-search performance by not allowing full utilization of the hardware. This paper presents a novel framework that will aid the conversion of serial sequence-search tools into a parallel version that can take full advantage of the available hardware. The framework, which is based on a software architecture called mixin layers with refined roles, enables modules to be plugged into the framework with minimal effort. The inherent modular design improves maintenance and extensibility, thus opening up a plethora of opportunities for advanced algorithmic features to be developed and incorporated while routine maintenance of the codebase persists.
Secure Heterogeneous Multicore Platform Through Diversity and Redundancy

DTIC Science & Technology

2012-03-31

implementation detects synchronization in this way. If a programmer uses custom synchronization primitives, our approach assumes that such primitives ... synchronization primitives. Primitives such as barriers and spinlocks explicitly enforce a pre-determined ordering among threads. Therefore, the outcome of...these synchronization operations are deterministic. In the discussion, we will refer to these primitives as ordering synchronization operations. On the
A Stroboscopic Light Source for Experiments in Mechanics

ERIC Educational Resources Information Center

Mayer, V. V.; Varaksina, E. I.

2017-01-01

We propose to attach a small stroboscopic light source to a moving object and connect the source to a pulse generator with the help of insulated thin flexible multi-cored wires. Students can assemble such a device independently in a school laboratory. The device can be used to obtain trajectories with time marks in students' research projects in…

Fire behavior simulation in Mediterranean forests using the minimum travel time algorithm

Treesearch

Kostas Kalabokidis; Palaiologos Palaiologou; Mark A. Finney

2014-01-01

Recent large wildfires in Greece exemplify the need for pre-fire burn probability assessment and possible landscape fire flow estimation to enhance fire planning and resource allocation. The Minimum Travel Time (MTT) algorithm, incorporated as FlamMap's version five module, provides valuable fire behavior functions, while enabling multi-core utilization for the...
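The phrase "minimum travel time" can be illustrated with a generic shortest-path search over a raster. The sketch below is not FlamMap's MTT implementation, whose details are not given in the record above; it is ordinary Dijkstra search on a 4-connected grid, with a hypothetical per-cell crossing time standing in for the inverse of local spread rate.

```python
import heapq


def minimum_travel_time(cell_time, ignition):
    """Earliest arrival time at every cell from one ignition cell.

    cell_time: 2D list; cell_time[r][c] is the (hypothetical) time needed
               to cross cell (r, c). Moving between neighbouring cell
               centres costs the average of the two cells' crossing times.
    ignition:  (row, col) of the ignition cell.
    """
    rows, cols = len(cell_time), len(cell_time[0])
    arrival = [[float("inf")] * cols for _ in range(rows)]
    r0, c0 = ignition
    arrival[r0][c0] = 0.0
    heap = [(0.0, r0, c0)]
    while heap:
        t, r, c = heapq.heappop(heap)
        if t > arrival[r][c]:
            continue                      # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nt = t + 0.5 * (cell_time[r][c] + cell_time[nr][nc])
                if nt < arrival[nr][nc]:
                    arrival[nr][nc] = nt
                    heapq.heappush(heap, (nt, nr, nc))
    return arrival


# Hypothetical 3x3 landscape with one slow-burning cell in the middle.
arrival = minimum_travel_time(
    [[1.0, 1.0, 1.0],
     [1.0, 9.0, 1.0],
     [1.0, 1.0, 1.0]],
    ignition=(0, 0),
)
```

In this example the fastest route to the far corner goes around the slow centre cell, so the arrival-time map bends around it.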
Flight path-driven mitigation of wavefront curvature effects in SAR images

DOEpatents

Doerry, Armin W [Albuquerque, NM]

2009-06-23

A wavefront curvature effect associated with a complex image produced by a synthetic aperture radar (SAR) can be mitigated based on which of a plurality of possible flight paths is taken by the SAR when capturing the image. The mitigation can be performed differently for different ones of the flight paths.
Interference Mitigation Schemes for Wireless Body Area Sensor Networks: A Comparative Survey

PubMed Central

Le, Thien T.T.; Moh, Sangman

2015-01-01

A wireless body area sensor network (WBASN) consists of a coordinator and multiple sensors to monitor the biological signals and functions of the human body. This exciting area has motivated new research and standardization processes, especially in the area of WBASN performance and reliability. In scenarios of mobility or overlapped WBASNs, system performance will be significantly degraded because of unstable signal integrity. Hence, it is necessary to consider interference mitigation in the design. This survey presents a comparative review of interference mitigation schemes in WBASNs. Further, we show that current solutions are limited in reaching satisfactory performance, and thus, more advanced solutions should be developed in the future. PMID:26110407
PDC bits break ground with advanced vibration mitigation

DOE Office of Scientific and Technical Information (OSTI.GOV)

NONE

1995-10-01

Advancements in PDC bit technology have resulted in the identification and characterization of different types of vibrational modes that historically have limited PDC bit performance. As a result, concepts have been developed that prevent the initiation of vibration and also mitigate its damaging effects once it occurs. This vibration-reducing concept ensures more efficient use of the energy available to a PDC bit, thereby improving its performance. This improved understanding of the complex forces affecting bit performance is driving bit customization for specific drilling programs.
FOLLOW-UP DURABILITY MEASUREMENTS AND MITIGATION PERFORMANCE IMPROVEMENT TESTS IN 38 EASTERN PENNSYLVANIA HOUSES HAVING INDOOR REDUCTION SYSTEMS

EPA Science Inventory

The report gives results of follow-up tests in 38 difficult-to-mitigate Pennsylvania houses where indoor radon reduction systems had been installed 2 to 4 years earlier. Objectives were to assess system durability, methods for improving performance, and methods for reducing insta...
Damage-mitigating control of aerospace systems for high performance and extended life

NASA Technical Reports Server (NTRS)

Ray, Asok; Wu, Min-Kuang; Carpino, Marc; Lorenzo, Carl F.; Merrill, Walter C.

1992-01-01

The concept of damage-mitigating control is to minimize fatigue (as well as creep and corrosion) damage of critical components of mechanical structures while simultaneously maximizing the system dynamic performance. Given a dynamic model of the plant and the specifications for performance and stability robustness, the task is to synthesize a control law that would meet the system requirements and, at the same time, satisfy the constraints that are imposed by the material and structural properties of the critical components. The authors present the concept of damage-mitigating control systems design with the following objectives: (1) to achieve high performance with a prolonged life span; and (2) to systematically update the controller as the new technology of advanced materials evolves. The major challenge is to extract the information from the material properties and then utilize this information in a mathematical form so that it can be directly applied to robust control synthesis for mechanical systems. The basic concept of damage-mitigating control is illustrated using a relatively simplified model of a space shuttle main engine.
Adapting Wave-front Algorithms to Efficiently Utilize Systems with Deep Communication Hierarchies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J.; Lang, Michael; Pakin, Scott

2011-09-30

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance, especially in hybrid systems using accelerators. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contains wavefront processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional steps in the parallel computation and higher use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the Reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.
Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J; Lang, Michael; Pakin, Scott

2009-01-01

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional computation and higher use of on-chip communications. This tradeoff is explored using a performance model, and an implementation on the Petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.
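The dependency pattern described in the two wave-front records above (a cell can be processed only after its upstream neighbors) can be illustrated with a minimal serial sketch. It assumes a 2D grid in which cell (i, j) depends on (i-1, j) and (i, j-1); the function names `wavefront_order` and `sweep` are hypothetical, and the sketch does not reproduce the hierarchical scheme or the Roadrunner implementation from the papers.

```python
def wavefront_order(rows, cols):
    """Group grid cells into anti-diagonal wavefronts.

    Cell (i, j) depends on its upstream neighbours (i-1, j) and (i, j-1),
    so all cells sharing the same i + j are mutually independent.
    """
    fronts = []
    for d in range(rows + cols - 1):
        lo = max(0, d - cols + 1)
        hi = min(rows, d + 1)
        fronts.append([(i, d - i) for i in range(lo, hi)])
    return fronts


def sweep(grid):
    """Toy wavefront sweep: each cell adds the results of its upstream cells."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        # Cells inside one front could be handed to different cores.
        for i, j in front:
            up = out[i - 1][j] if i > 0 else 0
            left = out[i][j - 1] if j > 0 else 0
            out[i][j] = grid[i][j] + up + left
    return out
```

In a parallel setting, the number of fronts (rows + cols - 1) bounds the number of synchronization steps, which is where the cost of the slowest communication channel would enter.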
Scalable, High-performance 3D Imaging Software Platform: System Architecture and Application to Virtual Colonoscopy

PubMed Central

Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli; Brett, Bevin

2013-01-01

One of the key challenges in three-dimensional (3D) medical imaging is to enable fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. In this work, we have developed a software platform that is designed to support high-performance 3D medical image processing for a wide range of applications using increasingly available and affordable commodity computing systems: multi-core, clusters, and cloud computing systems. To achieve scalable, high-performance computing, our platform (1) employs size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D image processing algorithms; (2) supports task scheduling for efficient load distribution and balancing; and (3) consists of layered parallel software libraries that allow a wide range of medical applications to share the same functionalities. We evaluated the performance of our platform by applying it to an electronic cleansing system in virtual colonoscopy, with initial experimental results showing a 10 times performance improvement on an 8-core workstation over the original sequential implementation of the system. PMID:23366803
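As a rough illustration of the block-volume idea in the abstract above, the following sketch splits a 3D volume into blocks that could be distributed to separate workers. It assumes a fixed block shape and half-open index bounds; the names `block_ranges` and `split_volume` are hypothetical, and the platform's size-adaptive scheme is not specified in the abstract and is not reproduced here.

```python
from itertools import product


def block_ranges(length, block):
    """Split [0, length) into consecutive half-open ranges of at most `block` voxels."""
    return [(start, min(start + block, length)) for start in range(0, length, block)]


def split_volume(shape, block_shape):
    """Return the per-axis bounds of every block covering a volume of `shape`."""
    axes = [block_ranges(n, b) for n, b in zip(shape, block_shape)]
    return list(product(*axes))


def block_size(bounds):
    """Number of voxels inside one block given its per-axis bounds."""
    size = 1
    for lo, hi in bounds:
        size *= hi - lo
    return size
```

Each returned block is independent of the others, so a task scheduler could assign blocks to cores or cluster nodes; blocks at the volume edge are simply smaller.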
Design and optimization of a portable LQCD Monte Carlo code using OpenACC

NASA Astrophysics Data System (ADS)

Bonati, Claudio; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Calore, Enrico; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele

The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processor Units (GPUs), exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) become necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.
Evaluation of Low-Cost Mitigation Measures Implemented to Improve Air Quality in Nursery and Primary Schools.

PubMed

Sá, Juliana P; Branco, Pedro T B S; Alvim-Ferraz, Maria C M; Martins, Fernando G; Sousa, Sofia I V

2017-05-31

Indoor air pollution mitigation measures are highly important due to the associated health impacts, especially on children, a risk group that spends significant time indoors. Thus, the main goal of the work reported here was the evaluation of mitigation measures implemented in nursery and primary schools to improve air quality. Continuous measurements of CO₂, CO, NO₂, O₃, CH₂O, total volatile organic compounds (VOC), PM₁, PM₂.₅, PM₁₀, Total Suspended Particles (TSP) and radon, as well as temperature and relative humidity were performed in two campaigns, before and after the implementation of low-cost mitigation measures. Evaluation of those mitigation measures was performed through the comparison of the concentrations measured in both campaigns. Exceedances of the values set by the national legislation and World Health Organization (WHO) were found for PM₂.₅, PM₁₀, CO₂ and CH₂O during both indoor air quality campaigns. Temperature and relative humidity values were also above the ranges recommended by the American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE). In general, pollutant concentrations measured after the implementation of low-cost mitigation measures were significantly lower, mainly for CO₂. However, mitigation measures were not always sufficient to decrease the pollutants' concentrations to values considered safe to protect human health.
Nanocarbon-based membrane filtration integrated with electric field driving for effective membrane fouling mitigation.

PubMed

Fan, Xinfei; Zhao, Huimin; Quan, Xie; Liu, Yanming; Chen, Shuo

2016-01-01

Membrane filtration provides an effective solution for removing pollutants from water but is limited by serious membrane fouling. In this work, an effective approach was used to mitigate membrane fouling by integrating membrane filtration with electropolarization using an electroconductive nanocarbon-based membrane. The electropolarized membrane (EM) by alternating square-wave potentials between +1.0 V and -1.0 V with a pulse width of 60 s exhibited a permeate flux 8.1 times as high as that without electropolarization for filtering feed water containing bacteria, which confirms the ability of the EM to achieve biofouling mitigation. Moreover, the permeate flux of the EM was 1.5 times as high as that without electropolarization when filtering natural organic matter (NOM) from water, demonstrating good performance of the EM in organic fouling mitigation. Furthermore, the EM was also effective for complex fouling mitigation in filtering water containing coexisting bacteria and NOM, and presented an increased flux rate 1.9 times as high as that without electropolarization. The superior fouling mitigation performance of the EM was attributed to the synergistic effects of electrostatic repulsion, electrochemical oxidation and electrokinetic behaviors. This work opens an effective avenue for membrane fouling mitigation of water-treatment membrane filtration systems. Copyright © 2015 Elsevier Ltd. All rights reserved.
Attacking the One-Out-Of-m Multicore Problem by Combining Hardware Management with Mixed-Criticality Provisioning

DTIC Science & Technology

2015-05-01

LLC and DRAM banks. For each µB task and isolation configuration, we ran experiments with all 256 possible LLC area sizes (given by 1 to 16 ways and 1...isolation on multicore platforms. In RTAS ’14. [29] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. Memory access control in multiprocessor
Examination of Multi-Core Architectures

DTIC Science & Technology

2010-11-01

NOVEMBER 2010 2. REPORT TYPE Interim Technical Report 3. DATES COVERED (From - To) February 2010 – July 2010 4 . TITLE AND SUBTITLE EXAMINATION OF...STATEMENT 1 2.0 BACKGROUND 1 3.0 ARCHITECTURE CHARACTERISTICS 3 3.1 NVIDIA Tesla 3 3.2 TILE64 4 ...1 Tesla Architecture 3 2 TILE64 Architecture 4 3 Single Tile Architecture 4 4 STI Cell Broadband Engine
Localized states in a triangular set of linearly coupled complex Ginzburg-Landau equations.

PubMed

Sigler, Ariel; Malomed, Boris A; Skryabin, Dmitry V

2006-12-01

We introduce a pattern-formation model based on a symmetric system of three linearly coupled cubic-quintic complex Ginzburg-Landau equations, which form a triangular configuration. This is the simplest model of a multicore fiber laser. We identify stability regions for various types of localized patterns possible in this setting, which include stationary and breathing triangular vortices.
A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs).

PubMed

Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo

2018-02-01

Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

DOE PAGES

Lyakh, Dmitry I.

2015-01-05

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naive scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
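To make the contrast between a naive scattering transpose and a cache-aware one concrete, here is a minimal sketch for the 2D special case only. It assumes a row-major matrix stored in a flat list; the names `transpose_naive` and `transpose_blocked` and the tile size are illustrative assumptions, and this is not the algorithm or API of the paper or of TAL-SH. In pure Python it shows only the access pattern, not the speedup.

```python
def transpose_naive(a, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list."""
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Reads are contiguous, writes are strided ("scattered").
            out[j * rows + i] = a[i * cols + j]
    return out


def transpose_blocked(a, rows, cols, tile=32):
    """Same result, visiting the matrix tile by tile to keep accesses local."""
    out = [0] * (rows * cols)
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            # Both the source and destination tiles stay small enough to
            # remain in cache on hardware where that matters.
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    out[j * rows + i] = a[i * cols + j]
    return out
```

A higher-dimensional tensor transpose generalizes this by permuting several indices at once, which makes the choice of which index pairs to tile considerably more involved.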
Cd-free Cu-Zn-In-S/ZnS quantum dots@SiO2 multiple cores nanostructure: preparation and application for white LEDs

NASA Astrophysics Data System (ADS)

Jiang, Tongtong; Shen, Mohan; Dai, Peng; Wu, Mingzai; Yu, Xinxin; Li, Guang; Xu, Xiaoliang; Zeng, Haibo

2017-10-01

The work reports the fabrication of Cu doped Zn-In-S (CZIS) alloy quantum dots (QDs) using dodecanethiol and oleic acid as stabilizing ligands. With the increase of doped Cu element, the photoluminescence (PL) peak is monotonically red shifted. After coating ZnS shell, the PL quantum yield of CZIS QDs can reach 78%. Using reverse micelle microemulsion method, CZIS/ZnS QDs@SiO2 multi-core nanospheres were synthesized to improve the colloidal stability and avoid the aggregation of QDs. The obtained multi-core nanospheres were dispersed in curing adhesive, and applied as a color conversion layer in down converted light-emitting diodes. After encapsulation in curing adhesive, the newly designed LEDs show artificially regulated color coordinates when varying the weight ratio of green QDs and red QDs, and the concentrations of these two types of QDs. Moreover, natural white and warm white LEDs with correlated color temperature of 5287, 6732, 2731, and 3309 K can be achieved, which indicates that CZIS/ZnS QDs@SiO2 nanostructures are a promising color conversion layer material for solid-state lighting application.
MUTILS - a set of efficient modeling tools for multi-core CPUs implemented in MEX

NASA Astrophysics Data System (ADS)

Krotkiewski, Marcin; Dabrowski, Marcin

2013-04-01

The need for computational performance is common in scientific applications, and in particular in numerical simulations, where high resolution models require efficient processing of large amounts of data. Especially in the context of geological problems the need to increase the model resolution to resolve physical and geometrical complexities seems to have no limits. Alas, the performance of new generations of CPUs does not improve any longer by simply increasing clock speeds. Current industrial trends are to increase the number of computational cores. As a result, parallel implementations are required in order to fully utilize the potential of new processors, and to study more complex models. We target simulations on small to medium scale shared memory computers: laptops and desktop PCs with ~8 CPU cores and up to tens of GB of memory to high-end servers with ~50 CPU cores and hundreds of GB of memory. In this setting MATLAB is often the environment of choice for scientists who want to implement their own models with little effort. It is a useful general purpose mathematical software package, but due to its versatility some of its functionality is not as efficient as it could be. In particular, the challenges of modern multi-core architectures are not fully addressed. We have developed MILAMIN 2 - an efficient FEM modeling environment written in native MATLAB. Amongst others, MILAMIN provides functions to define model geometry, generate and convert structured and unstructured meshes (also through interfaces to external mesh generators), compute element and system matrices, apply boundary conditions, solve the system of linear equations, address non-linear and transient problems, and perform post-processing. MILAMIN strives to combine ease of code development with computational efficiency. Where possible, the code is optimized and/or parallelized within the MATLAB framework.
Native MATLAB is augmented with the MUTILS library - a set of MEX functions that implement the computationally intensive, performance critical parts of the code, which we have identified to be bottlenecks. Here, we discuss the functionality and performance of the MUTILS library. Currently, it includes: 1. time and memory efficient assembly of sparse matrices for FEM simulations; 2. parallel sparse matrix-vector product with optimizations specific to symmetric matrices and multiple degrees of freedom per node; 3. parallel point in triangle location and point in tetrahedron location for unstructured, adaptive 2D and 3D meshes (useful for 'marker in cell' type of methods); 4. parallel FEM interpolation for 2D and 3D meshes of elements of different types and orders, and for different number of degrees of freedom per node; 5. a stand-alone, MEX implementation of the Conjugate Gradients iterative solver; 6. interface to METIS graph partitioning and a fast implementation of RCM reordering
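The point-in-triangle location listed among the MUTILS functions above can be sketched with barycentric coordinates. This is a serial, brute-force illustration under stated assumptions: the names `barycentric` and `locate` and the tolerance `eps` are hypothetical, the mesh is a plain list of node coordinates and index triples, and nothing here reflects the actual MUTILS interface or its parallel search strategy.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    l1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return l1, l2, 1.0 - l1 - l2


def locate(p, nodes, triangles, eps=1e-12):
    """Return the index of the first triangle containing p, or -1 if none does."""
    for t, (i, j, k) in enumerate(triangles):
        coords = barycentric(p, nodes[i], nodes[j], nodes[k])
        # p lies inside (or on the edge of) the triangle when no
        # barycentric coordinate is negative.
        if all(l >= -eps for l in coords):
            return t
    return -1
```

For a 'marker in cell' method, a lookup like `locate` would be called once per marker, and the same barycentric coordinates can then serve as linear interpolation weights.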
Performance evaluation of a semi-active cladding connection for multi-hazard mitigation

NASA Astrophysics Data System (ADS)

Gong, Yongqiang; Cao, Liang; Micheli, Laura; Laflamme, Simon; Quiel, Spencer; Ricles, James

2018-03-01

A novel semi-active damping device termed Variable Friction Cladding Connection (VFCC) has been previously proposed to leverage cladding systems for the mitigation of natural and man-made hazards. The VFCC is a semi-active friction damper that connects cladding elements to the structural system. The friction force is generated by sliding plates and varied using an actuator through a system of adjustable toggles. The dynamics of the device have been previously characterized in a laboratory environment. In this paper, the performance of the VFCC at mitigating non-simultaneous multi-hazard excitations that include wind and seismic loads is investigated on a simulated benchmark building. Simulations consider the robustness with respect to some uncertainties, including the wear of the friction surfaces and sensor failure. The performance of the VFCC is compared against other connection strategies including traditional stiffness, passive viscous, and passive friction elements. Results show that the VFCC is robust and capable of outperforming passive systems for the mitigation of multiple hazards.

Multi-core, multi-constraint chronostratigraphic framework over past 50,000 years places high-resolution Gulf of Alaska ocean-ice-sediment history into a global framework

NASA Astrophysics Data System (ADS)

Mix, A. C.; Walczak, M.; Asahi, H.; Belanger, C. L.; Cowan, E. A.; Du, J.; Fallon, S.; Fifield, L. K.; Hobern, T.; Jaeger, J. M.; Jensen, B. J. L.; McKay, J. L.; Padman, J.; Ross, A.; Sharon, S.; Stoner, J. S.; Zellers, S.

2017-12-01

Development of precise chronologies extending older than late glacial time in the subpolar North Pacific has been notoriously difficult due to limited record length in sediment cores, poor carbonate preservation, and (in many cases) relatively low resolution records. This is a key gap in our understanding of Northern Hemisphere and global paleoclimate change, now addressed with results from IODP Expedition 341 in the Gulf of Alaska. Here we utilize marine core and drill sites (U1417, U1418, U1419, U1421 and co-located site-survey cores) some of which provide exceptionally high sustained sedimentation rates (up to 2 cm per year in extended glacial intervals). This facilitates a multifaceted approach to chronology development over the past 50,000 years including radiocarbon, foraminiferal stable isotopes and other geochemical proxies, sediment physical properties, sedimentology, and tephrochronology. Given high sedimentation rates and the superb preservation this provides, we have developed marine time series that rival the resolution of the polar ice core records, which allows us to compare radiocarbon-based chronologies with several strategies involving signal tuning. Such a multifaceted approach mitigates weaknesses in any of the individual methods and allows a rigorous analysis of uncertainties in ages and sediment accumulation rates. The resulting record reveals dynamic changes in the Cordilleran Ice Sheet and North Pacific Ocean and most importantly facilitates placing these records into the context of global climate changes. (We acknowledge the contributions of J. Addison and S. Praetorius, who were not listed as co-authors due to USGS submission rules).
Implementation and evaluation of the Level Set method: Towards efficient and accurate simulation of wet etching for microengineering applications

NASA Astrophysics Data System (ADS)

Montoliu, C.; Ferrando, N.; Gosálvez, M. A.; Cerdá, J.; Colom, R. J.

2013-10-01

The use of atomistic methods, such as the Continuous Cellular Automaton (CCA), is currently regarded as a computationally efficient and experimentally accurate approach for the simulation of anisotropic etching of various substrates in the manufacture of Micro-electro-mechanical Systems (MEMS). However, when the features of the chemical process are modified, a time-consuming calibration process needs to be used to transform the new macroscopic etch rates into a corresponding set of atomistic rates. Furthermore, changing the substrate requires a labor-intensive effort to reclassify most atomistic neighborhoods. In this context, the Level Set (LS) method provides an alternative approach where the macroscopic forces affecting the front evolution are directly applied at the discrete level, thus avoiding the need for reclassification and/or calibration. Correspondingly, we present a fully-operational Sparse Field Method (SFM) implementation of the LS approach, discussing in detail the algorithm and providing a thorough characterization of the computational cost and simulation accuracy, including a comparison to the performance of the most recent CCA model. We conclude that the SFM implementation achieves accuracy similar to that of the CCA method, with fewer fluctuations in the etch front and requiring roughly 4 times less memory. Although SFM can be up to 2 times slower than CCA for the simulation of anisotropic etchants, it can also be up to 10 times faster than CCA for isotropic etchants. In addition, we present a parallel, GPU-based implementation (gSFM) and compare it to an optimized, multicore CPU version (cSFM), demonstrating that the SFM algorithm can be successfully parallelized and the simulation times consequently reduced, while keeping the accuracy of the simulations.
Although modern multicore CPUs provide an acceptable option, the massively parallel architecture of modern GPUs is more suitable, as reflected by computational times for gSFM up to 7.4 times faster than for cSFM.
Rubus: A compiler for seamless and extensible parallelism.

PubMed

Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor

2017-01-01

Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, traditional programming languages, which were designed to work with machines having single-core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores.
For a matrix multiplication benchmark, an average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
Rubus: A compiler for seamless and extensible parallelism

PubMed Central

Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor

2017-01-01

Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, traditional programming languages, which were designed to work with machines having single-core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores.
For a matrix multiplication benchmark, an average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program. PMID:29211758
Virtual machine-based simulation platform for mobile ad-hoc network-based cyber infrastructure

DOE PAGES

Yoginath, Srikanth B.; Perumalla, Kayla S.; Henz, Brian J.

2015-09-29

In modeling and simulating complex systems such as mobile ad-hoc networks (MANETs) in defense communications, it is a major challenge to reconcile multiple important considerations: the rapidity of unavoidable changes to the software (network layers and applications), the difficulty of modeling the critical, implementation-dependent behavioral effects, the need to sustain larger scale scenarios, and the desire for faster simulations. Here we present our approach in successfully reconciling them using a virtual time-synchronized virtual machine (VM)-based parallel execution framework that accurately lifts both the devices and the network communications to a virtual time plane while retaining full fidelity. At the core of our framework is a scheduling engine that operates at the level of a hypervisor scheduler, offering a unique ability to execute multi-core guest nodes over multi-core host nodes in an accurate, virtual time-synchronized manner. In contrast to other related approaches that suffer from either speed or accuracy issues, our framework provides MANET node-wise scalability, high fidelity of software behaviors, and time-ordering accuracy. The design and development of this framework is presented, and an actual implementation based on the widely used Xen hypervisor system is described. Benchmarks with synthetic and actual applications are used to identify the benefits of our approach. The time inaccuracy of traditional emulation methods is demonstrated, in comparison with the accurate execution of our framework verified by theoretically correct results expected from analytical models of the same scenarios. In the largest high fidelity tests, we are able to perform virtual time-synchronized simulation of 64-node VM-based full-stack, actual software behaviors of MANETs containing a mix of static and mobile (unmanned airborne vehicle) nodes, hosted on a 32-core host, with full fidelity of unmodified ad-hoc routing protocols, unmodified application executables, and user-controllable physical layer effects including inter-device wireless signal strength, reachability, and connectivity.
A Hybrid MPI/OpenMP Approach for Parallel Groundwater Model Calibration on Multicore Computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tang, Guoping; D'Azevedo, Ed F; Zhang, Fan

2010-01-01

Groundwater model calibration is becoming increasingly computationally time intensive. We describe a hybrid MPI/OpenMP approach to exploit two levels of parallelism in software and hardware to reduce calibration time on multicore computers with minimal parallelization effort. First, HydroGeoChem 5.0 (HGC5) is parallelized using OpenMP for a uranium transport model with over a hundred species involving nearly a hundred reactions, and for a field scale coupled flow and transport model. In the first application, a single parallelizable loop is identified to consume over 97% of the total computational time. With a few lines of OpenMP compiler directives inserted into the code, the computational time is reduced about ten times on a compute node with 16 cores. The performance is further improved by selectively parallelizing a few more loops. For the field scale application, parallelizable loops in 15 of the 174 subroutines in HGC5 are identified to take more than 99% of the execution time. By adding the preconditioned conjugate gradient solver and BICGSTAB, and using a coloring scheme to separate the elements, nodes, and boundary sides, the subroutines for finite element assembly, soil property update, and boundary condition application are parallelized, resulting in a speedup of about 10 on a 16-core compute node. The Levenberg-Marquardt (LM) algorithm is added into HGC5 with the Jacobian calculation and lambda search parallelized using MPI. With this hybrid approach, a number of compute nodes equal to the number of adjustable parameters (when the forward difference is used for Jacobian approximation), or twice that number (if the center difference is used), is used to reduce the calibration time from days and weeks to a few hours for the two applications. This approach can be extended to global optimization schemes and Monte Carlo analysis, where thousands of compute nodes can be efficiently utilized.
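As a rough consistency check on the figures in this abstract, Amdahl's law can be applied to the first application: if 97% of the runtime is parallelizable and the remainder stays serial, the ideal speedup on 16 cores is about 11, in line with the roughly tenfold reduction reported. The sketch below is illustrative only; the function name is hypothetical, and the abstract does not say that the authors used this model.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided across the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# First application in the abstract: one loop takes over 97% of the time, 16 cores.
print(round(amdahl_speedup(0.97, 16), 2))  # 11.03
```

The same formula gives an upper bound of about 13.9 for a 99% parallel fraction on 16 cores, so the reported speedup of about 10 for the field scale application leaves room for overheads that this simple model ignores.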
Towards European-scale convection-resolving climate simulations with GPUs: a study with COSMO 4.19

NASA Astrophysics Data System (ADS)

Leutwyler, David; Fuhrer, Oliver; Lapillonne, Xavier; Lüthi, Daniel; Schär, Christoph

2016-09-01

The representation of moist convection in climate models represents a major challenge, due to the small scales involved. Using horizontal grid spacings of O(1 km), convection-resolving weather and climate models allow one to explicitly resolve deep convection. However, due to their extremely demanding computational requirements, they have so far been limited to short simulations and/or small computational domains. Innovations in supercomputing have led to new hybrid node designs, mixing conventional multi-core hardware and accelerators such as graphics processing units (GPUs). One of the first atmospheric models that has been fully ported to these architectures is the COSMO (Consortium for Small-scale Modeling) model. Here we present the convection-resolving COSMO model on continental scales using a version of the model capable of using GPU accelerators. The verification of a week-long simulation containing winter storm Kyrill shows that, for this case, convection-parameterizing simulations and convection-resolving simulations agree well. Furthermore, we demonstrate the applicability of the approach to longer simulations by conducting a 3-month-long simulation of the summer season 2006. Its results corroborate the findings obtained on smaller domains, such as a more credible representation of the diurnal cycle of precipitation in convection-resolving models and a tendency to produce more intensive hourly precipitation events. Both simulations also show how the approach allows for the representation of interactions between synoptic-scale and meso-scale atmospheric circulations at scales ranging from 1000 to 10 km. This includes the formation of sharp cold frontal structures, convection embedded in fronts and small eddies, or the formation and organization of propagating cold pools. Finally, we assess the performance gain from using heterogeneous hardware equipped with GPUs relative to multi-core hardware. With the COSMO model, we now use a weather and climate model that has all the necessary modules required for real-case convection-resolving regional climate simulations on GPUs.
LHCb detector and trigger performance in Run II

NASA Astrophysics Data System (ADS)

Dordei, Francesca

2017-12-01

The LHCb detector is a forward spectrometer at the LHC, designed to perform high precision studies of b- and c-hadrons. In Run II of the LHC, a new scheme for the software trigger at LHCb allows splitting the triggering of events into two stages, giving room to perform the alignment and calibration in real time. In the novel detector alignment and calibration strategy for Run II, data collected at the start of the fill are processed in a few minutes and used to update the alignment, while the calibration constants are evaluated for each run. This allows identical constants to be used in the online and offline reconstruction, thus improving the correlation between triggered and offline selected events. The required computing time constraints are met thanks to a new dedicated framework using the multi-core farm infrastructure for the trigger. The larger timing budget available in the trigger allows the same track reconstruction to be performed online and offline. This enables LHCb to achieve the best reconstruction performance already in the trigger, and allows physics analyses to be performed directly on the data produced by the trigger reconstruction. The novel real-time processing strategy at LHCb is discussed from both the technical and operational points of view. The overall performance of the LHCb detector on the data of Run II is presented as well.
Fabrication of mitigation pits for improving laser damage resistance in dielectric mirrors by femtosecond laser machining

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wolfe, Justin E.; Qiu, S. Roger; Stolz, Christopher J.

2011-03-20

Femtosecond laser machining is used to create mitigation pits to stabilize nanosecond laser-induced damage in multilayer dielectric mirror coatings on BK7 substrates. In this paper, we characterize features and the artifacts associated with mitigation pits and further investigate the impact of pulse energy and pulse duration on pit quality and damage resistance. Our results show that these mitigation features can double the fluence-handling capability of large-aperture optical multilayer mirror coatings and further demonstrate that femtosecond laser macromachining is a promising means for fabricating mitigation geometry in multilayer coatings to increase mirror performance under high-power laser irradiation.
300 Gb/s IM/DD based SDM-WDM-PON with laserless ONUs.

PubMed

Bao, Fangdi; Morioka, Toshio; Oxenløwe, Leif K; Hu, Hao

2018-04-02

A low-cost, high-speed SDM-WDM-PON architecture is proposed by using a multi-core fiber (MCF) and intensity modulation/direct detection (IM/DD). One of the MCF cores is used for sending laser sources from the optical line terminal (OLT) to the optical network unit (ONU), thus facilitating laserless and colorless ONUs, and providing ease of network management and maintenance. In addition, the wavelengths of the ONUs are controlled on the OLT side, which also enables flexible optical networks. Thanks to the low inter-core crosstalk of an MCF, downstream (DS) and upstream (US) signals are transmitted independently in different cores of the MCF, not only increasing the aggregated capacity but also avoiding the Rayleigh backscattering noise. Finally, a proof-of-principle experiment is performed by using a 7-core fiber, achieving 300/120 Gb/s aggregated capacity for DS and US (3 cores × 4 wavelengths × 25/10 Gb/s per wavelength), respectively.
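The aggregated capacities quoted above follow directly from multiplying cores, wavelengths, and the per-wavelength rate. The short sketch below reproduces that arithmetic; the function name is hypothetical. Three downstream cores, three upstream cores, and one core for laser delivery would account for all seven cores, though the abstract does not spell out the allocation.

```python
def aggregate_capacity_gbps(cores: int, wavelengths: int, rate_gbps: int) -> int:
    """Aggregated capacity of one direction: cores x wavelengths x per-wavelength rate."""
    return cores * wavelengths * rate_gbps


downstream = aggregate_capacity_gbps(cores=3, wavelengths=4, rate_gbps=25)
upstream = aggregate_capacity_gbps(cores=3, wavelengths=4, rate_gbps=10)
print(downstream, upstream)  # 300 120
```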
Numerical modelling of powder caking at REV scale by using DEM

NASA Astrophysics Data System (ADS)

Guessasma, Mohamed; Silva Tavares, Homayra; Afrassiabian, Zahra; Saleh, Khashayar

2017-06-01

This work deals with the numerical simulation of the powder caking process caused by the capillary condensation phenomenon. Caking consists of the unwanted agglomeration of powder particles. This process is often irreversible and not easy to predict. To reproduce the mechanisms involved in the caking phenomenon, we have used the Discrete Element Method (DEM). In the present work, we mainly focus on the role of capillary condensation and subsequent liquid bridge formation within a granular medium exposed to fluctuations of ambient relative humidity. Such bridges cause an attractive force between particles, leading to the formation of a cake with intrinsic physicochemical and mechanical properties. By considering a Representative Elementary Volume (REV), the DEM is then performed by means of the MULTICOR-3D software, taking into account the properties of the cake (degree of saturation) in order to establish relationships between the microscopic parameters and the macroscopic behaviour (tensile strength).
An acquisition system for CMOS imagers with a genuine 10 Gbit/s bandwidth

NASA Astrophysics Data System (ADS)

Guérin, C.; Mahroug, J.; Tromeur, W.; Houles, J.; Calabria, P.; Barbier, R.

2012-12-01

This paper presents a high data throughput acquisition system for pixel detector readout such as CMOS imagers. This CMOS acquisition board offers a genuine 10 Gbit/s bandwidth to the workstation and can provide an on-line and continuous high frame rate imaging capability. On-line processing can be implemented either on the Data Acquisition Board or on the multi-core workstation depending on the complexity of the algorithms. The different parts composing the acquisition board have been designed to be used first with a single-photon detector called LUSIPHER (800×800 pixels), developed in our laboratory for scientific applications ranging from nano-photonics to adaptive optics. The architecture of the acquisition board is presented and the performance achieved by the produced boards is described. The future developments (hardware and software) concerning the on-line implementation of algorithms dedicated to single-photon imaging are also addressed.
Cache Locality Optimization for Recursive Programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lifflander, Jonathan; Krishnamoorthy, Sriram

We present an approach to optimize the cache locality for recursive programs by dynamically splicing--recursively interleaving--the execution of distinct function invocations. By utilizing data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work in the context of a random work stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.
Exploring Manycore Multinode Systems for Irregular Applications with FPGA Prototyping

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ceriani, Marco; Palermo, Gianluca; Secchi, Simone

We present a prototype of a multi-core architecture implemented on FPGA, designed to enable efficient execution of irregular applications on distributed shared memory machines, while maintaining high performance on regular workloads. The architecture is composed of off-the-shelf soft cores, local interconnection and memory interface, integrated with custom components that optimize it for irregular applications. It relies on three key elements: a global address space, multithreading, and fine-grained synchronization. Global addresses are scrambled to reduce the formation of network hot-spots, while the latency of the transactions is covered by integrating a hardware scheduler within the custom load/store buffers to take advantage of the availability of multiple execution threads, increasing the efficiency in a way that is transparent to the application. We evaluated a dual-node system on irregular kernels, showing scalability in the number of cores and threads.
QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids

NASA Astrophysics Data System (ADS)

Kim, Jeongnim; Baczewski, Andrew D.; Beaudet, Todd D.; Benali, Anouar; Chandler Bennett, M.; Berrill, Mark A.; Blunt, Nick S.; Josué Landinez Borda, Edgar; Casula, Michele; Ceperley, David M.; Chiesa, Simone; Clark, Bryan K.; Clay, Raymond C., III; Delaney, Kris T.; Dewing, Mark; Esler, Kenneth P.; Hao, Hongxia; Heinonen, Olle; Kent, Paul R. C.; Krogel, Jaron T.; Kylänpää, Ilkka; Li, Ying Wai; Lopez, M. Graham; Luo, Ye; Malone, Fionn D.; Martin, Richard M.; Mathuriya, Amrita; McMinis, Jeremy; Melton, Cody A.; Mitas, Lubos; Morales, Miguel A.; Neuscamman, Eric; Parker, William D.; Pineda Flores, Sergio D.; Romero, Nichols A.; Rubenstein, Brenda M.; Shea, Jacqueline A. R.; Shin, Hyeondeok; Shulenburger, Luke; Tillack, Andreas F.; Townsend, Joshua P.; Tubman, Norm M.; Van Der Goetz, Brett; Vincent, Jordan E.; ChangMo Yang, D.; Yang, Yubo; Zhang, Shuai; Zhao, Luning

2018-05-01

QMCPACK is an open source quantum Monte Carlo package for ab initio electronic structure calculations. It supports calculations of metallic and insulating solids, molecules, atoms, and some model Hamiltonians. Implemented real space quantum Monte Carlo algorithms include variational, diffusion, and reptation Monte Carlo. QMCPACK uses Slater–Jastrow type trial wavefunctions in conjunction with a sophisticated optimizer capable of optimizing tens of thousands of parameters. The orbital space auxiliary-field quantum Monte Carlo method is also implemented, enabling cross validation between different highly accurate methods. The code is specifically optimized for calculations with large numbers of electrons on the latest high performance computing architectures, including multicore central processing unit and graphical processing unit systems. We detail the program’s capabilities, outline its structure, and give examples of its use in current research calculations. The package is available at http://qmcpack.org.
Diffractive optics for combined spatial- and mode- division demultiplexing of optical vortices: design, fabrication and optical characterization.

PubMed

Ruffato, Gianluca; Massari, Michele; Romanato, Filippo

2016-04-20

During the last decade, the orbital angular momentum (OAM) of light has attracted growing interest as a new degree of freedom for signal channel multiplexing in order to increase the information transmission capacity in today's optical networks. Here we present the design, fabrication and characterization of phase-only diffractive optical elements (DOE) performing mode-division (de)multiplexing (MDM) and spatial-division (de)multiplexing (SDM) at the same time. Samples have been fabricated with high-resolution electron-beam lithography patterning a polymethylmethacrylate (PMMA) resist layer spun over a glass substrate. Different DOE designs are presented for the sorting of optical vortices differing in either OAM content or beam size in the optical regime, with different steering geometries in far-field. These novel DOE designs appear promising for telecom applications both in free-space and in multi-core fiber propagation.
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB

NASA Technical Reports Server (NTRS)

Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.

2017-01-01

Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
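The Jacobian is called embarrassingly parallel because each column of a finite-difference Jacobian depends only on one perturbed evaluation of the residual, so the columns can be computed independently. The sketch below illustrates that structure in Python rather than MATLAB; it is not the authors' spmd implementation, and the residual function, step size, and helper names are hypothetical. A thread pool is used only to show the independent-column pattern; in CPython, a real speedup for CPU-bound code would generally need separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def residual(x):
    """Hypothetical two-equation system F(x) = 0, used only for illustration."""
    return [x[0] ** 2 + x[1] - 3.0, x[0] - x[1] ** 2 + 1.0]


def jacobian_column(f, x, j, h=1e-6):
    """Forward-difference estimate of column j: (F(x + h*e_j) - F(x)) / h."""
    base = f(x)
    shifted = list(x)
    shifted[j] += h
    bumped = f(shifted)
    return [(b - a) / h for a, b in zip(base, bumped)]


def parallel_jacobian(f, x, workers=4):
    """Compute every column independently, then transpose into a row-major matrix."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        columns = list(pool.map(lambda j: jacobian_column(f, x, j), range(len(x))))
    return [list(row) for row in zip(*columns)]


# The analytic Jacobian at (1, 2) is [[2, 1], [1, -4]]; the estimate should be close.
J = parallel_jacobian(residual, [1.0, 2.0])
print(J)
```

Each worker here re-evaluates the unperturbed residual, which is wasteful but keeps the columns fully independent; sharing the base evaluation is an obvious refinement.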
Magnetism and Mössbauer study of formation of multi-core γ -Fe2O3 nanoparticles

NASA Astrophysics Data System (ADS)

Kamali, Saeed; Bringas, Eugenio; Hah, Hien-Yoong; Bates, Brian; Johnson, Jacqueline A.; Johnson, Charles E.; Stroeve, Pieter

2018-04-01

A systematic investigation of magnetic nanoparticles and the formation of a core-shell structure, consisting of multiple maghemite (γ -Fe2O3) nanoparticles as the core and silica as the shell, has been performed using various techniques. High-resolution transmission electron microscopy clearly shows isolated maghemite nanoparticles with an average diameter of 13 nm and the formation of a core-shell structure. Low temperature Mössbauer spectroscopy reveals the presence of pure maghemite nanoparticles with all vacancies at the B-sites. Isothermal magnetization and zero-field-cooled and field-cooled measurements are used for investigating the magnetic properties of the nanoparticles. The magnetization results are in good accordance with the contents of the magnetic core and the non-magnetic shell. The multiple-core γ -Fe2O3 nanoparticles show similar behavior to isolated particles of the same size.
QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids.

PubMed

Kim, Jeongnim; Baczewski, Andrew T; Beaudet, Todd D; Benali, Anouar; Bennett, M Chandler; Berrill, Mark A; Blunt, Nick S; Borda, Edgar Josué Landinez; Casula, Michele; Ceperley, David M; Chiesa, Simone; Clark, Bryan K; Clay, Raymond C; Delaney, Kris T; Dewing, Mark; Esler, Kenneth P; Hao, Hongxia; Heinonen, Olle; Kent, Paul R C; Krogel, Jaron T; Kylänpää, Ilkka; Li, Ying Wai; Lopez, M Graham; Luo, Ye; Malone, Fionn D; Martin, Richard M; Mathuriya, Amrita; McMinis, Jeremy; Melton, Cody A; Mitas, Lubos; Morales, Miguel A; Neuscamman, Eric; Parker, William D; Pineda Flores, Sergio D; Romero, Nichols A; Rubenstein, Brenda M; Shea, Jacqueline A R; Shin, Hyeondeok; Shulenburger, Luke; Tillack, Andreas F; Townsend, Joshua P; Tubman, Norm M; Van Der Goetz, Brett; Vincent, Jordan E; Yang, D ChangMo; Yang, Yubo; Zhang, Shuai; Zhao, Luning

2018-05-16

QMCPACK is an open source quantum Monte Carlo package for ab initio electronic structure calculations. It supports calculations of metallic and insulating solids, molecules, atoms, and some model Hamiltonians. Implemented real space quantum Monte Carlo algorithms include variational, diffusion, and reptation Monte Carlo. QMCPACK uses Slater-Jastrow type trial wavefunctions in conjunction with a sophisticated optimizer capable of optimizing tens of thousands of parameters. The orbital space auxiliary-field quantum Monte Carlo method is also implemented, enabling cross validation between different highly accurate methods. The code is specifically optimized for calculations with large numbers of electrons on the latest high performance computing architectures, including multicore central processing unit and graphical processing unit systems. We detail the program's capabilities, outline its structure, and give examples of its use in current research calculations. The package is available at http://qmcpack.org.

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

DOE PAGES

Song, Fengguang; Dongarra, Jack

2014-10-01

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
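To make the idea of a tile algorithm expressed as a set of tasks more concrete, the sketch below enumerates the kernel calls of a standard right-looking tile Cholesky factorization on a square grid of tiles. This is the conventional homogeneous formulation, not the heterogeneous variant or the runtime system described in the abstract, and the function name is hypothetical. A dynamic scheduler would run these tasks as soon as the tiles they read have been produced.

```python
from collections import Counter


def tile_cholesky_tasks(nt):
    """List the kernel calls of a right-looking tile Cholesky on an nt x nt tile grid.

    Each task is (kernel, row, col), naming the tile that the kernel writes.
    """
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k, k))          # factor the diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k))       # triangular solve for a panel tile
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, i))       # update a trailing diagonal tile
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j))   # update a trailing off-diagonal tile
    return tasks


counts = Counter(kernel for kernel, _, _ in tile_cholesky_tasks(4))
print(dict(counts))  # {'POTRF': 4, 'TRSM': 6, 'SYRK': 6, 'GEMM': 4}
```

The GEMM count grows cubically with the number of tile rows, which is why most of the available parallelism, and most of the work a scheduler can hand to accelerators, lies in the trailing-matrix updates.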
THC-MP: High performance numerical simulation of reactive transport and multiphase flow in porous media

NASA Astrophysics Data System (ADS)

Wei, Xiaohui; Li, Weishan; Tian, Hailong; Li, Hongliang; Xu, Haixiao; Xu, Tianfu

2015-07-01

The numerical simulation of multiphase flow and reactive transport in porous media for complex subsurface problems is a computationally intensive application. To meet the increasing computational requirements, this paper presents a parallel computing method and architecture. Building on TOUGHREACT, a well-established code for simulating subsurface multi-phase flow and reactive transport problems, we developed THC-MP, a high performance computing code for massively parallel computers that greatly extends the computational capability of the original code. The domain decomposition method was applied to the coupled numerical computing procedure in THC-MP. We designed the distributed data structure and implemented the data initialization and exchange between the computing nodes, as well as the core solving module using a hybrid parallel iterative and direct solver. The numerical accuracy of THC-MP was verified on a CO2 injection-induced reactive transport problem by comparing the results obtained from parallel computing and sequential computing (original code). Execution efficiency and code scalability were examined through field scale carbon sequestration applications on a multicore cluster. The results demonstrate the enhanced performance of THC-MP on parallel computing facilities.
Parallel peak pruning for scalable SMP contour tree computation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.

As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. In this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.
Achieving energy efficiency during collective communications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sundriyal, Vaibhav; Sosonkina, Masha; Zhang, Zhao

2012-09-13

Energy consumption has become a major design constraint in modern computing systems. With the advent of petaflops architectures, power-efficient software stacks have become imperative for scalability. Techniques such as dynamic voltage and frequency scaling (called DVFS) and CPU clock modulation (called throttling) are often used to reduce the power consumption of the compute nodes. To avoid significant performance losses, these techniques should be used judiciously during parallel application execution. For example, its communication phases may be good candidates to apply the DVFS and CPU throttling without incurring a considerable performance loss. They are often considered as indivisible operations although little attention is being devoted to the energy saving potential of their algorithmic steps. In this work, two important collective communication operations, all-to-all and allgather, are investigated as to their augmentation with energy saving strategies on the per-call basis. The experiments prove the viability of such a fine-grain approach. They also validate a theoretical power consumption estimate for multicore nodes proposed here. While keeping the performance loss low, the obtained energy savings were always significantly higher than those achieved when DVFS or throttling were switched on across the entire application run.
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

PubMed

Cheung, Kit; Schultz, Simon R; Luk, Wayne

2015-01-01

NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
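For readers unfamiliar with the neuron models named above, the sketch below simulates a single Izhikevich neuron with forward Euler steps in plain Python. It uses the published model equations with the usual regular-spiking parameter set; it says nothing about how NeuroFlow or PyNN implement the model, and the function name, time step, and input currents are assumptions chosen for illustration.

```python
def simulate_izhikevich(current, duration_ms=1000.0, dt=0.5,
                        a=0.02, b=0.2, c=-65.0, d=8.0):
    """Return spike times (ms) of one Izhikevich neuron driven by a constant current.

    Model: v' = 0.04 v^2 + 5 v + 140 - u + I,  u' = a (b v - u),
    with reset v <- c, u <- u + d whenever v reaches 30 mV.
    """
    v = c          # membrane potential (mV), started at the reset value
    u = b * v      # recovery variable, started on its nullcline
    spike_times = []
    for step in range(int(duration_ms / dt)):
        if v >= 30.0:
            # Spike detected: record it and apply the after-spike reset.
            spike_times.append(step * dt)
            v = c
            u += d
        # Forward Euler update; u is advanced using the freshly updated v.
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
        u += dt * a * (b * v - u)
    return spike_times


spikes = simulate_izhikevich(current=10.0)
print(len(spikes), "spikes in 1 s of simulated time")
```

A simulator for hundreds of thousands of neurons applies the same per-neuron update across the whole population at every time step, which is the part that maps naturally onto parallel hardware.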
End-To-End performance test of the LINC-NIRVANA Wavefront-Sensor system.

NASA Astrophysics Data System (ADS)

Berwein, Juergen; Bertram, Thomas; Conrad, Al; Briegel, Florian; Kittmann, Frank; Zhang, Xiangyu; Mohr, Lars

2011-09-01

LINC-NIRVANA is an imaging Fizeau interferometer for near-infrared wavelengths, being built for the Large Binocular Telescope. Multi-conjugate adaptive optics (MCAO) increases the sky coverage and the field of view over which diffraction-limited images can be obtained. For its MCAO implementation, LINC-NIRVANA uses a total of four wavefront sensors; each of the two beams is corrected by both a ground-layer wavefront sensor (GWS) and a high-layer wavefront sensor (HWS). The GWS controls the adaptive secondary deformable mirror (DM), which is based on a DSP slope computing unit, whereas the HWS controls an internal DM via computations provided by an off-the-shelf multi-core Linux system. Using wavefront sensor data collected from a prior lab experiment, we have shown via simulation that the Linux-based system is sufficient to operate at 1 kHz, with jitter well below the needs of the final system. Based on that setup, we tested the end-to-end performance and latency through all parts of the system, which includes the camera, the wavefront controller, and the deformable mirror. We will present our loop control structure and the results of those performance tests.
A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Song, Fengguang; Dongarra, Jack

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.
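To make the idea of a tile algorithm concrete, the following is a minimal serial sketch of a tiled Cholesky factorization using NumPy. It is not the authors' heterogeneous algorithm or runtime: the tiles are all the same size, nothing is scheduled dynamically, and the function name `tiled_cholesky` is hypothetical. It only shows the three kinds of tile tasks (factor a diagonal tile, solve against it, update trailing tiles) whose dependencies a task-graph runtime would track.

```python
import numpy as np


def tiled_cholesky(a, tile):
    """Return lower-triangular L with L @ L.T == a, computed tile by tile.

    a:    symmetric positive definite matrix (n x n)
    tile: tile size; n must be a multiple of it in this sketch
    """
    n = a.shape[0]
    if a.shape != (n, n) or n % tile != 0:
        raise ValueError("need a square matrix whose size is a multiple of tile")
    work = np.array(a, dtype=float)  # private copy, updated in place
    nt = n // tile

    def blk(i, j):
        # Basic slicing returns a view, so writes land in `work`.
        return work[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]

    for k in range(nt):
        # Task 1: factor the diagonal tile.
        dkk = blk(k, k)
        dkk[:] = np.linalg.cholesky(dkk)
        # Task 2: triangular solve for each tile below the diagonal,
        # i.e. find X with X @ dkk.T == A_ik.
        for i in range(k + 1, nt):
            bik = blk(i, k)
            bik[:] = np.linalg.solve(dkk, bik.T).T
        # Task 3: update the trailing tiles in the lower triangle.
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                bij = blk(i, j)
                bij -= blk(i, k) @ blk(j, k).T
    return np.tril(work)  # discard the untouched upper tiles
```

Within one step `k`, the solves in task 2 are independent of each other, as are the updates in task 3, which is the parallelism a tile-based scheduler can exploit.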
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors

PubMed Central

Cheung, Kit; Schultz, Simon R.; Luk, Wayne

2016-01-01

NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation. PMID:26834542
Mitigating Decision-Making Paralysis During Catastrophic Disasters

DTIC Science & Technology

2011-03-01

COVERED Master’s Thesis 4. TITLE AND SUBTITLE Mitigating Decision-Making Paralysis During Catastrophic Disasters 6. AUTHOR(S) Terrence J. Winters 5...FUNDING NUMBERS 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval Postgraduate School Monterey, CA 93943-5000 8. PERFORMING ORGANIZATION...REPORT NUMBER 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) N/A 10. SPONSORING/MONITORING AGENCY REPORT NUMBER 11
Evaluation of Low-Cost Mitigation Measures Implemented to Improve Air Quality in Nursery and Primary Schools

PubMed Central

Sá, Juliana P.; Branco, Pedro T. B. S.; Alvim-Ferraz, Maria C. M.; Martins, Fernando G.; Sousa, Sofia I. V.

2017-01-01

Indoor air pollution mitigation measures are highly important due to the associated health impacts, especially on children, a risk group that spends significant time indoors. Thus, the main goal of the work reported here was the evaluation of mitigation measures implemented in nursery and primary schools to improve air quality. Continuous measurements of CO2, CO, NO2, O3, CH2O, total volatile organic compounds (VOC), PM1, PM2.5, PM10, Total Suspended Particles (TSP) and radon, as well as temperature and relative humidity, were performed in two campaigns, before and after the implementation of low-cost mitigation measures. Those mitigation measures were evaluated by comparing the concentrations measured in both campaigns. Exceedances of the values set by the national legislation and the World Health Organization (WHO) were found for PM2.5, PM10, CO2 and CH2O during both indoor air quality campaigns. Temperature and relative humidity values were also above the ranges recommended by the American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE). In general, pollutant concentrations measured after the implementation of low-cost mitigation measures were significantly lower, mainly for CO2. However, mitigation measures were not always sufficient to decrease the pollutants’ concentrations to values considered safe to protect human health. PMID:28561795
On Parallel Push-Relabel based Algorithms for Bipartite Maximum Matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Langguth, Johannes; Azad, Md Ariful; Halappanavar, Mahantesh

2014-07-01

We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial (graph) problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing maximum transversal of a matrix. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a test set comprising a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for the parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.
An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

DOE PAGES

Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois-Henry; ...

2016-10-27

Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
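The role of randomized sampling in compressing a low-rank off-diagonal block can be sketched in a few lines of NumPy. This is not STRUMPACK code and it does not build an HSS matrix: it swaps the interpolative decomposition mentioned in the abstract for a simpler QR-based projection, and the function name, oversampling default, and seed are assumptions made for illustration.

```python
import numpy as np


def randomized_low_rank(block, rank, oversample=5, seed=0):
    """Approximate `block` as q @ b using a random sketch of its column space.

    block:      dense matrix (m x n), assumed to be numerically low rank
    rank:       target rank of the approximation
    oversample: extra random samples that make the sketch more reliable
    Returns (q, b) with q of shape (m, rank + oversample), orthonormal columns.
    """
    rng = np.random.default_rng(seed)
    n = block.shape[1]
    # Sample the column space by multiplying with a random Gaussian matrix.
    omega = rng.standard_normal((n, rank + oversample))
    sample = block @ omega
    # Orthonormal basis for the sampled column space.
    q, _ = np.linalg.qr(sample)
    # Project the block onto that basis.
    b = q.T @ block
    return q, b
```

The appeal of this approach is that the block is touched only through matrix products, so the cost depends on the number of samples rather than on a full decomposition of the block.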
Verification of Electromagnetic Physics Models for Parallel Computing Architectures in the GeantV Project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Amadio, G.; et al.

An intensive R&D and programming effort is required to meet the new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting the latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting particles in parallel through complex geometries, exploiting instruction-level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physics models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.
Scalability Issues for Remote Sensing Infrastructure: A Case Study.

PubMed

Liu, Yang; Picard, Sean; Williamson, Carey

2017-04-29

For the past decade, a team of University of Calgary researchers has operated a large "sensor Web" to collect, analyze, and share scientific data from remote measurement instruments across northern Canada. This sensor Web receives real-time data streams from over a thousand Internet-connected sensors, with a particular emphasis on environmental data (e.g., space weather, auroral phenomena, atmospheric imaging). Through research collaborations, we had the opportunity to evaluate the performance and scalability of their remote sensing infrastructure. This article reports the lessons learned from our study, which considered both data collection and data dissemination aspects of their system. On the data collection front, we used benchmarking techniques to identify and fix a performance bottleneck in the system's memory management for TCP data streams, while also improving system efficiency on multi-core architectures. On the data dissemination front, we used passive and active network traffic measurements to identify and reduce excessive network traffic from the Web robots and JavaScript techniques used for data sharing. While our results are from one specific sensor Web system, the lessons learned may apply to other scientific Web sites with remote sensing infrastructure.
Scaling Support Vector Machines On Modern HPC Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2015-02-01

We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.
MIT Lincoln Laboratory Takes the Mystery Out of Supercomputing

DTIC Science & Technology

2017-01-18

analysis, designing sensors, and developing algorithms. In 2008, the Lincoln demonstrated the largest single problem ever run on a computer using ... computation. As we design and prototype these devices, the use of leading-edge engineering practices have become the de facto standard. This includes...MIT Lincoln Laboratory Takes the Mystery Out of Supercomputing By Dr. Jeremy Kepner 1 The introduction of multicore and manycore processors
Photogrammetric Verification of Fiber Optic Shape Sensors on Flexible Aerospace Structures

NASA Technical Reports Server (NTRS)

Moore, Jason P.; Rogge, Matthew D.; Jones, Thomas W.

2012-01-01

Multi-core fiber (MCF) optic shape sensing offers the possibility of providing in-flight shape measurements of highly flexible aerospace structures and control surfaces for such purposes as gust load alleviation, flutter suppression, general flight control and structural health monitoring. Photogrammetric measurements of surface mounted MCF shape sensing cable can be used to quantify the MCF installation path and verify measurement methods.
Integrated Optoelectronic Networks for Application-Driven Multicore Computing

DTIC Science & Technology

2017-05-08

hybrid photonic torus, the all-optical Corona crossbar, and the hybrid hierarchical Firefly crossbar. • The key challenges for waveguide photonics...improves SXR but with relatively higher EDP overhead. Our evaluation results indicate that the encoding schemes improve worst-case-SXR in Corona and...photonic crossbar architectures (Corona and Firefly) indicate that our approach improves worst-case signal-to-noise ratio (SNR) by up to 51.7
FOS: A Factored Operating Systems for High Assurance and Scalability on Multicores

DTIC Science & Technology

2012-08-01

computing. It builds on previous work in distributed and microkernel OSes by factoring services out of the kernel, and then further distributing each...2 3.0 Methods, Assumptions, and Procedures (System Design) ... 4 3.1 Microkernel ...cooperating servers. We term such a service a fleet. Figure 2 shows the high-level architecture of fos. A small microkernel runs on every core
Software defined multi-spectral imaging for Arctic sensor networks

NASA Astrophysics Data System (ADS)

Siewert, Sam; Angoth, Vivek; Krishnamurthy, Ramnarayan; Mani, Karthikeyan; Mock, Kenrick; Singh, Surjith B.; Srivistava, Saurav; Wagner, Chris; Claus, Ryan; Vis, Matthew Demi

2016-05-01

Availability of off-the-shelf infrared sensors combined with high definition visible cameras has made possible the construction of a Software Defined Multi-Spectral Imager (SDMSI) combining long-wave, near-infrared and visible imaging. The SDMSI requires a real-time embedded processor to fuse images and to create real-time depth maps for opportunistic uplink in sensor networks. Researchers at Embry Riddle Aeronautical University working with University of Alaska Anchorage at the Arctic Domain Awareness Center and the University of Colorado Boulder have built several versions of a low-cost drop-in-place SDMSI to test alternatives for power efficient image fusion. The SDMSI is intended for use in field applications including marine security, search and rescue operations and environmental surveys in the Arctic region. Based on Arctic marine sensor network mission goals, the team has designed the SDMSI to include features to rank images based on saliency and to provide on camera fusion and depth mapping. A major challenge has been the design of the camera computing system to operate within a 10 to 20 Watt power budget. This paper presents a power analysis of three options: 1) multi-core, 2) field programmable gate array with multi-core, and 3) graphics processing units with multi-core. For each test, power consumed for common fusion workloads has been measured at a range of frame rates and resolutions. Detailed analyses from our power efficiency comparison for workloads specific to stereo depth mapping and sensor fusion are summarized. Preliminary mission feasibility results from testing with off-the-shelf long-wave infrared and visible cameras in Alaska and Arizona are also summarized to demonstrate the value of the SDMSI for applications such as ice tracking, ocean color, soil moisture, animal and marine vessel detection and tracking. 
The goal is to select the most power efficient solution for the SDMSI for use on UAVs (Unoccupied Aerial Vehicles) and other drop-in-place installations in the Arctic. The prototype selected will be field tested in Alaska in the summer of 2016.
