A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
NASA Technical Reports Server (NTRS)
Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David
1987-01-01
The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.
A parallel variable metric optimization algorithm
NASA Technical Reports Server (NTRS)
Straeter, T. A.
1973-01-01
An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
Design of on-board parallel computer on nano-satellite
NASA Astrophysics Data System (ADS)
You, Zheng; Tian, Hexiang; Yu, Shijie; Meng, Li
2007-11-01
This paper provides one scheme of the on-board parallel computer system designed for the Nano-satellite. Based on the development request that the Nano-satellite should have a small volume, low weight, low power cost, and intelligence, this scheme gets rid of the traditional one-computer system and dual-computer system with endeavor to improve the dependability, capability and intelligence simultaneously. According to the method of integration design, it employs the parallel computer system with shared memory as the main structure, connects the telemetric system, attitude control system, and the payload system by the intelligent bus, designs the management which can deal with the static tasks and dynamic task-scheduling, protect and recover the on-site status and so forth in light of the parallel algorithms, and establishes the fault diagnosis, restoration and system restructure mechanism. It accomplishes an on-board parallel computer system with high dependability, capability and intelligence, a flexible management on hardware resources, an excellent software system, and a high ability in extension, which satisfies with the conception and the tendency of the integration electronic design sufficiently.
Computing NLTE Opacities -- Node Level Parallel Calculation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holladay, Daniel
Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Earth Sciences Division; Zhang, Keni; Zhang, Keni
TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one, two, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator ismore » to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on the TOUGH2 Version 1.4 with EOS3, EOS9, and T2R3D modules, a software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4. This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version TOUGH2 code, The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical methods used. To familiarize users with the parallel code, illustrative sample problems are presented.« less
Parallel computations and control of adaptive structures
NASA Technical Reports Server (NTRS)
Park, K. C.; Alvin, Kenneth F.; Belvin, W. Keith; Chong, K. P. (Editor); Liu, S. C. (Editor); Li, J. C. (Editor)
1991-01-01
The equations of motion for structures with adaptive elements for vibration control are presented for parallel computations to be used as a software package for real-time control of flexible space structures. A brief introduction of the state-of-the-art parallel computational capability is also presented. Time marching strategies are developed for an effective use of massive parallel mapping, partitioning, and the necessary arithmetic operations. An example is offered for the simulation of control-structure interaction on a parallel computer and the impact of the approach presented for applications in other disciplines than aerospace industry is assessed.
Xyce parallel electronic simulator : users' guide.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.
2011-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-artmore » algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique electrical simulation capability, designed to meet the unique needs of the laboratory.« less
Parallel computing in genomic research: advances and applications
Ocaña, Kary; de Oliveira, Daniel
2015-01-01
Today’s genomic experiments have to process the so-called “biological big data” that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities. PMID:26604801
Parallel computing in genomic research: advances and applications.
Ocaña, Kary; de Oliveira, Daniel
2015-01-01
Today's genomic experiments have to process the so-called "biological big data" that is now reaching the size of Terabytes and Petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied for reducing the total processing time and to ease the management, treatment, and analyses of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing unit requires the expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists for processing their genomic experiments using HPC capabilities and parallelism techniques. This article brings a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities.
An FPGA-based High Speed Parallel Signal Processing System for Adaptive Optics Testbed
NASA Astrophysics Data System (ADS)
Kim, H.; Choi, Y.; Yang, Y.
In this paper a state-of-the-art FPGA (Field Programmable Gate Array) based high speed parallel signal processing system (SPS) for adaptive optics (AO) testbed with 1 kHz wavefront error (WFE) correction frequency is reported. The AO system consists of Shack-Hartmann sensor (SHS) and deformable mirror (DM), tip-tilt sensor (TTS), tip-tilt mirror (TTM) and an FPGA-based high performance SPS to correct wavefront aberrations. The SHS is composed of 400 subapertures and the DM 277 actuators with Fried geometry, requiring high speed parallel computing capability SPS. In this study, the target WFE correction speed is 1 kHz; therefore, it requires massive parallel computing capabilities as well as strict hard real time constraints on measurements from sensors, matrix computation latency for correction algorithms, and output of control signals for actuators. In order to meet them, an FPGA based real-time SPS with parallel computing capabilities is proposed. In particular, the SPS is made up of a National Instrument's (NI's) real time computer and five FPGA boards based on state-of-the-art Xilinx Kintex 7 FPGA. Programming is done with NI's LabView environment, providing flexibility when applying different algorithms for WFE correction. It also facilitates faster programming and debugging environment as compared to conventional ones. One of the five FPGA's is assigned to measure TTS and calculate control signals for TTM, while the rest four are used to receive SHS signal, calculate slops for each subaperture and correction signal for DM. With this parallel processing capabilities of the SPS the overall closed-loop WFE correction speed of 1 kHz has been achieved. System requirements, architecture and implementation issues are described; furthermore, experimental results are also given.
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1993-01-01
This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
NASA Astrophysics Data System (ADS)
Ford, Eric B.; Dindar, Saleh; Peters, Jorg
2015-08-01
The realism of astrophysical simulations and statistical analyses of astronomical data are set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures, requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network network characteristics.I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than order of magnitude speed-ups and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs.I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer school on Bayesian Computing for Astronomical Data Analysis with support of the Penn State Center for Astrostatistics and Institute for CyberScience.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Langer, S; Rotman, D; Schwegler, E
The Institutional Computing Executive Group (ICEG) review of FY05-06 Multiprogrammatic and Institutional Computing (M and IC) activities is presented in the attached report. In summary, we find that the M and IC staff does an outstanding job of acquiring and supporting a wide range of institutional computing resources to meet the programmatic and scientific goals of LLNL. The responsiveness and high quality of support given to users and the programs investing in M and IC reflects the dedication and skill of the M and IC staff. M and IC has successfully managed serial capacity, parallel capacity, and capability computing resources.more » Serial capacity computing supports a wide range of scientific projects which require access to a few high performance processors within a shared memory computer. Parallel capacity computing supports scientific projects that require a moderate number of processors (up to roughly 1000) on a parallel computer. Capability computing supports parallel jobs that push the limits of simulation science. M and IC has worked closely with Stockpile Stewardship, and together they have made LLNL a premier institution for computational and simulation science. Such a standing is vital to the continued success of laboratory science programs and to the recruitment and retention of top scientists. This report provides recommendations to build on M and IC's accomplishments and improve simulation capabilities at LLNL. We recommend that institution fully fund (1) operation of the atlas cluster purchased in FY06 to support a few large projects; (2) operation of the thunder and zeus clusters to enable 'mid-range' parallel capacity simulations during normal operation and a limited number of large simulations during dedicated application time; (3) operation of the new yana cluster to support a wide range of serial capacity simulations; (4) improvements to the reliability and performance of the Lustre parallel file system; (5) support for the new GDO petabyte-class storage facility on the green network for use in data intensive external collaborations; and (6) continued support for visualization and other methods for analyzing large simulations. We also recommend that M and IC begin planning in FY07 for the next upgrade of its parallel clusters. LLNL investments in M and IC have resulted in a world-class simulation capability leading to innovative science. We thank the LLNL management for its continued support and thank the M and IC staff for its vision and dedicated efforts to make it all happen.« less
Parallel aeroelastic computations for wing and wing-body configurations
NASA Technical Reports Server (NTRS)
Byun, Chansup
1994-01-01
The objective of this research is to develop computationally efficient methods for solving fluid-structural interaction problems by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures on parallel computers. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.
Spatiotemporal Domain Decomposition for Massive Parallel Computation of Space-Time Kernel Density
NASA Astrophysics Data System (ADS)
Hohl, A.; Delmelle, E. M.; Tang, W.
2015-07-01
Accelerated processing capabilities are deemed critical when conducting analysis on spatiotemporal datasets of increasing size, diversity and availability. High-performance parallel computing offers the capacity to solve computationally demanding problems in a limited timeframe, but likewise poses the challenge of preventing processing inefficiency due to workload imbalance between computing resources. Therefore, when designing new algorithms capable of implementing parallel strategies, careful spatiotemporal domain decomposition is necessary to account for heterogeneity in the data. In this study, we perform octtree-based adaptive decomposition of the spatiotemporal domain for parallel computation of space-time kernel density. In order to avoid edge effects near subdomain boundaries, we establish spatiotemporal buffers to include adjacent data-points that are within the spatial and temporal kernel bandwidths. Then, we quantify computational intensity of each subdomain to balance workloads among processors. We illustrate the benefits of our methodology using a space-time epidemiological dataset of Dengue fever, an infectious vector-borne disease that poses a severe threat to communities in tropical climates. Our parallel implementation of kernel density reaches substantial speedup compared to sequential processing, and achieves high levels of workload balance among processors due to great accuracy in quantifying computational intensity. Our approach is portable of other space-time analytical tests.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX-80
NASA Astrophysics Data System (ADS)
Kamat, Manohar P.; Watson, Brian C.
1992-11-01
The finite element method has proven to be an invaluable tool for analysis and design of complex, high performance systems, such as bladed-disk assemblies in aircraft turbofan engines. However, as the problem size increase, the computation time required by conventional computers can be prohibitively high. Parallel processing computers provide the means to overcome these computation time limits. This report summarizes the results of a research activity aimed at providing a finite element capability for analyzing turbomachinery bladed-disk assemblies in a vector/parallel processing environment. A special purpose code, named with the acronym SAPNEW, has been developed to perform static and eigen analysis of multi-degree-of-freedom blade models built-up from flat thin shell elements. SAPNEW provides a stand alone capability for static and eigen analysis on the Alliant FX/80, a parallel processing computer. A preprocessor, named with the acronym NTOS, has been developed to accept NASTRAN input decks and convert them to the SAPNEW format to make SAPNEW more readily used by researchers at NASA Lewis Research Center.
A scalable parallel black oil simulator on distributed memory parallel computers
NASA Astrophysics Data System (ADS)
Wang, Kun; Liu, Hui; Chen, Zhangxin
2015-11-01
This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Xyce parallel electronic simulator users guide, version 6.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users' guide, Version 6.0.1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users guide, version 6.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Electromagnetic Physics Models for Parallel Computing Architectures
NASA Astrophysics Data System (ADS)
Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
2016-10-01
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
Xyce Parallel Electronic Simulator : users' guide, version 2.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoekstra, Robert John; Waters, Lon J.; Rankin, Eric Lamont
2004-06-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator capable of simulating electrical circuits at a variety of abstraction levels. Primarily, Xyce has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability the current state-of-the-art in the following areas: {sm_bullet} Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. {sm_bullet} Improved performance for allmore » numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. {sm_bullet} Device models which are specifically tailored to meet Sandia's needs, including many radiation-aware devices. {sm_bullet} A client-server or multi-tiered operating model wherein the numerical kernel can operate independently of the graphical user interface (GUI). {sm_bullet} Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing of computing platforms. These include serial, shared-memory and distributed-memory parallel implementation - which allows it to run efficiently on the widest possible number parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. One feature required by designers is the ability to add device models, many specific to the needs of Sandia, to the code. To this end, the device package in the Xyce These input formats include standard analytical models, behavioral models look-up Parallel Electronic Simulator is designed to support a variety of device model inputs. tables, and mesh-level PDE device models. Combined with this flexible interface is an architectural design that greatly simplifies the addition of circuit models. One of the most important feature of Xyce is in providing a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia now has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods) research and development can be performed. Ultimately, these capabilities are migrated to end users.« less
NASA Astrophysics Data System (ADS)
Valasek, Lukas; Glasa, Jan
2017-12-01
Current fire simulation systems are capable to utilize advantages of high-performance computer (HPC) platforms available and to model fires efficiently in parallel. In this paper, efficiency of a corridor fire simulation on a HPC computer cluster is discussed. The parallel MPI version of Fire Dynamics Simulator is used for testing efficiency of selected strategies of allocation of computational resources of the cluster using a greater number of computational cores. Simulation results indicate that if the number of cores used is not equal to a multiple of the total number of cluster node cores there are allocation strategies which provide more efficient calculations.
Transferring ecosystem simulation codes to supercomputers
NASA Technical Reports Server (NTRS)
Skiles, J. W.; Schulbach, C. H.
1995-01-01
Many ecosystem simulation computer codes have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Supercomputing platforms (both parallel and distributed systems) have been largely unused, however, because of the perceived difficulty in accessing and using the machines. Also, significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers must be considered. We have transferred a grassland simulation model (developed on a VAX) to a Cray Y-MP/C90. We describe porting the model to the Cray and the changes we made to exploit the parallelism in the application and improve code execution. The Cray executed the model 30 times faster than the VAX and 10 times faster than a Unix workstation. We achieved an additional speedup of 30 percent by using the compiler's vectoring and 'in-line' capabilities. The code runs at only about 5 percent of the Cray's peak speed because it ineffectively uses the vector and parallel processing capabilities of the Cray. We expect that by restructuring the code, it could execute an additional six to ten times faster.
A language comparison for scientific computing on MIMD architectures
NASA Technical Reports Server (NTRS)
Jones, Mark T.; Patrick, Merrell L.; Voigt, Robert G.
1989-01-01
Choleski's method for solving banded symmetric, positive definite systems is implemented on a multiprocessor computer using three FORTRAN based parallel programming languages, the Force, PISCES and Concurrent FORTRAN. The capabilities of the language for expressing parallelism and their user friendliness are discussed, including readability of the code, debugging assistance offered, and expressiveness of the languages. The performance of the different implementations is compared. It is argued that PISCES, using the Force for medium-grained parallelism, is the appropriate choice for programming Choleski's method on the multiprocessor computer, Flex/32.
GPU computing in medical physics: a review.
Pratx, Guillem; Xing, Lei
2011-05-01
The graphics processing unit (GPU) has emerged as a competitive platform for computing massively parallel problems. Many computing applications in medical physics can be formulated as data-parallel tasks that exploit the capabilities of the GPU for reducing processing times. The authors review the basic principles of GPU computing as well as the main performance optimization techniques, and survey existing applications in three areas of medical physics, namely image reconstruction, dose calculation and treatment plan optimization, and image processing.
Electromagnetic physics models for parallel computing architectures
Amadio, G.; Ananya, A.; Apostolakis, J.; ...
2016-11-21
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part ofmore » the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.« less
Parallel and Portable Monte Carlo Particle Transport
NASA Astrophysics Data System (ADS)
Lee, S. R.; Cummings, J. C.; Nolen, S. D.; Keen, N. D.
1997-08-01
We have developed a multi-group, Monte Carlo neutron transport code in C++ using object-oriented methods and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and α eigenvalues of the neutron transport equation on a rectilinear computational mesh. It is portable to and runs in parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities are discussed, along with physics and performance results for several test problems on a variety of hardware, including all three Accelerated Strategic Computing Initiative (ASCI) platforms. Current parallel performance indicates the ability to compute α-eigenvalues in seconds or minutes rather than days or weeks. Current and future work on the implementation of a general transport physics framework (TPF) is also described. This TPF employs modern C++ programming techniques to provide simplified user interfaces, generic STL-style programming, and compile-time performance optimization. Physics capabilities of the TPF will be extended to include continuous energy treatments, implicit Monte Carlo algorithms, and a variety of convergence acceleration techniques such as importance combing.
Providing a parallel and distributed capability for JMASS using SPEEDES
NASA Astrophysics Data System (ADS)
Valinski, Maria; Driscoll, Jonathan; McGraw, Robert M.; Meyer, Bob
2002-07-01
The Joint Modeling And Simulation System (JMASS) is a Tri-Service simulation environment that supports engineering and engagement-level simulations. As JMASS is expanded to support other Tri-Service domains, the current set of modeling services must be expanded for High Performance Computing (HPC) applications by adding support for advanced time-management algorithms, parallel and distributed topologies, and high speed communications. By providing support for these services, JMASS can better address modeling domains requiring parallel computationally intense calculations such clutter, vulnerability and lethality calculations, and underwater-based scenarios. A risk reduction effort implementing some HPC services for JMASS using the SPEEDES (Synchronous Parallel Environment for Emulation and Discrete Event Simulation) Simulation Framework has recently concluded. As an artifact of the JMASS-SPEEDES integration, not only can HPC functionality be brought to the JMASS program through SPEEDES, but an additional HLA-based capability can be demonstrated that further addresses interoperability issues. The JMASS-SPEEDES integration provided a means of adding HLA capability to preexisting JMASS scenarios through an implementation of the standard JMASS port communication mechanism that allows players to communicate.
NASA Astrophysics Data System (ADS)
Russkova, Tatiana V.
2017-11-01
One tool to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is the parallel technology. A new algorithm oriented to parallel execution on the CUDA-enabled NVIDIA graphics processor is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both a vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions including continuous singlelayered and multilayered clouds, and selective molecular absorption are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors
NASA Astrophysics Data System (ADS)
Yi, Hongsuk
2014-03-01
Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
Parallel-vector out-of-core equation solver for computational mechanics
NASA Technical Reports Server (NTRS)
Qin, J.; Agarwal, T. K.; Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.
1993-01-01
A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/ output (I/O) time is reduced by using the a synchronous BUFFER IN and BUFFER OUT, which can be executed simultaneously with the CPU instructions. The parallel and vector capability provided by the supercomputers is also exploited to enhance the performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Watson, Willie R. (Technical Monitor)
2005-01-01
The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
User's Guide for ENSAERO_FE Parallel Finite Element Solver
NASA Technical Reports Server (NTRS)
Eldred, Lloyd B.; Guruswamy, Guru P.
1999-01-01
A high fidelity parallel static structural analysis capability is created and interfaced to the multidisciplinary analysis package ENSAERO-MPI of Ames Research Center. This new module replaces ENSAERO's lower fidelity simple finite element and modal modules. Full aircraft structures may be more accurately modeled using the new finite element capability. Parallel computation is performed by breaking the full structure into multiple substructures. This approach is conceptually similar to ENSAERO's multizonal fluid analysis capability. The new substructure code is used to solve the structural finite element equations for each substructure in parallel. NASTRANKOSMIC is utilized as a front end for this code. Its full library of elements can be used to create an accurate and realistic aircraft model. It is used to create the stiffness matrices for each substructure. The new parallel code then uses an iterative preconditioned conjugate gradient method to solve the global structural equations for the substructure boundary nodes.
NASA Astrophysics Data System (ADS)
Wei, Xiaohui; Li, Weishan; Tian, Hailong; Li, Hongliang; Xu, Haixiao; Xu, Tianfu
2015-07-01
The numerical simulation of multiphase flow and reactive transport in the porous media on complex subsurface problem is a computationally intensive application. To meet the increasingly computational requirements, this paper presents a parallel computing method and architecture. Derived from TOUGHREACT that is a well-established code for simulating subsurface multi-phase flow and reactive transport problems, we developed a high performance computing THC-MP based on massive parallel computer, which extends greatly on the computational capability for the original code. The domain decomposition method was applied to the coupled numerical computing procedure in the THC-MP. We designed the distributed data structure, implemented the data initialization and exchange between the computing nodes and the core solving module using the hybrid parallel iterative and direct solver. Numerical accuracy of the THC-MP was verified through a CO2 injection-induced reactive transport problem by comparing the results obtained from the parallel computing and sequential computing (original code). Execution efficiency and code scalability were examined through field scale carbon sequestration applications on the multicore cluster. The results demonstrate successfully the enhanced performance using the THC-MP on parallel computing facilities.
Methodologies and systems for heterogeneous concurrent computing
NASA Technical Reports Server (NTRS)
Sunderam, V. S.
1994-01-01
Heterogeneous concurrent computing is gaining increasing acceptance as an alternative or complementary paradigm to multiprocessor-based parallel processing as well as to conventional supercomputing. While algorithmic and programming aspects of heterogeneous concurrent computing are similar to their parallel processing counterparts, system issues, partitioning and scheduling, and performance aspects are significantly different. In this paper, we discuss critical design and implementation issues in heterogeneous concurrent computing, and describe techniques for enhancing its effectiveness. In particular, we highlight the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments. We also present recent developments and experiences in the context of the PVM system and comment on ongoing and future work.
NASA Technical Reports Server (NTRS)
Janetzke, David C.; Murthy, Durbha V.
1991-01-01
Aeroelastic analysis is multi-disciplinary and computationally expensive. Hence, it can greatly benefit from parallel processing. As part of an effort to develop an aeroelastic capability on a distributed memory transputer network, a parallel algorithm for the computation of aerodynamic influence coefficients is implemented on a network of 32 transputers. The aerodynamic influence coefficients are calculated using a 3-D unsteady aerodynamic model and a parallel discretization. Efficiencies up to 85 percent were demonstrated using 32 processors. The effect of subtask ordering, problem size, and network topology are presented. A comparison to results on a shared memory computer indicates that higher speedup is achieved on the distributed memory system.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.« less
Optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1979-01-01
High capacity optical memories with relatively-high data-transfer rate and multiport simultaneous access capability may serve as basis for new computer architectures. Several computer structures that might profitably use memories are: a) simultaneous record-access system, b) simultaneously-shared memory computer system, and c) parallel digital processing structure.
HeNCE: A Heterogeneous Network Computing Environment
Beguelin, Adam; Dongarra, Jack J.; Geist, George Al; ...
1994-01-01
Network computing seeks to utilize the aggregate resources of many networked computers to solve a single problem. In so doing it is often possible to obtain supercomputer performance from an inexpensive local area network. The drawback is that network computing is complicated and error prone when done by hand, especially if the computers have different operating systems and data formats and are thus heterogeneous. The heterogeneous network computing environment (HeNCE) is an integrated graphical environment for creating and running parallel programs over a heterogeneous collection of computers. It is built on a lower level package called parallel virtual machine (PVM).more » The HeNCE philosophy of parallel programming is to have the programmer graphically specify the parallelism of a computation and to automate, as much as possible, the tasks of writing, compiling, executing, debugging, and tracing the network computation. Key to HeNCE is a graphical language based on directed graphs that describe the parallelism and data dependencies of an application. Nodes in the graphs represent conventional Fortran or C subroutines and the arcs represent data and control flow. This article describes the present state of HeNCE, its capabilities, limitations, and areas of future research.« less
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.
2014-01-01
Abstract. Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
Tankam, Patrice; Santhanam, Anand P; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P
2014-07-01
Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing.
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.
NASA Astrophysics Data System (ADS)
Newman, Gregory A.
2014-01-01
Many geoscientific applications exploit electrostatic and electromagnetic fields to interrogate and map subsurface electrical resistivity—an important geophysical attribute for characterizing mineral, energy, and water resources. In complex three-dimensional geologies, where many of these resources remain to be found, resistivity mapping requires large-scale modeling and imaging capabilities, as well as the ability to treat significant data volumes, which can easily overwhelm single-core and modest multicore computing hardware. To treat such problems requires large-scale parallel computational resources, necessary for reducing the time to solution to a time frame acceptable to the exploration process. The recognition that significant parallel computing processes must be brought to bear on these problems gives rise to choices that must be made in parallel computing hardware and software. In this review, some of these choices are presented, along with the resulting trade-offs. We also discuss future trends in high-performance computing and the anticipated impact on electromagnetic (EM) geophysics. Topics discussed in this review article include a survey of parallel computing platforms, graphics processing units to multicore CPUs with a fast interconnect, along with effective parallel solvers and associated solver libraries effective for inductive EM modeling and imaging.
Toward an automated parallel computing environment for geosciences
NASA Astrophysics Data System (ADS)
Zhang, Huai; Liu, Mian; Shi, Yaolin; Yuen, David A.; Yan, Zhenzhen; Liang, Guoping
2007-08-01
Software for geodynamic modeling has not kept up with the fast growing computing hardware and network resources. In the past decade supercomputing power has become available to most researchers in the form of affordable Beowulf clusters and other parallel computer platforms. However, to take full advantage of such computing power requires developing parallel algorithms and associated software, a task that is often too daunting for geoscience modelers whose main expertise is in geosciences. We introduce here an automated parallel computing environment built on open-source algorithms and libraries. Users interact with this computing environment by specifying the partial differential equations, solvers, and model-specific properties using an English-like modeling language in the input files. The system then automatically generates the finite element codes that can be run on distributed or shared memory parallel machines. This system is dynamic and flexible, allowing users to address different problems in geosciences. It is capable of providing web-based services, enabling users to generate source codes online. This unique feature will facilitate high-performance computing to be integrated with distributed data grids in the emerging cyber-infrastructures for geosciences. In this paper we discuss the principles of this automated modeling environment and provide examples to demonstrate its versatility.
Fault Tolerant Parallel Implementations of Iterative Algorithms for Optimal Control Problems
1988-01-21
p/.V)] steps, but did not discuss any specific parallel implementation. Gajski [51 improved upon this result by performing the SIMD computation in...N = p2. our approach reduces to that of [51, except that Gajski presents the coefficient computation and partial solution phases as a single...8217>. the SIMD algo- rithm presented by Gajski [5] can be most efficiently mapped to a unidirec- tional ring network with broadcasting capability. Based
1985-05-01
unit in the data base, with knowing one generic assembly language. °-’--a 139 The 5-tuple describing single operation execution time of the operations...TSi-- generate , random eventi ( ,.0-15 tieit tmls - ((floa egus ()16 274 r Ispt imet imel I at :EVE’JS- II ktime=0.0; /0 present time 0/ rrs ptime=0.0...computing machinery capable of performing these tasks within a given time constraint. Because the majority of the available computing machinery is general
NASA Technical Reports Server (NTRS)
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Large-scale Parallel Unstructured Mesh Computations for 3D High-lift Analysis
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.; Pirzadeh, S.
1999-01-01
A complete "geometry to drag-polar" analysis capability for the three-dimensional high-lift configurations is described. The approach is based on the use of unstructured meshes in order to enable rapid turnaround for complicated geometries that arise in high-lift configurations. Special attention is devoted to creating a capability for enabling analyses on highly resolved grids. Unstructured meshes of several million vertices are initially generated on a work-station, and subsequently refined on a supercomputer. The flow is solved on these refined meshes on large parallel computers using an unstructured agglomeration multigrid algorithm. Good prediction of lift and drag throughout the range of incidences is demonstrated on a transport take-off configuration using up to 24.7 million grid points. The feasibility of using this approach in a production environment on existing parallel machines is demonstrated, as well as the scalability of the solver on machines using up to 1450 processors.
OceanXtremes: Scalable Anomaly Detection in Oceanographic Time-Series
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Armstrong, E. M.; Chin, T. M.; Gill, K. M.; Greguska, F. R., III; Huang, T.; Jacob, J. C.; Quach, N.
2016-12-01
The oceanographic community must meet the challenge to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we are developing an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of 15 to 30-year ocean science datasets.Our parallel analytics engine is extending the NEXUS system and exploits multiple open-source technologies: Apache Cassandra as a distributed spatial "tile" cache, Apache Spark for in-memory parallel computation, and Apache Solr for spatial search and storing pre-computed tile statistics and other metadata. OceanXtremes provides these key capabilities: Parallel generation (Spark on a compute cluster) of 15 to 30-year Ocean Climatologies (e.g. sea surface temperature or SST) in hours or overnight, using simple pixel averages or customizable Gaussian-weighted "smoothing" over latitude, longitude, and time; Parallel pre-computation, tiling, and caching of anomaly fields (daily variables minus a chosen climatology) with pre-computed tile statistics; Parallel detection (over the time-series of tiles) of anomalies or phenomena by regional area-averages exceeding a specified threshold (e.g. high SST in El Nino or SST "blob" regions), or more complex, custom data mining algorithms; Shared discovery and exploration of ocean phenomena and anomalies (facet search using Solr), along with unexpected correlations between key measured variables; Scalable execution for all capabilities on a hybrid Cloud, using our on-premise OpenStack Cloud cluster or at Amazon. The key idea is that the parallel data-mining operations will be run "near" the ocean data archives (a local "network" hop) so that we can efficiently access the thousands of files making up a three decade time-series. The presentation will cover the architecture of OceanXtremes, parallelization of the climatology computation and anomaly detection algorithms using Spark, example results for SST and other time-series, and parallel performance metrics.
Optics Program Modified for Multithreaded Parallel Computing
NASA Technical Reports Server (NTRS)
Lou, John; Bedding, Dave; Basinger, Scott
2006-01-01
A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Aeroelasticity of wing and wing-body configurations on parallel computers
NASA Technical Reports Server (NTRS)
Byun, Chansup
1995-01-01
The objective of this research is to develop computationally efficient methods for solving aeroelasticity problems on parallel computers. Both uncoupled and coupled methods are studied in this research. For the uncoupled approach, the conventional U-g method is used to determine the flutter boundary. The generalized aerodynamic forces required are obtained by the pulse transfer-function analysis method. For the coupled approach, the fluid-structure interaction is obtained by directly coupling finite difference Euler/Navier-Stokes equations for fluids and finite element dynamics equations for structures. This capability will significantly impact many aerospace projects of national importance such as Advanced Subsonic Civil Transport (ASCT), where the structural stability margin becomes very critical at the transonic region. This research effort will have direct impact on the High Performance Computing and Communication (HPCC) Program of NASA in the area of parallel computing.
Parallel computers - Estimate errors caused by imprecise data
NASA Technical Reports Server (NTRS)
Kreinovich, Vladik; Bernat, Andrew; Villa, Elsa; Mariscal, Yvonne
1991-01-01
A new approach to the problem of estimating errors caused by imprecise data is proposed in the context of software engineering. A software device is used to produce an ideal solution to the problem, when the computer is capable of computing errors of arbitrary programs. The software engineering aspect of this problem is to describe a device for computing the error estimates in software terms and then to provide precise numbers with error estimates to the user. The feasibility of the program capable of computing both some quantity and its error estimate in the range of possible measurement errors is demonstrated.
Global Magnetohydrodynamic Simulation Using High Performance FORTRAN on Parallel Computers
NASA Astrophysics Data System (ADS)
Ogino, T.
High Performance Fortran (HPF) is one of modern and common techniques to achieve high performance parallel computation. We have translated a 3-dimensional magnetohydrodynamic (MHD) simulation code of the Earth's magnetosphere from VPP Fortran to HPF/JA on the Fujitsu VPP5000/56 vector-parallel supercomputer and the MHD code was fully vectorized and fully parallelized in VPP Fortran. The entire performance and capability of the HPF MHD code could be shown to be almost comparable to that of VPP Fortran. A 3-dimensional global MHD simulation of the earth's magnetosphere was performed at a speed of over 400 Gflops with an efficiency of 76.5 VPP5000/56 in vector and parallel computation that permitted comparison with catalog values. We have concluded that fluid and MHD codes that are fully vectorized and fully parallelized in VPP Fortran can be translated with relative ease to HPF/JA, and a code in HPF/JA may be expected to perform comparably to the same code written in VPP Fortran.
Xyce Parallel Electronic Simulator Users' Guide Version 6.8
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase$-$ a message passing parallel implementation $-$ which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Compute as Fast as the Engineers Can Think! ULTRAFAST COMPUTING TEAM FINAL REPORT
NASA Technical Reports Server (NTRS)
Biedron, R. T.; Mehrotra, P.; Nelson, M. L.; Preston, M. L.; Rehder, J. J.; Rogersm J. L.; Rudy, D. H.; Sobieski, J.; Storaasli, O. O.
1999-01-01
This report documents findings and recommendations by the Ultrafast Computing Team (UCT). In the period 10-12/98, UCT reviewed design case scenarios for a supersonic transport and a reusable launch vehicle to derive computing requirements necessary for support of a design process with efficiency so radically improved that human thought rather than the computer paces the process. Assessment of the present computing capability against the above requirements indicated a need for further improvement in computing speed by several orders of magnitude to reduce time to solution from tens of hours to seconds in major applications. Evaluation of the trends in computer technology revealed a potential to attain the postulated improvement by further increases of single processor performance combined with massively parallel processing in a heterogeneous environment. However, utilization of massively parallel processing to its full capability will require redevelopment of the engineering analysis and optimization methods, including invention of new paradigms. To that end UCT recommends initiation of a new activity at LaRC called Computational Engineering for development of new methods and tools geared to the new computer architectures in disciplines, their coordination, and validation and benefit demonstration through applications.
High-Performance Parallel Analysis of Coupled Problems for Aircraft Propulsion
NASA Technical Reports Server (NTRS)
Felippa, C. A.; Farhat, C.; Park, K. C.; Gumaste, U.; Chen, P.-S.; Lesoinne, M.; Stern, P.
1996-01-01
This research program dealt with the application of high-performance computing methods to the numerical simulation of complete jet engines. The program was initiated in January 1993 by applying two-dimensional parallel aeroelastic codes to the interior gas flow problem of a bypass jet engine. The fluid mesh generation, domain decomposition and solution capabilities were successfully tested. Attention was then focused on methodology for the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion driven by these structural displacements. The latter is treated by a ALE technique that models the fluid mesh motion as that of a fictitious mechanical network laid along the edges of near-field fluid elements. New partitioned analysis procedures to treat this coupled three-component problem were developed during 1994 and 1995. These procedures involved delayed corrections and subcycling, and have been successfully tested on several massively parallel computers, including the iPSC-860, Paragon XP/S and the IBM SP2. For the global steady-state axisymmetric analysis of a complete engine we have decided to use the NASA-sponsored ENG10 program, which uses a regular FV-multiblock-grid discretization in conjunction with circumferential averaging to include effects of blade forces, loss, combustor heat addition, blockage, bleeds and convective mixing. A load-balancing preprocessor tor parallel versions of ENG10 was developed. During 1995 and 1996 we developed the capability tor the first full 3D aeroelastic simulation of a multirow engine stage. This capability was tested on the IBM SP2 parallel supercomputer at NASA Ames. Benchmark results were presented at the 1196 Computational Aeroscience meeting.
Li, Chuan; Petukh, Marharyta; Li, Lin; Alexov, Emil
2013-08-15
Due to the enormous importance of electrostatics in molecular biology, calculating the electrostatic potential and corresponding energies has become a standard computational approach for the study of biomolecules and nano-objects immersed in water and salt phase or other media. However, the electrostatics of large macromolecules and macromolecular complexes, including nano-objects, may not be obtainable via explicit methods and even the standard continuum electrostatics methods may not be applicable due to high computational time and memory requirements. Here, we report further development of the parallelization scheme reported in our previous work (Li, et al., J. Comput. Chem. 2012, 33, 1960) to include parallelization of the molecular surface and energy calculations components of the algorithm. The parallelization scheme utilizes different approaches such as space domain parallelization, algorithmic parallelization, multithreading, and task scheduling, depending on the quantity being calculated. This allows for efficient use of the computing resources of the corresponding computer cluster. The parallelization scheme is implemented in the popular software DelPhi and results in speedup of several folds. As a demonstration of the efficiency and capability of this methodology, the electrostatic potential, and electric field distributions are calculated for the bovine mitochondrial supercomplex illustrating their complex topology, which cannot be obtained by modeling the supercomplex components alone. Copyright © 2013 Wiley Periodicals, Inc.
Fast Particle Methods for Multiscale Phenomena Simulations
NASA Technical Reports Server (NTRS)
Koumoutsakos, P.; Wray, A.; Shariff, K.; Pohorille, Andrew
2000-01-01
We are developing particle methods oriented at improving computational modeling capabilities of multiscale physical phenomena in : (i) high Reynolds number unsteady vortical flows, (ii) particle laden and interfacial flows, (iii)molecular dynamics studies of nanoscale droplets and studies of the structure, functions, and evolution of the earliest living cell. The unifying computational approach involves particle methods implemented in parallel computer architectures. The inherent adaptivity, robustness and efficiency of particle methods makes them a multidisciplinary computational tool capable of bridging the gap of micro-scale and continuum flow simulations. Using efficient tree data structures, multipole expansion algorithms, and improved particle-grid interpolation, particle methods allow for simulations using millions of computational elements, making possible the resolution of a wide range of length and time scales of these important physical phenomena.The current challenges in these simulations are in : [i] the proper formulation of particle methods in the molecular and continuous level for the discretization of the governing equations [ii] the resolution of the wide range of time and length scales governing the phenomena under investigation. [iii] the minimization of numerical artifacts that may interfere with the physics of the systems under consideration. [iv] the parallelization of processes such as tree traversal and grid-particle interpolations We are conducting simulations using vortex methods, molecular dynamics and smooth particle hydrodynamics, exploiting their unifying concepts such as : the solution of the N-body problem in parallel computers, highly accurate particle-particle and grid-particle interpolations, parallel FFT's and the formulation of processes such as diffusion in the context of particle methods. This approach enables us to transcend among seemingly unrelated areas of research.
Implementation of a parallel unstructured Euler solver on the CM-5
NASA Technical Reports Server (NTRS)
Morano, Eric; Mavriplis, D. J.
1995-01-01
An efficient unstructured 3D Euler solver is parallelized on a Thinking Machine Corporation Connection Machine 5, distributed memory computer with vectoring capability. In this paper, the single instruction multiple data (SIMD) strategy is employed through the use of the CM Fortran language and the CMSSL scientific library. The performance of the CMSSL mesh partitioner is evaluated and the overall efficiency of the parallel flow solver is discussed.
NASA Astrophysics Data System (ADS)
Zatarain Salazar, Jazmin; Reed, Patrick M.; Quinn, Julianne D.; Giuliani, Matteo; Castelletti, Andrea
2017-11-01
Reservoir operations are central to our ability to manage river basin systems serving conflicting multi-sectoral demands under increasingly uncertain futures. These challenges motivate the need for new solution strategies capable of effectively and efficiently discovering the multi-sectoral tradeoffs that are inherent to alternative reservoir operation policies. Evolutionary many-objective direct policy search (EMODPS) is gaining importance in this context due to its capability of addressing multiple objectives and its flexibility in incorporating multiple sources of uncertainties. This simulation-optimization framework has high potential for addressing the complexities of water resources management, and it can benefit from current advances in parallel computing and meta-heuristics. This study contributes a diagnostic assessment of state-of-the-art parallel strategies for the auto-adaptive Borg Multi Objective Evolutionary Algorithm (MOEA) to support EMODPS. Our analysis focuses on the Lower Susquehanna River Basin (LSRB) system where multiple sectoral demands from hydropower production, urban water supply, recreation and environmental flows need to be balanced. Using EMODPS with different parallel configurations of the Borg MOEA, we optimize operating policies over different size ensembles of synthetic streamflows and evaporation rates. As we increase the ensemble size, we increase the statistical fidelity of our objective function evaluations at the cost of higher computational demands. This study demonstrates how to overcome the mathematical and computational barriers associated with capturing uncertainties in stochastic multiobjective reservoir control optimization, where parallel algorithmic search serves to reduce the wall-clock time in discovering high quality representations of key operational tradeoffs. Our results show that emerging self-adaptive parallelization schemes exploiting cooperative search populations are crucial. Such strategies provide a promising new set of tools for effectively balancing exploration, uncertainty, and computational demands when using EMODPS.
Analysis and selection of optimal function implementations in massively parallel computer
Archer, Charles Jens [Rochester, MN; Peters, Amanda [Rochester, MN; Ratterman, Joseph D [Rochester, MN
2011-05-31
An apparatus, program product and method optimize the operation of a parallel computer system by, in part, collecting performance data for a set of implementations of a function capable of being executed on the parallel computer system based upon the execution of the set of implementations under varying input parameters in a plurality of input dimensions. The collected performance data may be used to generate selection program code that is configured to call selected implementations of the function in response to a call to the function under varying input parameters. The collected performance data may be used to perform more detailed analysis to ascertain the comparative performance of the set of implementations of the function under the varying input parameters.
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
NASA Technical Reports Server (NTRS)
Gorospe, George E., Jr.; Daigle, Matthew J.; Sankararaman, Shankar; Kulkarni, Chetan S.; Ng, Eley
2017-01-01
Prognostic methods enable operators and maintainers to predict the future performance for critical systems. However, these methods can be computationally expensive and may need to be performed each time new information about the system becomes available. In light of these computational requirements, we have investigated the application of graphics processing units (GPUs) as a computational platform for real-time prognostics. Recent advances in GPU technology have reduced cost and increased the computational capability of these highly parallel processing units, making them more attractive for the deployment of prognostic software. We present a survey of model-based prognostic algorithms with considerations for leveraging the parallel architecture of the GPU and a case study of GPU-accelerated battery prognostics with computational performance results.
Parallel computation and the basis system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, G.R.
1993-05-01
A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communications costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis andmore » Parallel Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less
A high performance linear equation solver on the VPP500 parallel supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi
1994-12-31
This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of VPP500--(1) scalability for an arbitrary number of processors up to 222 processors, (2) flexible data transfer among processors provided by a crossbar interconnection network, (3) vector processing capability on each processor, and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LIN-PACK Highly Parallel Computing benchmark.
MPACT Standard Input User s Manual, Version 2.2.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
Collins, Benjamin S.; Downar, Thomas; Fitzgerald, Andrew
The MPACT (Michigan PArallel Charactistics based Transport) code is designed to perform high-fidelity light water reactor (LWR) analysis using whole-core pin-resolved neutron transport calculations on modern parallel-computing hardware. The code consists of several libraries which provide the functionality necessary to solve steady-state eigenvalue problems. Several transport capabilities are available within MPACT including both 2-D and 3-D Method of Characteristics (MOC). A three-dimensional whole core solution based on the 2D-1D solution method provides the capability for full core depletion calculations.
Reconfigurable Computing for Computational Science: A New Focus in High Performance Computing
2006-11-01
in the past decade. Researchers are regularly employing the power of large computing systems and parallel processing to tackle larger and more...complex problems in all of the physical sciences. For the past decade or so, most of this growth in computing power has been “free” with increased...the scientific computing community as a means to continued growth in computing capability. This paper offers a glimpse of the hardware and
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems
NASA Technical Reports Server (NTRS)
Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.
2000-01-01
The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
GPU-based Parallel Application Design for Emerging Mobile Devices
NASA Astrophysics Data System (ADS)
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as compute and communication capabilities of mobile devices improve, we analyze energy implications of processing speech recognition locally (on-chip) and offloading it to servers (in-cloud).
Simulation framework for intelligent transportation systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ewing, T.; Doss, E.; Hanebutte, U.
1996-10-01
A simulation framework has been developed for a large-scale, comprehensive, scaleable simulation of an Intelligent Transportation System (ITS). The simulator is designed for running on parallel computers and distributed (networked) computer systems, but can run on standalone workstations for smaller simulations. The simulator currently models instrumented smart vehicles with in-vehicle navigation units capable of optimal route planning and Traffic Management Centers (TMC). The TMC has probe vehicle tracking capabilities (display position and attributes of instrumented vehicles), and can provide two-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphicalmore » user interfaces to support human-factors studies. Realistic modeling of variations of the posted driving speed are based on human factors studies that take into consideration weather, road conditions, driver personality and behavior, and vehicle type. The prototype has been developed on a distributed system of networked UNIX computers but is designed to run on parallel computers, such as ANL`s IBM SP-2, for large-scale problems. A novel feature of the approach is that vehicles are represented by autonomous computer processes which exchange messages with other processes. The vehicles have a behavior model which governs route selection and driving behavior, and can react to external traffic events much like real vehicles. With this approach, the simulation is scaleable to take advantage of emerging massively parallel processor (MPP) systems.« less
Johnson, Timothy C.; Versteeg, Roelof J.; Ward, Andy; Day-Lewis, Frederick D.; Revil, André
2010-01-01
Electrical geophysical methods have found wide use in the growing discipline of hydrogeophysics for characterizing the electrical properties of the subsurface and for monitoring subsurface processes in terms of the spatiotemporal changes in subsurface conductivity, chargeability, and source currents they govern. Presently, multichannel and multielectrode data collections systems can collect large data sets in relatively short periods of time. Practitioners, however, often are unable to fully utilize these large data sets and the information they contain because of standard desktop-computer processing limitations. These limitations can be addressed by utilizing the storage and processing capabilities of parallel computing environments. We have developed a parallel distributed-memory forward and inverse modeling algorithm for analyzing resistivity and time-domain induced polar-ization (IP) data. The primary components of the parallel computations include distributed computation of the pole solutions in forward mode, distributed storage and computation of the Jacobian matrix in inverse mode, and parallel execution of the inverse equation solver. We have tested the corresponding parallel code in three efforts: (1) resistivity characterization of the Hanford 300 Area Integrated Field Research Challenge site in Hanford, Washington, U.S.A., (2) resistivity characterization of a volcanic island in the southern Tyrrhenian Sea in Italy, and (3) resistivity and IP monitoring of biostimulation at a Superfund site in Brandywine, Maryland, U.S.A. Inverse analysis of each of these data sets would be limited or impossible in a standard serial computing environment, which underscores the need for parallel high-performance computing to fully utilize the potential of electrical geophysical methods in hydrogeophysical applications.
NASA Astrophysics Data System (ADS)
Yang, Sheng-Chun; Lu, Zhong-Yuan; Qian, Hu-Jun; Wang, Yong-Lei; Han, Jie-Ping
2017-11-01
In this work, we upgraded the electrostatic interaction method of CU-ENUF (Yang, et al., 2016) which first applied CUNFFT (nonequispaced Fourier transforms based on CUDA) to the reciprocal-space electrostatic computation and made the computation of electrostatic interaction done thoroughly in GPU. The upgraded edition of CU-ENUF runs concurrently in a hybrid parallel way that enables the computation parallelizing on multiple computer nodes firstly, then further on the installed GPU in each computer. By this parallel strategy, the size of simulation system will be never restricted to the throughput of a single CPU or GPU. The most critical technical problem is how to parallelize a CUNFFT in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills. Furthermore, the upgraded method is capable of computing electrostatic interactions for both the atomistic molecular dynamics (MD) and the dissipative particle dynamics (DPD). Finally, the benchmarks conducted for validation and performance indicate that the upgraded method is able to not only present a good precision when setting suitable parameters, but also give an efficient way to compute electrostatic interactions for huge simulation systems. Program Files doi:http://dx.doi.org/10.17632/zncf24fhpv.1 Licensing provisions: GNU General Public License 3 (GPL) Programming language: C, C++, and CUDA C Supplementary material: The program is designed for effective electrostatic interactions of large-scale simulation systems, which runs on particular computers equipped with NVIDIA GPUs. It has been tested on (a) single computer node with Intel(R) Core(TM) i7-3770@ 3.40 GHz (CPU) and GTX 980 Ti (GPU), and (b) MPI parallel computer nodes with the same configurations. Nature of problem: For molecular dynamics simulation, the electrostatic interaction is the most time-consuming computation because of its long-range feature and slow convergence in simulation space, which approximately take up most of the total simulation time. Although the parallel method CU-ENUF (Yang et al., 2016) based on GPU has achieved a qualitative leap compared with previous methods in electrostatic interactions computation, the computation capability is limited to the throughput capacity of a single GPU for super-scale simulation system. Therefore, we should look for an effective method to handle the calculation of electrostatic interactions efficiently for a simulation system with super-scale size. Solution method: We constructed a hybrid parallel architecture, in which CPU and GPU are combined to accelerate the electrostatic computation effectively. Firstly, the simulation system is divided into many subtasks via domain-decomposition method. Then MPI (Message Passing Interface) is used to implement the CPU-parallel computation with each computer node corresponding to a particular subtask, and furthermore each subtask in one computer node will be executed in GPU in parallel efficiently. In this hybrid parallel method, the most critical technical problem is how to parallelize a CUNFFT (nonequispaced fast Fourier transform based on CUDA) in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills. Restrictions: The HP-ENUF is mainly oriented to super-scale system simulations, in which the performance superiority is shown adequately. However, for a small simulation system containing less than 106 particles, the mode of multiple computer nodes has no apparent efficiency advantage or even lower efficiency due to the serious network delay among computer nodes, than the mode of single computer node. References: (1) S.-C. Yang, H.-J. Qian, Z.-Y. Lu, Appl. Comput. Harmon. Anal. 2016, http://dx.doi.org/10.1016/j.acha.2016.04.009. (2) S.-C. Yang, Y.-L. Wang, G.-S. Jiao, H.-J. Qian, Z.-Y. Lu, J. Comput. Chem. 37 (2016) 378. (3) S.-C. Yang, Y.-L. Zhu, H.-J. Qian, Z.-Y. Lu, Appl. Chem. Res. Chin. Univ., 2017, http://dx.doi.org/10.1007/s40242-016-6354-5. (4) Y.-L. Zhu, H. Liu, Z.-W. Li, H.-J. Qian, G. Milano, Z.-Y. Lu, J. Comput. Chem. 34 (2013) 2197.
A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers
NASA Technical Reports Server (NTRS)
Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)
1997-01-01
The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from the robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely-used NASA multi-block Computational Fluid Dynamics (CFD) packages implemented in ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages are identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory accessing. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains using up to 128 wide computational nodes. Multi-block CFD simulations of complete aircraft simulations achieve 75 percent perfect load-balanced executions using data coalescing and the two levels of parallelism. SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms where the robustness of the implementation is tested. The performance behavior on the other computer platforms with a variety of realistic problems will be included as this on-going study progresses.
Potential Application of a Graphical Processing Unit to Parallel Computations in the NUBEAM Code
NASA Astrophysics Data System (ADS)
Payne, J.; McCune, D.; Prater, R.
2010-11-01
NUBEAM is a comprehensive computational Monte Carlo based model for neutral beam injection (NBI) in tokamaks. NUBEAM computes NBI-relevant profiles in tokamak plasmas by tracking the deposition and the slowing of fast ions. At the core of NUBEAM are vector calculations used to track fast ions. These calculations have recently been parallelized to run on MPI clusters. However, cost and interlink bandwidth limit the ability to fully parallelize NUBEAM on an MPI cluster. Recent implementation of double precision capabilities for Graphical Processing Units (GPUs) presents a cost effective and high performance alternative or complement to MPI computation. Commercially available graphics cards can achieve up to 672 GFLOPS double precision and can handle hundreds of thousands of threads. The ability to execute at least one thread per particle simultaneously could significantly reduce the execution time and the statistical noise of NUBEAM. Progress on implementation on a GPU will be presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barney, B; Shuler, J
2006-08-21
Purple is an Advanced Simulation and Computing (ASC) funded massively parallel supercomputer located at Lawrence Livermore National Laboratory (LLNL). The Purple Computational Environment documents the capabilities and the environment provided for the FY06 LLNL Level 1 General Availability Milestone. This document describes specific capabilities, tools, and procedures to support both local and remote users. The model is focused on the needs of the ASC user working in the secure computing environments at Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and Sandia National Laboratories, but also documents needs of the LLNL and Alliance users working in the unclassified environment. Additionally,more » the Purple Computational Environment maps the provided capabilities to the Trilab ASC Computing Environment (ACE) Version 8.0 requirements. The ACE requirements reflect the high performance computing requirements for the General Availability user environment capabilities of the ASC community. Appendix A lists these requirements and includes a description of ACE requirements met and those requirements that are not met for each section of this document. The Purple Computing Environment, along with the ACE mappings, has been issued and reviewed throughout the Tri-lab community.« less
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; ...
2015-12-21
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Somemore » specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000 ® problems. These benchmark and scaling studies show promising results.« less
Computing with motile bio-agents
NASA Astrophysics Data System (ADS)
Nicolau, Dan V., Jr.; Burrage, Kevin; Nicolau, Dan V.
2007-12-01
We describe a model of computation of the parallel type, which we call 'computing with bio-agents', based on the concept that motions of biological objects such as bacteria or protein molecular motors in confined spaces can be regarded as computations. We begin with the observation that the geometric nature of the physical structures in which model biological objects move modulates the motions of the latter. Consequently, by changing the geometry, one can control the characteristic trajectories of the objects; on the basis of this, we argue that such systems are computing devices. We investigate the computing power of mobile bio-agent systems and show that they are computationally universal in the sense that they are capable of computing any Boolean function in parallel. We argue also that using appropriate conditions, bio-agent systems can solve NP-complete problems in probabilistic polynomial time.
A learnable parallel processing architecture towards unity of memory and computing
NASA Astrophysics Data System (ADS)
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-08-14
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
JETSPIN: A specific-purpose open-source software for simulations of nanofiber electrospinning
NASA Astrophysics Data System (ADS)
Lauricella, Marco; Pontrelli, Giuseppe; Coluzza, Ivan; Pisignano, Dario; Succi, Sauro
2015-12-01
We present the open-source computer program JETSPIN, specifically designed to simulate the electrospinning process of nanofibers. Its capabilities are shown with proper reference to the underlying model, as well as a description of the relevant input variables and associated test-case simulations. The various interactions included in the electrospinning model implemented in JETSPIN are discussed in detail. The code is designed to exploit different computational architectures, from single to parallel processor workstations. This paper provides an overview of JETSPIN, focusing primarily on its structure, parallel implementations, functionality, performance, and availability.
An embedded multi-core parallel model for real-time stereo imaging
NASA Astrophysics Data System (ADS)
He, Wenjing; Hu, Jian; Niu, Jingyu; Li, Chuanrong; Liu, Guangyu
2018-04-01
The real-time processing based on embedded system will enhance the application capability of stereo imaging for LiDAR and hyperspectral sensor. The task partitioning and scheduling strategies for embedded multiprocessor system starts relatively late, compared with that for PC computer. In this paper, aimed at embedded multi-core processing platform, a parallel model for stereo imaging is studied and verified. After analyzing the computing amount, throughout capacity and buffering requirements, a two-stage pipeline parallel model based on message transmission is established. This model can be applied to fast stereo imaging for airborne sensors with various characteristics. To demonstrate the feasibility and effectiveness of the parallel model, a parallel software was designed using test flight data, based on the 8-core DSP processor TMS320C6678. The results indicate that the design performed well in workload distribution and had a speed-up ratio up to 6.4.
The science of computing - The evolution of parallel processing
NASA Technical Reports Server (NTRS)
Denning, P. J.
1985-01-01
The present paper is concerned with the approaches to be employed to overcome the set of limitations in software technology which impedes currently an effective use of parallel hardware technology. The process required to solve the arising problems is found to involve four different stages. At the present time, Stage One is nearly finished, while Stage Two is under way. Tentative explorations are beginning on Stage Three, and Stage Four is more distant. In Stage One, parallelism is introduced into the hardware of a single computer, which consists of one or more processors, a main storage system, a secondary storage system, and various peripheral devices. In Stage Two, parallel execution of cooperating programs on different machines becomes explicit, while in Stage Three, new languages will make parallelism implicit. In Stage Four, there will be very high level user interfaces capable of interacting with scientists at the same level of abstraction as scientists do with each other.
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.
Catarinucci, Luca; Tarricone, Luciano
2009-01-01
The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
Pyramidal neurovision architecture for vision machines
NASA Astrophysics Data System (ADS)
Gupta, Madan M.; Knopf, George K.
1993-08-01
The vision system employed by an intelligent robot must be active; active in the sense that it must be capable of selectively acquiring the minimal amount of relevant information for a given task. An efficient active vision system architecture that is based loosely upon the parallel-hierarchical (pyramidal) structure of the biological visual pathway is presented in this paper. Although the computational architecture of the proposed pyramidal neuro-vision system is far less sophisticated than the architecture of the biological visual pathway, it does retain some essential features such as the converging multilayered structure of its biological counterpart. In terms of visual information processing, the neuro-vision system is constructed from a hierarchy of several interactive computational levels, whereupon each level contains one or more nonlinear parallel processors. Computationally efficient vision machines can be developed by utilizing both the parallel and serial information processing techniques within the pyramidal computing architecture. A computer simulation of a pyramidal vision system for active scene surveillance is presented.
Time-Dependent Simulations of Turbopump Flows
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Kwak, Dochan; Chan, William; Williams, Robert
2002-01-01
Unsteady flow simulations for RLV (Reusable Launch Vehicles) 2nd Generation baseline turbopump for one and half impeller rotations have been completed by using a 34.3 Million grid points model. MLP (Multi-Level Parallelism) shared memory parallelism has been implemented in INS3D, and benchmarked. Code optimization for cash based platforms will be completed by the end of September 2001. Moving boundary capability is obtained by using DCF module. Scripting capability from CAD (computer aided design) geometry to solution has been developed. Data compression is applied to reduce data size in post processing. Fluid/Structure coupling has been initiated.
Parallel computation and the Basis system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, G.R.
1992-12-16
A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to-use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communication costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis and Parallelmore » Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less
NASA Technical Reports Server (NTRS)
Morgan, Philip E.
2004-01-01
This final report contains reports of research related to the tasks "Scalable High Performance Computing: Direct and Lark-Eddy Turbulent FLow Simulations Using Massively Parallel Computers" and "Devleop High-Performance Time-Domain Computational Electromagnetics Capability for RCS Prediction, Wave Propagation in Dispersive Media, and Dual-Use Applications. The discussion of Scalable High Performance Computing reports on three objectives: validate, access scalability, and apply two parallel flow solvers for three-dimensional Navier-Stokes flows; develop and validate a high-order parallel solver for Direct Numerical Simulations (DNS) and Large Eddy Simulation (LES) problems; and Investigate and develop a high-order Reynolds averaged Navier-Stokes turbulence model. The discussion of High-Performance Time-Domain Computational Electromagnetics reports on five objectives: enhancement of an electromagnetics code (CHARGE) to be able to effectively model antenna problems; utilize lessons learned in high-order/spectral solution of swirling 3D jets to apply to solving electromagnetics project; transition a high-order fluids code, FDL3DI, to be able to solve Maxwell's Equations using compact-differencing; develop and demonstrate improved radiation absorbing boundary conditions for high-order CEM; and extend high-order CEM solver to address variable material properties. The report also contains a review of work done by the systems engineer.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amadio, G.; et al.
An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physicsmore » models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.« less
Automating the parallel processing of fluid and structural dynamics calculations
NASA Technical Reports Server (NTRS)
Arpasi, Dale J.; Cole, Gary L.
1987-01-01
The NASA Lewis Research Center is actively involved in the development of expert system technology to assist users in applying parallel processing to computational fluid and structural dynamic analysis. The goal of this effort is to eliminate the necessity for the physical scientist to become a computer scientist in order to effectively use the computer as a research tool. Programming and operating software utilities have previously been developed to solve systems of ordinary nonlinear differential equations on parallel scalar processors. Current efforts are aimed at extending these capabilities to systems of partial differential equations, that describe the complex behavior of fluids and structures within aerospace propulsion systems. This paper presents some important considerations in the redesign, in particular, the need for algorithms and software utilities that can automatically identify data flow patterns in the application program and partition and allocate calculations to the parallel processors. A library-oriented multiprocessing concept for integrating the hardware and software functions is described.
Kepper, Nick; Ettig, Ramona; Dickmann, Frank; Stehr, Rene; Grosveld, Frank G; Wedemann, Gero; Knoch, Tobias A
2010-01-01
Especially in the life-science and the health-care sectors the huge IT requirements are imminent due to the large and complex systems to be analysed and simulated. Grid infrastructures play here a rapidly increasing role for research, diagnostics, and treatment, since they provide the necessary large-scale resources efficiently. Whereas grids were first used for huge number crunching of trivially parallelizable problems, increasingly parallel high-performance computing is required. Here, we show for the prime example of molecular dynamic simulations how the presence of large grid clusters including very fast network interconnects within grid infrastructures allows now parallel high-performance grid computing efficiently and thus combines the benefits of dedicated super-computing centres and grid infrastructures. The demands for this service class are the highest since the user group has very heterogeneous requirements: i) two to many thousands of CPUs, ii) different memory architectures, iii) huge storage capabilities, and iv) fast communication via network interconnects, are all needed in different combinations and must be considered in a highly dedicated manner to reach highest performance efficiency. Beyond, advanced and dedicated i) interaction with users, ii) the management of jobs, iii) accounting, and iv) billing, not only combines classic with parallel high-performance grid usage, but more importantly is also able to increase the efficiency of IT resource providers. Consequently, the mere "yes-we-can" becomes a huge opportunity like e.g. the life-science and health-care sectors as well as grid infrastructures by reaching higher level of resource efficiency.
Topical perspective on massive threading and parallelism.
Farber, Robert M
2011-09-01
Unquestionably computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three-orders of magnitude increased computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive-threading and massive-parallelism such as the 1.3 million threads Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world is how to capitalize on these new massively threaded computational architectures--especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelize with some insight into the future. Published by Elsevier Inc.
A Parallel Finite Set Statistical Simulator for Multi-Target Detection and Tracking
NASA Astrophysics Data System (ADS)
Hussein, I.; MacMillan, R.
2014-09-01
Finite Set Statistics (FISST) is a powerful Bayesian inference tool for the joint detection, classification and tracking of multi-target environments. FISST is capable of handling phenomena such as clutter, misdetections, and target birth and decay. Implicit within the approach are solutions to the data association and target label-tracking problems. Finally, FISST provides generalized information measures that can be used for sensor allocation across different types of tasks such as: searching for new targets, and classification and tracking of known targets. These FISST capabilities have been demonstrated on several small-scale illustrative examples. However, for implementation in a large-scale system as in the Space Situational Awareness problem, these capabilities require a lot of computational power. In this paper, we implement FISST in a parallel environment for the joint detection and tracking of multi-target systems. In this implementation, false alarms and misdetections will be modeled. Target birth and decay will not be modeled in the present paper. We will demonstrate the success of the method for as many targets as we possibly can in a desktop parallel environment. Performance measures will include: number of targets in the simulation, certainty of detected target tracks, computational time as a function of clutter returns and number of targets, among other factors.
Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers
NASA Technical Reports Server (NTRS)
Skiles, J. W.; Schulbach, C. H.
1994-01-01
Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We show the Cray shared-memory vector-architecture, and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speed-up of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5% of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.
The potential of multi-port optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1975-01-01
A high-capacity memory with a relatively high data transfer rate and multi-port simultaneous access capability may serve as the basis for new computer architectures. The implementation of a multi-port optical memory is discussed. Several computer structures are presented that might profitably use such a memory. These structures include (1) a simultaneous record access system, (2) a simultaneously shared memory computer system, and (3) a parallel digital processing structure.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vigil,Benny Manuel; Ballance, Robert; Haskell, Karen
Cielo is a massively parallel supercomputer funded by the DOE/NNSA Advanced Simulation and Computing (ASC) program, and operated by the Alliance for Computing at Extreme Scale (ACES), a partnership between Los Alamos National Laboratory (LANL) and Sandia National Laboratories (SNL). The primary Cielo compute platform is physically located at Los Alamos National Laboratory. This Cielo Computational Environment Usage Model documents the capabilities and the environment to be provided for the Q1 FY12 Level 2 Cielo Capability Computing (CCC) Platform Production Readiness Milestone. This document describes specific capabilities, tools, and procedures to support both local and remote users. The model ismore » focused on the needs of the ASC user working in the secure computing environments at Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory, or Sandia National Laboratories, but also addresses the needs of users working in the unclassified environment. The Cielo Computational Environment Usage Model maps the provided capabilities to the tri-Lab ASC Computing Environment (ACE) Version 8.0 requirements. The ACE requirements reflect the high performance computing requirements for the Production Readiness Milestone user environment capabilities of the ASC community. A description of ACE requirements met, and those requirements that are not met, are included in each section of this document. The Cielo Computing Environment, along with the ACE mappings, has been issued and reviewed throughout the tri-Lab community.« less
A Next-Generation Parallel File System Environment for the OLCF
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillow, David A; Fuller, Douglas; Gunasekaran, Raghul
2012-01-01
When deployed in 2008/2009 the Spider system at the Oak Ridge National Laboratory s Leadership Computing Facility (OLCF) was the world s largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF s diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Designed to support the parallel I/O requirements of the Jaguar XT5 system and other smallerscale platforms at the OLCF, the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider s originalmore » design by mid 2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth from over 240GB=sec today to 1TB=sec and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.« less
Nonvolatile “AND,” “OR,” and “NOT” Boolean logic gates based on phase-change memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Y.; Zhong, Y. P.; Deng, Y. F.
2013-12-21
Electronic devices or circuits that can implement both logic and memory functions are regarded as the building blocks for future massive parallel computing beyond von Neumann architecture. Here we proposed phase-change memory (PCM)-based nonvolatile logic gates capable of AND, OR, and NOT Boolean logic operations verified in SPICE simulations and circuit experiments. The logic operations are parallel computing and results can be stored directly in the states of the logic gates, facilitating the combination of computing and memory in the same circuit. These results are encouraging for ultralow-power and high-speed nonvolatile logic circuit design based on novel memory devices.
A parallel computing engine for a class of time critical processes.
Nabhan, T M; Zomaya, A Y
1997-01-01
This paper focuses on the efficient parallel implementation of systems of numerically intensive nature over loosely coupled multiprocessor architectures. These analytical models are of significant importance to many real-time systems that have to meet severe time constants. A parallel computing engine (PCE) has been developed in this work for the efficient simplification and the near optimal scheduling of numerical models over the different cooperating processors of the parallel computer. First, the analytical system is efficiently coded in its general form. The model is then simplified by using any available information (e.g., constant parameters). A task graph representing the interconnections among the different components (or equations) is generated. The graph can then be compressed to control the computation/communication requirements. The task scheduler employs a graph-based iterative scheme, based on the simulated annealing algorithm, to map the vertices of the task graph onto a Multiple-Instruction-stream Multiple-Data-stream (MIMD) type of architecture. The algorithm uses a nonanalytical cost function that properly considers the computation capability of the processors, the network topology, the communication time, and congestion possibilities. Moreover, the proposed technique is simple, flexible, and computationally viable. The efficiency of the algorithm is demonstrated by two case studies with good results.
Advances in Parallelization for Large Scale Oct-Tree Mesh Generation
NASA Technical Reports Server (NTRS)
O'Connell, Matthew; Karman, Steve L.
2015-01-01
Despite great advancements in the parallelization of numerical simulation codes over the last 20 years, it is still common to perform grid generation in serial. Generating large scale grids in serial often requires using special "grid generation" compute machines that can have more than ten times the memory of average machines. While some parallel mesh generation techniques have been proposed, generating very large meshes for LES or aeroacoustic simulations is still a challenging problem. An automated method for the parallel generation of very large scale off-body hierarchical meshes is presented here. This work enables large scale parallel generation of off-body meshes by using a novel combination of parallel grid generation techniques and a hybrid "top down" and "bottom up" oct-tree method. Meshes are generated using hardware commonly found in parallel compute clusters. The capability to generate very large meshes is demonstrated by the generation of off-body meshes surrounding complex aerospace geometries. Results are shown including a one billion cell mesh generated around a Predator Unmanned Aerial Vehicle geometry, which was generated on 64 processors in under 45 minutes.
Orthorectification by Using Gpgpu Method
NASA Astrophysics Data System (ADS)
Sahin, H.; Kulur, S.
2012-07-01
Thanks to the nature of the graphics processing, the newly released products offer highly parallel processing units with high-memory bandwidth and computational power of more than teraflops per second. The modern GPUs are not only powerful graphic engines but also they are high level parallel programmable processors with very fast computing capabilities and high-memory bandwidth speed compared to central processing units (CPU). Data-parallel computations can be shortly described as mapping data elements to parallel processing threads. The rapid development of GPUs programmability and capabilities attracted the attentions of researchers dealing with complex problems which need high level calculations. This interest has revealed the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". The graphic processors are powerful hardware which is really cheap and affordable. So the graphic processors became an alternative to computer processors. The graphic chips which were standard application hardware have been transformed into modern, powerful and programmable processors to meet the overall needs. Especially in recent years, the phenomenon of the usage of graphics processing units in general purpose computation has led the researchers and developers to this point. The biggest problem is that the graphics processing units use different programming models unlike current programming methods. Therefore, an efficient GPU programming requires re-coding of the current program algorithm by considering the limitations and the structure of the graphics hardware. Currently, multi-core processors can not be programmed by using traditional programming methods. Event procedure programming method can not be used for programming the multi-core processors. GPUs are especially effective in finding solution for repetition of the computing steps for many data elements when high accuracy is needed. Thus, it provides the computing process more quickly and accurately. Compared to the GPUs, CPUs which perform just one computing in a time according to the flow control are slower in performance. This structure can be evaluated for various applications of computer technology. In this study covers how general purpose parallel programming and computational power of the GPUs can be used in photogrammetric applications especially direct georeferencing. The direct georeferencing algorithm is coded by using GPGPU method and CUDA (Compute Unified Device Architecture) programming language. Results provided by this method were compared with the traditional CPU programming. In the other application the projective rectification is coded by using GPGPU method and CUDA programming language. Sample images of various sizes, as compared to the results of the program were evaluated. GPGPU method can be used especially in repetition of same computations on highly dense data, thus finding the solution quickly.
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas
The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning.
Hypercluster - Parallel processing for computational mechanics
NASA Technical Reports Server (NTRS)
Blech, Richard A.
1988-01-01
An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
Parallel Unsteady Turbopump Simulations for Liquid Rocket Engines
NASA Technical Reports Server (NTRS)
Kiris, Cetin C.; Kwak, Dochan; Chan, William
2000-01-01
This paper reports the progress being made towards complete turbo-pump simulation capability for liquid rocket engines. Space Shuttle Main Engine (SSME) turbo-pump impeller is used as a test case for the performance evaluation of the MPI and hybrid MPI/Open-MP versions of the INS3D code. Then, a computational model of a turbo-pump has been developed for the shuttle upgrade program. Relative motion of the grid system for rotor-stator interaction was obtained by employing overset grid techniques. Time-accuracy of the scheme has been evaluated by using simple test cases. Unsteady computations for SSME turbo-pump, which contains 136 zones with 35 Million grid points, are currently underway on Origin 2000 systems at NASA Ames Research Center. Results from time-accurate simulations with moving boundary capability, and the performance of the parallel versions of the code will be presented in the final paper.
Using parallel computing for the display and simulation of the space debris environment
NASA Astrophysics Data System (ADS)
Möckel, M.; Wiedemann, C.; Flegel, S.; Gelhaus, J.; Vörsmann, P.; Klinkrad, H.; Krag, H.
2011-07-01
Parallelism is becoming the leading paradigm in today's computer architectures. In order to take full advantage of this development, new algorithms have to be specifically designed for parallel execution while many old ones have to be upgraded accordingly. One field in which parallel computing has been firmly established for many years is computer graphics. Calculating and displaying three-dimensional computer generated imagery in real time requires complex numerical operations to be performed at high speed on a large number of objects. Since most of these objects can be processed independently, parallel computing is applicable in this field. Modern graphics processing units (GPUs) have become capable of performing millions of matrix and vector operations per second on multiple objects simultaneously. As a side project, a software tool is currently being developed at the Institute of Aerospace Systems that provides an animated, three-dimensional visualization of both actual and simulated space debris objects. Due to the nature of these objects it is possible to process them individually and independently from each other. Therefore, an analytical orbit propagation algorithm has been implemented to run on a GPU. By taking advantage of all its processing power a huge performance increase, compared to its CPU-based counterpart, could be achieved. For several years efforts have been made to harness this computing power for applications other than computer graphics. Software tools for the simulation of space debris are among those that could profit from embracing parallelism. With recently emerged software development tools such as OpenCL it is possible to transfer the new algorithms used in the visualization outside the field of computer graphics and implement them, for example, into the space debris simulation environment. This way they can make use of parallel hardware such as GPUs and Multi-Core-CPUs for faster computation. In this paper the visualization software will be introduced, including a comparison between the serial and the parallel method of orbit propagation. Ways of how to use the benefits of the latter method for space debris simulation will be discussed. An introduction to OpenCL will be given as well as an exemplary algorithm from the field of space debris simulation.
Using parallel computing for the display and simulation of the space debris environment
NASA Astrophysics Data System (ADS)
Moeckel, Marek; Wiedemann, Carsten; Flegel, Sven Kevin; Gelhaus, Johannes; Klinkrad, Heiner; Krag, Holger; Voersmann, Peter
Parallelism is becoming the leading paradigm in today's computer architectures. In order to take full advantage of this development, new algorithms have to be specifically designed for parallel execution while many old ones have to be upgraded accordingly. One field in which parallel computing has been firmly established for many years is computer graphics. Calculating and displaying three-dimensional computer generated imagery in real time requires complex numerical operations to be performed at high speed on a large number of objects. Since most of these objects can be processed independently, parallel computing is applicable in this field. Modern graphics processing units (GPUs) have become capable of performing millions of matrix and vector operations per second on multiple objects simultaneously. As a side project, a software tool is currently being developed at the Institute of Aerospace Systems that provides an animated, three-dimensional visualization of both actual and simulated space debris objects. Due to the nature of these objects it is possible to process them individually and independently from each other. Therefore, an analytical orbit propagation algorithm has been implemented to run on a GPU. By taking advantage of all its processing power a huge performance increase, compared to its CPU-based counterpart, could be achieved. For several years efforts have been made to harness this computing power for applications other than computer graphics. Software tools for the simulation of space debris are among those that could profit from embracing parallelism. With recently emerged software development tools such as OpenCL it is possible to transfer the new algorithms used in the visualization outside the field of computer graphics and implement them, for example, into the space debris simulation environment. This way they can make use of parallel hardware such as GPUs and Multi-Core-CPUs for faster computation. In this paper the visualization software will be introduced, including a comparison between the serial and the parallel method of orbit propagation. Ways of how to use the benefits of the latter method for space debris simulation will be discussed. An introduction of OpenCL will be given as well as an exemplary algorithm from the field of space debris simulation.
B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC.
Cui, Yingbo; Liao, Xiangke; Zhu, Xiaoqian; Wang, Bingqiang; Peng, Shaoliang
2016-03-01
Sequence alignment is the central process for sequence analysis, where mapping raw sequencing data to reference genome. The large amount of data generated by NGS is far beyond the process capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2 is the world's fastest supercomputer now equipped with three MIC coprocessors each compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: firstly, parallelization of data IO and reads alignment by a three-stage parallel pipeline; secondly, parallelization enabled by MIC coprocessor technology; thirdly, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA by a combination of those techniques using Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.
Concurrent Probabilistic Simulation of High Temperature Composite Structural Response
NASA Technical Reports Server (NTRS)
Abdi, Frank
1996-01-01
A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data, (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of Phase-2 effort provides a good basis for continuation and warrants Phase-3 government, and industry partnership.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-01-01
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-04-07
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kostin, Mikhail; Mokhov, Nikolai; Niita, Koji
A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA andmore » MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.« less
Longitudinal train dynamics: an overview
NASA Astrophysics Data System (ADS)
Wu, Qing; Spiryagin, Maksym; Cole, Colin
2016-12-01
This paper discusses the evolution of longitudinal train dynamics (LTD) simulations, which covers numerical solvers, vehicle connection systems, air brake systems, wagon dumper systems and locomotives, resistance forces and gravitational components, vehicle in-train instabilities, and computing schemes. A number of potential research topics are suggested, such as modelling of friction, polymer, and transition characteristics for vehicle connection simulations, studies of wagon dumping operations, proper modelling of vehicle in-train instabilities, and computing schemes for LTD simulations. Evidence shows that LTD simulations have evolved with computing capabilities. Currently, advanced component models that directly describe the working principles of the operation of air brake systems, vehicle connection systems, and traction systems are available. Parallel computing is a good solution to combine and simulate all these advanced models. Parallel computing can also be used to conduct three-dimensional long train dynamics simulations.
High-Performance Parallel Analysis of Coupled Problems for Aircraft Propulsion
NASA Technical Reports Server (NTRS)
Felippa, C. A.; Farhat, C.; Park, K. C.; Gumaste, U.; Chen, P.-S.; Lesoinne, M.; Stern, P.
1997-01-01
Applications are described of high-performance computing methods to the numerical simulation of complete jet engines. The methodology focuses on the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion driven by structural displacements. The latter is treated by a ALE technique that models the fluid mesh motion as that of a fictitious mechanical network laid along the edges of near-field elements. New partitioned analysis procedures to treat this coupled three-component problem were developed. These procedures involved delayed corrections and subcycling, and have been successfully tested on several massively parallel computers, including the iPSC-860, Paragon XP/S and the IBM SP2. The NASA-sponsored ENG10 program was used for the global steady state analysis of the whole engine. This program uses a regular FV-multiblock-grid discretization in conjunction with circumferential averaging to include effects of blade forces, loss, combustor heat addition, blockage, bleeds and convective mixing. A load-balancing preprocessor for parallel versions of ENG10 was developed as well as the capability for the first full 3D aeroelastic simulation of a multirow engine stage. This capability was tested on the IBM SP2 parallel supercomputer at NASA Ames.
NASA Astrophysics Data System (ADS)
Bednar, Earl; Drager, Steven L.
2007-04-01
Quantum information processing's objective is to utilize revolutionary computing capability based on harnessing the paradigm shift offered by quantum computing to solve classically hard and computationally challenging problems. Some of our computationally challenging problems of interest include: the capability for rapid image processing, rapid optimization of logistics, protecting information, secure distributed simulation, and massively parallel computation. Currently, one important problem with quantum information processing is that the implementation of quantum computers is difficult to realize due to poor scalability and great presence of errors. Therefore, we have supported the development of Quantum eXpress and QuIDD Pro, two quantum computer simulators running on classical computers for the development and testing of new quantum algorithms and processes. This paper examines the different methods used by these two quantum computing simulators. It reviews both simulators, highlighting each simulators background, interface, and special features. It also demonstrates the implementation of current quantum algorithms on each simulator. It concludes with summary comments on both simulators.
Array-based Hierarchical Mesh Generation in Parallel
Ray, Navamita; Grindeanu, Iulian; Zhao, Xinglin; ...
2015-11-03
In this paper, we describe an array-based hierarchical mesh generation capability through uniform refinement of unstructured meshes for efficient solution of PDE's using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial mesh that can be used for a number of purposes such as multi-level methods to generating large meshes. The capability is developed under the parallel mesh framework “Mesh Oriented dAtaBase” a.k.a MOAB. We describe the underlying data structures and algorithms to generate such hierarchies and present numerical results for computational efficiency and mesh quality. Inmore » conclusion, we also present results to demonstrate the applicability of the developed capability to a multigrid finite-element solver.« less
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80
NASA Astrophysics Data System (ADS)
Kamat, Manohar P.; Watson, Brian C.
1992-02-01
The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern day computers with a single processing unit was estimated at 3 billions of floating point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization as on Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built-up from flat thin shell elements.
NASA Astrophysics Data System (ADS)
Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel
2018-01-01
This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80
NASA Technical Reports Server (NTRS)
Kamat, Manohar P.; Watson, Brian C.
1992-01-01
The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern day computers with a single processing unit was estimated at 3 billions of floating point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization as on Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built-up from flat thin shell elements.
Execution of parallel algorithms on a heterogeneous multicomputer
NASA Astrophysics Data System (ADS)
Isenstein, Barry S.; Greene, Jonathon
1995-04-01
Many aerospace/defense sensing and dual-use applications require high-performance computing, extensive high-bandwidth interconnect and realtime deterministic operation. This paper will describe the architecture of a scalable multicomputer that includes DSP and RISC processors. A single chassis implementation is capable of delivering in excess of 10 GFLOPS of DSP processing power with 2 Gbytes/s of realtime sensor I/O. A software approach to implementing parallel algorithms called the Parallel Application System (PAS) is also presented. An example of applying PAS to a DSP application is shown.
Evaluation of the Intel iWarp parallel processor for space flight applications
NASA Technical Reports Server (NTRS)
Hine, Butler P., III; Fong, Terrence W.
1993-01-01
The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.
A high performance scientific cloud computing environment for materials simulations
NASA Astrophysics Data System (ADS)
Jorissen, K.; Vila, F. D.; Rehr, J. J.
2012-09-01
We describe the development of a scientific cloud computing (SCC) platform that offers high performance computation capability. The platform consists of a scientific virtual machine prototype containing a UNIX operating system and several materials science codes, together with essential interface tools (an SCC toolset) that offers functionality comparable to local compute clusters. In particular, our SCC toolset provides automatic creation of virtual clusters for parallel computing, including tools for execution and monitoring performance, as well as efficient I/O utilities that enable seamless connections to and from the cloud. Our SCC platform is optimized for the Amazon Elastic Compute Cloud (EC2). We present benchmarks for prototypical scientific applications and demonstrate performance comparable to local compute clusters. To facilitate code execution and provide user-friendly access, we have also integrated cloud computing capability in a JAVA-based GUI. Our SCC platform may be an alternative to traditional HPC resources for materials science or quantum chemistry applications.
Simulation of Hypervelocity Impact on Aluminum-Nextel-Kevlar Orbital Debris Shields
NASA Technical Reports Server (NTRS)
Fahrenthold, Eric P.
2000-01-01
An improved hybrid particle-finite element method has been developed for hypervelocity impact simulation. The method combines the general contact-impact capabilities of particle codes with the true Lagrangian kinematics of large strain finite element formulations. Unlike some alternative schemes which couple Lagrangian finite element models with smooth particle hydrodynamics, the present formulation makes no use of slidelines or penalty forces. The method has been implemented in a parallel, three dimensional computer code. Simulations of three dimensional orbital debris impact problems using this parallel hybrid particle-finite element code, show good agreement with experiment and good speedup in parallel computation. The simulations included single and multi-plate shields as well as aluminum and composite shielding materials. at an impact velocity of eleven kilometers per second.
A convenient and accurate parallel Input/Output USB device for E-Prime.
Canto, Rosario; Bufalari, Ilaria; D'Ausilio, Alessandro
2011-03-01
Psychological and neurophysiological experiments require the accurate control of timing and synchrony for Input/Output signals. For instance, a typical Event-Related Potential (ERP) study requires an extremely accurate synchronization of stimulus delivery with recordings. This is typically done via computer software such as E-Prime, and fast communications are typically assured by the Parallel Port (PP). However, the PP is an old and disappearing technology that, for example, is no longer available on portable computers. Here we propose a convenient USB device enabling parallel I/O capabilities. We tested this device against the PP on both a desktop and a laptop machine in different stress tests. Our data demonstrate the accuracy of our system, which suggests that it may be a good substitute for the PP with E-Prime.
A GPU-paralleled implementation of an enhanced face recognition algorithm
NASA Astrophysics Data System (ADS)
Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo
2013-03-01
Face recognition algorithm based on compressed sensing and sparse representation is hotly argued in these years. The scheme of this algorithm increases recognition rate as well as anti-noise capability. However, the computational cost is expensive and has become a main restricting factor for real world applications. In this paper, we introduce a GPU-accelerated hybrid variant of face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out parallel optimization design to take full advantage of many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample size. Finally, Our pFRA, implemented with NVIDIA GPU and Computer Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
NASA Astrophysics Data System (ADS)
Jaber, Khalid Mohammad; Alia, Osama Moh'd.; Shuaib, Mohammed Mahmod
2018-03-01
Finding the optimal parameters that can reproduce experimental data (such as the velocity-density relation and the specific flow rate) is a very important component of the validation and calibration of microscopic crowd dynamic models. Heavy computational demand during parameter search is a known limitation that exists in a previously developed model known as the Harmony Search-Based Social Force Model (HS-SFM). In this paper, a parallel-based mechanism is proposed to reduce the computational time and memory resource utilisation required to find these parameters. More specifically, two MATLAB-based multicore techniques (parfor and create independent jobs) using shared memory are developed by taking advantage of the multithreading capabilities of parallel computing, resulting in a new framework called the Parallel Harmony Search-Based Social Force Model (P-HS-SFM). The experimental results show that the parfor-based P-HS-SFM achieved a better computational time of about 26 h, an efficiency improvement of ? 54% and a speedup factor of 2.196 times in comparison with the HS-SFM sequential processor. The performance of the P-HS-SFM using the create independent jobs approach is also comparable to parfor with a computational time of 26.8 h, an efficiency improvement of about 30% and a speedup of 2.137 times.
NASA Astrophysics Data System (ADS)
Mosby, Matthew; Matouš, Karel
2015-12-01
Three-dimensional simulations capable of resolving the large range of spatial scales, from the failure-zone thickness up to the size of the representative unit cell, in damage mechanics problems of particle reinforced adhesives are presented. We show that resolving this wide range of scales in complex three-dimensional heterogeneous morphologies is essential in order to apprehend fracture characteristics, such as strength, fracture toughness and shape of the softening profile. Moreover, we show that computations that resolve essential physical length scales capture the particle size-effect in fracture toughness, for example. In the vein of image-based computational materials science, we construct statistically optimal unit cells containing hundreds to thousands of particles. We show that these statistically representative unit cells are capable of capturing the first- and second-order probability functions of a given data-source with better accuracy than traditional inclusion packing techniques. In order to accomplish these large computations, we use a parallel multiscale cohesive formulation and extend it to finite strains including damage mechanics. The high-performance parallel computational framework is executed on up to 1024 processing cores. A mesh convergence and a representative unit cell study are performed. Quantifying the complex damage patterns in simulations consisting of tens of millions of computational cells and millions of highly nonlinear equations requires data-mining the parallel simulations, and we propose two damage metrics to quantify the damage patterns. A detailed study of volume fraction and filler size on the macroscopic traction-separation response of heterogeneous adhesives is presented.
Parallel Simulation of Unsteady Turbulent Flames
NASA Technical Reports Server (NTRS)
Menon, Suresh
1996-01-01
Time-accurate simulation of turbulent flames in high Reynolds number flows is a challenging task since both fluid dynamics and combustion must be modeled accurately. To numerically simulate this phenomenon, very large computer resources (both time and memory) are required. Although current vector supercomputers are capable of providing adequate resources for simulations of this nature, the high cost and their limited availability, makes practical use of such machines less than satisfactory. At the same time, the explicit time integration algorithms used in unsteady flow simulations often possess a very high degree of parallelism, making them very amenable to efficient implementation on large-scale parallel computers. Under these circumstances, distributed memory parallel computers offer an excellent near-term solution for greatly increased computational speed and memory, at a cost that may render the unsteady simulations of the type discussed above more feasible and affordable.This paper discusses the study of unsteady turbulent flames using a simulation algorithm that is capable of retaining high parallel efficiency on distributed memory parallel architectures. Numerical studies are carried out using large-eddy simulation (LES). In LES, the scales larger than the grid are computed using a time- and space-accurate scheme, while the unresolved small scales are modeled using eddy viscosity based subgrid models. This is acceptable for the moment/energy closure since the small scales primarily provide a dissipative mechanism for the energy transferred from the large scales. However, for combustion to occur, the species must first undergo mixing at the small scales and then come into molecular contact. Therefore, global models cannot be used. Recently, a new model for turbulent combustion was developed, in which the combustion is modeled, within the subgrid (small-scales) using a methodology that simulates the mixing and the molecular transport and the chemical kinetics within each LES grid cell. Finite-rate kinetics can be included without any closure and this approach actually provides a means to predict the turbulent rates and the turbulent flame speed. The subgrid combustion model requires resolution of the local time scales associated with small-scale mixing, molecular diffusion and chemical kinetics and, therefore, within each grid cell, a significant amount of computations must be carried out before the large-scale (LES resolved) effects are incorporated. Therefore, this approach is uniquely suited for parallel processing and has been implemented on various systems such as: Intel Paragon, IBM SP-2, Cray T3D and SGI Power Challenge (PC) using the system independent Message Passing Interface (MPI) compiler. In this paper, timing data on these machines is reported along with some characteristic results.
Planned development of a 3D computer based on free-space optical interconnects
NASA Astrophysics Data System (ADS)
Neff, John A.; Guarino, David R.
1994-05-01
Free-space optical interconnection has the potential to provide upwards of a million data channels between planes of electronic circuits. This may result in the planar board and backplane structures of today giving away to 3-D stacks of wafers or multi-chip modules interconnected via channels running perpendicular to the processor planes, thereby eliminating much of the packaging overhead. Three-dimensional packaging is very appealing for tightly coupled fine-grained parallel computing where the need for massive numbers of interconnections is severely taxing the capabilities of the planar structures. This paper describes a coordinated effort by four research organizations to demonstrate an operational fine-grained parallel computer that achieves global connectivity through the use of free space optical interconnects.
Numerical propulsion system simulation: An interdisciplinary approach
NASA Technical Reports Server (NTRS)
Nichols, Lester D.; Chamis, Christos C.
1991-01-01
The tremendous progress being made in computational engineering and the rapid growth in computing power that is resulting from parallel processing now make it feasible to consider the use of computer simulations to gain insights into the complex interactions in aerospace propulsion systems and to evaluate new concepts early in the design process before a commitment to hardware is made. Described here is a NASA initiative to develop a Numerical Propulsion System Simulation (NPSS) capability.
Numerical propulsion system simulation - An interdisciplinary approach
NASA Technical Reports Server (NTRS)
Nichols, Lester D.; Chamis, Christos C.
1991-01-01
The tremendous progress being made in computational engineering and the rapid growth in computing power that is resulting from parallel processing now make it feasible to consider the use of computer simulations to gain insights into the complex interactions in aerospace propulsion systems and to evaluate new concepts early in the design process before a commitment to hardware is made. Described here is a NASA initiative to develop a Numerical Propulsion System Simulation (NPSS) capability.
Architectures for single-chip image computing
NASA Astrophysics Data System (ADS)
Gove, Robert J.
1992-04-01
This paper will focus on the architectures of VLSI programmable processing components for image computing applications. TI, the maker of industry-leading RISC, DSP, and graphics components, has developed an architecture for a new-generation of image processors capable of implementing a plurality of image, graphics, video, and audio computing functions. We will show that the use of a single-chip heterogeneous MIMD parallel architecture best suits this class of processors--those which will dominate the desktop multimedia, document imaging, computer graphics, and visualization systems of this decade.
Proceedings of the 3rd Annual Conference on Aerospace Computational Control, volume 2
NASA Technical Reports Server (NTRS)
Bernard, Douglas E. (Editor); Man, Guy K. (Editor)
1989-01-01
This volume of the conference proceedings contain papers and discussions in the following topical areas: Parallel processing; Emerging integrated capabilities; Low order controllers; Real time simulation; Multibody component representation; User environment; and Distributed parameter techniques.
Integration experiences and performance studies of A COTS parallel archive systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Bary
2010-01-01
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf(COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and lessmore » robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address requirements of future archival storage systems.« less
Integration experiments and performance studies of a COTS parallel archive system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Gary
2010-06-16
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching andmore » less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, Is, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petafiop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address requirements of future archival storage systems.« less
Cazzaniga, Paolo; Nobile, Marco S.; Besozzi, Daniela; Bellini, Matteo; Mauri, Giancarlo
2014-01-01
The introduction of general-purpose Graphics Processing Units (GPUs) is boosting scientific applications in Bioinformatics, Systems Biology, and Computational Biology. In these fields, the use of high-performance computing solutions is motivated by the need of performing large numbers of in silico analysis to study the behavior of biological systems in different conditions, which necessitate a computing power that usually overtakes the capability of standard desktop computers. In this work we present coagSODA, a CUDA-powered computational tool that was purposely developed for the analysis of a large mechanistic model of the blood coagulation cascade (BCC), defined according to both mass-action kinetics and Hill functions. coagSODA allows the execution of parallel simulations of the dynamics of the BCC by automatically deriving the system of ordinary differential equations and then exploiting the numerical integration algorithm LSODA. We present the biological results achieved with a massive exploration of perturbed conditions of the BCC, carried out with one-dimensional and bi-dimensional parameter sweep analysis, and show that GPU-accelerated parallel simulations of this model can increase the computational performances up to a 181× speedup compared to the corresponding sequential simulations. PMID:25025072
Clock Agreement Among Parallel Supercomputer Nodes
Jones, Terry R.; Koenig, Gregory A.
2014-04-30
This dataset presents measurements that quantify the clock synchronization time-agreement characteristics among several high performance computers including the current world's most powerful machine for open science, the U.S. Department of Energy's Titan machine sited at Oak Ridge National Laboratory. These ultra-fast machines derive much of their computational capability from extreme node counts (over 18000 nodes in the case of the Titan machine). Time-agreement is commonly utilized by parallel programming applications and tools, distributed programming application and tools, and system software. Our time-agreement measurements detail the degree of time variance between nodes and how that variance changes over time. The dataset includes empirical measurements and the accompanying spreadsheets.
Bhanot, Gyan [Princeton, NJ; Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2009-09-08
Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
Parallel simulation of tsunami inundation on a large-scale supercomputer
NASA Astrophysics Data System (ADS)
Oishi, Y.; Imamura, F.; Sugawara, D.
2013-12-01
An accurate prediction of tsunami inundation is important for disaster mitigation purposes. One approach is to approximate the tsunami wave source through an instant inversion analysis using real-time observation data (e.g., Tsushima et al., 2009) and then use the resulting wave source data in an instant tsunami inundation simulation. However, a bottleneck of this approach is the large computational cost of the non-linear inundation simulation and the computational power of recent massively parallel supercomputers is helpful to enable faster than real-time execution of a tsunami inundation simulation. Parallel computers have become approximately 1000 times faster in 10 years (www.top500.org), and so it is expected that very fast parallel computers will be more and more prevalent in the near future. Therefore, it is important to investigate how to efficiently conduct a tsunami simulation on parallel computers. In this study, we are targeting very fast tsunami inundation simulations on the K computer, currently the fastest Japanese supercomputer, which has a theoretical peak performance of 11.2 PFLOPS. One computing node of the K computer consists of 1 CPU with 8 cores that share memory, and the nodes are connected through a high-performance torus-mesh network. The K computer is designed for distributed-memory parallel computation, so we have developed a parallel tsunami model. Our model is based on TUNAMI-N2 model of Tohoku University, which is based on a leap-frog finite difference method. A grid nesting scheme is employed to apply high-resolution grids only at the coastal regions. To balance the computation load of each CPU in the parallelization, CPUs are first allocated to each nested layer in proportion to the number of grid points of the nested layer. Using CPUs allocated to each layer, 1-D domain decomposition is performed on each layer. In the parallel computation, three types of communication are necessary: (1) communication to adjacent neighbours for the finite difference calculation, (2) communication between adjacent layers for the calculations to connect each layer, and (3) global communication to obtain the time step which satisfies the CFL condition in the whole domain. A preliminary test on the K computer showed the parallel efficiency on 1024 cores was 57% relative to 64 cores. We estimate that the parallel efficiency will be considerably improved by applying a 2-D domain decomposition instead of the present 1-D domain decomposition in future work. The present parallel tsunami model was applied to the 2011 Great Tohoku tsunami. The coarsest resolution layer covers a 758 km × 1155 km region with a 405 m grid spacing. A nesting of five layers was used with the resolution ratio of 1/3 between nested layers. The finest resolution region has 5 m resolution and covers most of the coastal region of Sendai city. To complete 2 hours of simulation time, the serial (non-parallel) computation took approximately 4 days on a workstation. To complete the same simulation on 1024 cores of the K computer, it took 45 minutes which is more than two times faster than real-time. This presentation discusses the updated parallel computational performance and the efficient use of the K computer when considering the characteristics of the tsunami inundation simulation model in relation to the characteristics and capabilities of the K computer.
Scalable Parallel Density-based Clustering and Applications
NASA Astrophysics Data System (ADS)
Patwary, Mostofa Ali
2014-04-01
Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
Array-based, parallel hierarchical mesh refinement algorithms for unstructured meshes
Ray, Navamita; Grindeanu, Iulian; Zhao, Xinglin; ...
2016-08-18
In this paper, we describe an array-based hierarchical mesh refinement capability through uniform refinement of unstructured meshes for efficient solution of PDE's using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial coarse mesh that can be used for a variety of purposes such as in multigrid solvers/preconditioners, to do solution convergence and verification studies and to improve overall parallel efficiency by decreasing I/O bandwidth requirements (by loading smaller meshes and in memory refinement). We also describe a high-order boundary reconstruction capability that can be used tomore » project the new points after refinement using high-order approximations instead of linear projection in order to minimize and provide more control on geometrical errors introduced by curved boundaries.The capability is developed under the parallel unstructured mesh framework "Mesh Oriented dAtaBase" (MOAB Tautges et al. (2004)). We describe the underlying data structures and algorithms to generate such hierarchies in parallel and present numerical results for computational efficiency and effect on mesh quality. Furthermore, we also present results to demonstrate the applicability of the developed capability to study convergence properties of different point projection schemes for various mesh hierarchies and to a multigrid finite-element solver for elliptic problems.« less
Analysis of multigrid methods on massively parallel computers: Architectural implications
NASA Technical Reports Server (NTRS)
Matheson, Lesley R.; Tarjan, Robert E.
1993-01-01
We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.
On the impact of communication complexity in the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
On the impact of communication complexity on the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D. B.; Van Rosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical alorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
McCoy, M; Kissel, L
2002-01-29
We are experimenting with a new computing model to be applied to a new computer dedicated to that model. Several LLNL science teams now have computational requirements, evidenced by the mature scientific applications that have been developed over the past five plus years, that far exceed the capability of the institution's computing resources. Thus, there is increased demand for dedicated, powerful parallel computational systems. Computation can, in the coming year, potentially field a capability system that is low cost because it will be based on a model that employs open source software and because it will use PC (IA32-P4) hardware.more » This incurs significant computer science risk regarding stability and system features but also presents great opportunity. We believe the risks can be managed, but the existence of risk cannot be ignored. In order to justify the budget for this system, we need to make the case that it serves science and, through serving science, serves the institution. That is the point of the meeting and the White Paper that we are proposing to prepare. The questions are listed and the responses received are in this report.« less
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers
Wang, Bei; Ethier, Stephane; Tang, William; ...
2017-06-29
The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability ofmore » the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.« less
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Bei; Ethier, Stephane; Tang, William
The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability ofmore » the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.« less
Xyce Parallel Electronic Simulator - Users' Guide Version 2.1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hutchinson, Scott A; Hoekstra, Robert J.; Russo, Thomas V.
This manual describes the use of theXyceParallel Electronic Simulator.Xycehasbeen designed as a SPICE-compatible, high-performance analog circuit simulator, andhas been written to support the simulation needs of the Sandia National Laboratorieselectrical designers. This development has focused on improving capability over thecurrent state-of-the-art in the following areas:%04Capability to solve extremely large circuit problems by supporting large-scale par-allel computing platforms (up to thousands of processors). Note that this includessupport for most popular parallel and serial computers.%04Improved performance for all numerical kernels (e.g., time integrator, nonlinearand linear solvers) through state-of-the-art algorithms and novel techniques.%04Device models which are specifically tailored to meet Sandia's needs, includingmanymore » radiation-aware devices.3 XyceTMUsers' Guide%04Object-oriented code design and implementation using modern coding practicesthat ensure that theXyceParallel Electronic Simulator will be maintainable andextensible far into the future.Xyceis a parallel code in the most general sense of the phrase - a message passingparallel implementation - which allows it to run efficiently on the widest possible numberof computing platforms. These include serial, shared-memory and distributed-memoryparallel as well as heterogeneous platforms. Careful attention has been paid to thespecific nature of circuit-simulation problems to ensure that optimal parallel efficiencyis achieved as the number of processors grows.The development ofXyceprovides a platform for computational research and de-velopment aimed specifically at the needs of the Laboratory. WithXyce, Sandia hasan %22in-house%22 capability with which both new electrical (e.g., device model develop-ment) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms)research and development can be performed. As a result,Xyceis a unique electricalsimulation capability, designed to meet the unique needs of the laboratory.4 XyceTMUsers' GuideAcknowledgementsThe authors would like to acknowledge the entire Sandia National Laboratories HPEMS(High Performance Electrical Modeling and Simulation) team, including Steve Wix, CarolynBogdan, Regina Schells, Ken Marx, Steve Brandon and Bill Ballard, for their support onthis project. We also appreciate very much the work of Jim Emery, Becky Arnold and MikeWilliamson for the help in reviewing this document.Lastly, a very special thanks to Hue Lai for typesetting this document with LATEX.TrademarksThe information herein is subject to change without notice.Copyrightc 2002-2003 Sandia Corporation. All rights reserved.XyceTMElectronic Simulator andXyceTMtrademarks of Sandia Corporation.Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence DesignSystems, Inc.Silicon Graphics, the Silicon Graphics logo and IRIX are registered trademarks of SiliconGraphics, Inc.Microsoft, Windows and Windows 2000 are registered trademark of Microsoft Corporation.Solaris and UltraSPARC are registered trademarks of Sun Microsystems Corporation.Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation.HP and Alpha are registered trademarks of Hewlett-Packard company.Amtec and TecPlot are trademarks of Amtec Engineering, Inc.Xyce's expression library is based on that inside Spice 3F5 developed by the EECS De-partment at the University of California.All other trademarks are property of their respective owners.ContactsBug Reportshttp://tvrusso.sandia.gov/bugzillaEmailxyce-support%40sandia.govWorld Wide Webhttp://www.cs.sandia.gov/xyce5 XyceTMUsers' GuideThis page is left intentionally blank6« less
Argonne simulation framework for intelligent transportation systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ewing, T.; Doss, E.; Hanebutte, U.
1996-04-01
A simulation framework has been developed which defines a high-level architecture for a large-scale, comprehensive, scalable simulation of an Intelligent Transportation System (ITS). The simulator is designed to run on parallel computers and distributed (networked) computer systems; however, a version for a stand alone workstation is also available. The ITS simulator includes an Expert Driver Model (EDM) of instrumented ``smart`` vehicles with in-vehicle navigation units. The EDM is capable of performing optimal route planning and communicating with Traffic Management Centers (TMC). A dynamic road map data base is sued for optimum route planning, where the data is updated periodically tomore » reflect any changes in road or weather conditions. The TMC has probe vehicle tracking capabilities (display position and attributes of instrumented vehicles), and can provide 2-way interaction with traffic to provide advisories and link times. Both the in-vehicle navigation module and the TMC feature detailed graphical user interfaces that includes human-factors studies to support safety and operational research. Realistic modeling of variations of the posted driving speed are based on human factor studies that take into consideration weather, road conditions, driver`s personality and behavior and vehicle type. The simulator has been developed on a distributed system of networked UNIX computers, but is designed to run on ANL`s IBM SP-X parallel computer system for large scale problems. A novel feature of the developed simulator is that vehicles will be represented by autonomous computer processes, each with a behavior model which performs independent route selection and reacts to external traffic events much like real vehicles. Vehicle processes interact with each other and with ITS components by exchanging messages. With this approach, one will be able to take advantage of emerging massively parallel processor (MPP) systems.« less
Fast Acceleration of 2D Wave Propagation Simulations Using Modern Computational Accelerators
Wang, Wei; Xu, Lifan; Cavazos, John; Huang, Howie H.; Kay, Matthew
2014-01-01
Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case-study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved more than speedup above the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided speedups on GPUs of at least faster than the sequential implementation and faster than a parallelized OpenMP implementation. An implementation of OpenMP on Intel MIC coprocessor provided speedups of with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media. PMID:24497950
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URL's (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
NASA Astrophysics Data System (ADS)
Vivoni, Enrique R.; Mascaro, Giuseppe; Mniszewski, Susan; Fasel, Patricia; Springer, Everett P.; Ivanov, Valeriy Y.; Bras, Rafael L.
2011-10-01
SummaryA major challenge in the use of fully-distributed hydrologic models has been the lack of computational capabilities for high-resolution, long-term simulations in large river basins. In this study, we present the parallel model implementation and real-world hydrologic assessment of the Triangulated Irregular Network (TIN)-based Real-time Integrated Basin Simulator (tRIBS). Our parallelization approach is based on the decomposition of a complex watershed using the channel network as a directed graph. The resulting sub-basin partitioning divides effort among processors and handles hydrologic exchanges across boundaries. Through numerical experiments in a set of nested basins, we quantify parallel performance relative to serial runs for a range of processors, simulation complexities and lengths, and sub-basin partitioning methods, while accounting for inter-run variability on a parallel computing system. In contrast to serial simulations, the parallel model speed-up depends on the variability of hydrologic processes. Load balancing significantly improves parallel speed-up with proportionally faster runs as simulation complexity (domain resolution and channel network extent) increases. The best strategy for large river basins is to combine a balanced partitioning with an extended channel network, with potential savings through a lower TIN resolution. Based on these advances, a wider range of applications for fully-distributed hydrologic models are now possible. This is illustrated through a set of ensemble forecasts that account for precipitation uncertainty derived from a statistical downscaling model.
TDat: An Efficient Platform for Processing Petabyte-Scale Whole-Brain Volumetric Images.
Li, Yuxin; Gong, Hui; Yang, Xiaoquan; Yuan, Jing; Jiang, Tao; Li, Xiangning; Sun, Qingtao; Zhu, Dan; Wang, Zhenyu; Luo, Qingming; Li, Anan
2017-01-01
Three-dimensional imaging of whole mammalian brains at single-neuron resolution has generated terabyte (TB)- and even petabyte (PB)-sized datasets. Due to their size, processing these massive image datasets can be hindered by the computer hardware and software typically found in biological laboratories. To fill this gap, we have developed an efficient platform named TDat, which adopts a novel data reformatting strategy by reading cuboid data and employing parallel computing. In data reformatting, TDat is more efficient than any other software. In data accessing, we adopted parallelization to fully explore the capability for data transmission in computers. We applied TDat in large-volume data rigid registration and neuron tracing in whole-brain data with single-neuron resolution, which has never been demonstrated in other studies. We also showed its compatibility with various computing platforms, image processing software and imaging systems.
A distributed version of the NASA Engine Performance Program
NASA Technical Reports Server (NTRS)
Cours, Jeffrey T.; Curlett, Brian P.
1993-01-01
Distributed NEPP, a version of the NASA Engine Performance Program, uses the original NEPP code but executes it in a distributed computer environment. Multiple workstations connected by a network increase the program's speed and, more importantly, the complexity of the cases it can handle in a reasonable time. Distributed NEPP uses the public domain software package, called Parallel Virtual Machine, allowing it to execute on clusters of machines containing many different architectures. It includes the capability to link with other computers, allowing them to process NEPP jobs in parallel. This paper discusses the design issues and granularity considerations that entered into programming Distributed NEPP and presents the results of timing runs.
Aprà, E; Kowalski, K
2016-03-08
In this paper we discuss the implementation of multireference coupled-cluster formalism with singles, doubles, and noniterative triples (MRCCSD(T)), which is capable of taking advantage of the processing power of the Intel Xeon Phi coprocessor. We discuss the integration of two levels of parallelism underlying the MRCCSD(T) implementation with computational kernels designed to offload the computationally intensive parts of the MRCCSD(T) formalism to Intel Xeon Phi coprocessors. Special attention is given to the enhancement of the parallel performance by task reordering that has improved load balancing in the noniterative part of the MRCCSD(T) calculations. We also discuss aspects regarding efficient optimization and vectorization strategies.
Parallel-Vector Algorithm For Rapid Structural Anlysis
NASA Technical Reports Server (NTRS)
Agarwal, Tarun R.; Nguyen, Duc T.; Storaasli, Olaf O.
1993-01-01
New algorithm developed to overcome deficiency of skyline storage scheme by use of variable-band storage scheme. Exploits both parallel and vector capabilities of modern high-performance computers. Gives engineers and designers opportunity to include more design variables and constraints during optimization of structures. Enables use of more refined finite-element meshes to obtain improved understanding of complex behaviors of aerospace structures leading to better, safer designs. Not only attractive for current supercomputers but also for next generation of shared-memory supercomputers.
NASA Technical Reports Server (NTRS)
Schreiber, Robert; Simon, Horst D.
1992-01-01
We are surveying current projects in the area of parallel supercomputers. The machines considered here will become commercially available in the 1990 - 1992 time frame. All are suitable for exploring the critical issues in applying parallel processors to large scale scientific computations, in particular CFD calculations. This chapter presents an overview of the surveyed machines, and a detailed analysis of the various architectural and technology approaches taken. Particular emphasis is placed on the feasibility of a Teraflops capability following the paths proposed by various developers.
Application of high-performance computing to numerical simulation of human movement
NASA Technical Reports Server (NTRS)
Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
1995-01-01
We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
Shared direct memory access on the Explorer 2-LX
NASA Technical Reports Server (NTRS)
Musgrave, Jeffrey L.
1990-01-01
Advances in Expert System technology and Artificial Intelligence have provided a framework for applying automated Intelligence to the solution of problems which were generally perceived as intractable using more classical approaches. As a result, hybrid architectures and parallel processing capability have become more common in computing environments. The Texas Instruments Explorer II-LX is an example of a machine which combines a symbolic processing environment, and a computationally oriented environment in a single chassis for integrated problem solutions. This user's manual is an attempt to make these capabilities more accessible to a wider range of engineers and programmers with problems well suited to solution in such an environment.
Parallel design of JPEG-LS encoder on graphics processing units
NASA Astrophysics Data System (ADS)
Duan, Hao; Fang, Yong; Huang, Bormin
2012-01-01
With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.
Real science at the petascale.
Saksena, Radhika S; Boghosian, Bruce; Fazendeiro, Luis; Kenway, Owain A; Manos, Steven; Mazzeo, Marco D; Sadiq, S Kashif; Suter, James L; Wright, David; Coveney, Peter V
2009-06-28
We describe computational science research that uses petascale resources to achieve scientific results at unprecedented scales and resolution. The applications span a wide range of domains, from investigation of fundamental problems in turbulence through computational materials science research to biomedical applications at the forefront of HIV/AIDS research and cerebrovascular haemodynamics. This work was mainly performed on the US TeraGrid 'petascale' resource, Ranger, at Texas Advanced Computing Center, in the first half of 2008 when it was the largest computing system in the world available for open scientific research. We have sought to use this petascale supercomputer optimally across application domains and scales, exploiting the excellent parallel scaling performance found on up to at least 32 768 cores for certain of our codes in the so-called 'capability computing' category as well as high-throughput intermediate-scale jobs for ensemble simulations in the 32-512 core range. Furthermore, this activity provides evidence that conventional parallel programming with MPI should be successful at the petascale in the short to medium term. We also report on the parallel performance of some of our codes on up to 65 636 cores on the IBM Blue Gene/P system at the Argonne Leadership Computing Facility, which has recently been named the fastest supercomputer in the world for open science.
A Three-Dimensional Parallel Time-Accurate Turbopump Simulation Procedure Using Overset Grid System
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Chan, William; Kwak, Dochan
2002-01-01
The objective of the current effort is to provide a computational framework for design and analysis of the entire fuel supply system of a liquid rocket engine, including high-fidelity unsteady turbopump flow analysis. This capability is needed to support the design of pump sub-systems for advanced space transportation vehicles that are likely to involve liquid propulsion systems. To date, computational tools for design/analysis of turbopump flows are based on relatively lower fidelity methods. An unsteady, three-dimensional viscous flow analysis tool involving stationary and rotational components for the entire turbopump assembly has not been available for real-world engineering applications. The present effort provides developers with information such as transient flow phenomena at start up, and nonuniform inflows, and will eventually impact on system vibration and structures. In the proposed paper, the progress toward the capability of complete simulation of the turbo-pump for a liquid rocket engine is reported. The Space Shuttle Main Engine (SSME) turbo-pump is used as a test case for evaluation of the hybrid MPI/Open-MP and MLP versions of the INS3D code. CAD to solution auto-scripting capability is being developed for turbopump applications. The relative motion of the grid systems for the rotor-stator interaction was obtained using overset grid techniques. Unsteady computations for the SSME turbo-pump, which contains 114 zones with 34.5 million grid points, are carried out on Origin 3000 systems at NASA Ames Research Center. Results from these time-accurate simulations with moving boundary capability are presented along with the performance of parallel versions of the code.
A Scalable O(N) Algorithm for Large-Scale Parallel First-Principles Molecular Dynamics Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Osei-Kuffuor, Daniel; Fattebert, Jean-Luc
2014-01-01
Traditional algorithms for first-principles molecular dynamics (FPMD) simulations only gain a modest capability increase from current petascale computers, due to their O(N 3) complexity and their heavy use of global communications. To address this issue, we are developing a truly scalable O(N) complexity FPMD algorithm, based on density functional theory (DFT), which avoids global communications. The computational model uses a general nonorthogonal orbital formulation for the DFT energy functional, which requires knowledge of selected elements of the inverse of the associated overlap matrix. We present a scalable algorithm for approximately computing selected entries of the inverse of the overlap matrix,more » based on an approximate inverse technique, by inverting local blocks corresponding to principal submatrices of the global overlap matrix. The new FPMD algorithm exploits sparsity and uses nearest neighbor communication to provide a computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic orbitals are confined, and a cutoff beyond which the entries of the overlap matrix can be omitted when computing selected entries of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to O(100K) atoms on O(100K) processors, with a wall-clock time of O(1) minute per molecular dynamics time step.« less
Retargeting of existing FORTRAN program and development of parallel compilers
NASA Technical Reports Server (NTRS)
Agrawal, Dharma P.
1988-01-01
The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: flexible granularity model, which allows a compromise between two extreme granularity models; communication model, which is capable of precisely describing the interprocessor communication timings and patterns; loop type detection strategy, which identifies different types of loops; critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with some associated communication costs; and loop allocation strategy, which realizes optimum overlapped operations between computation and communication of the system. Using these models, several sample routines of the AIR3D package are examined and tested. It may be noted that automatically generated codes are highly parallelized to provide the maximized degree of parallelism, obtaining the speedup up to a 28 to 32-processor system. A comparison of parallel codes for both the existing and proposed communication model, is performed and the corresponding expected speedup factors are obtained. The experimentation shows that the B-HIVE compiler produces more efficient codes than existing techniques. Work is progressing well in completing the final phase of the compiler. Numerous enhancements are needed to improve the capabilities of the parallelizing compiler.
Advanced techniques in reliability model representation and solution
NASA Technical Reports Server (NTRS)
Palumbo, Daniel L.; Nicol, David M.
1992-01-01
The current tendency of flight control system designs is towards increased integration of applications and increased distribution of computational elements. The reliability analysis of such systems is difficult because subsystem interactions are increasingly interdependent. Researchers at NASA Langley Research Center have been working for several years to extend the capability of Markov modeling techniques to address these problems. This effort has been focused in the areas of increased model abstraction and increased computational capability. The reliability model generator (RMG) is a software tool that uses as input a graphical object-oriented block diagram of the system. RMG uses a failure-effects algorithm to produce the reliability model from the graphical description. The ASSURE software tool is a parallel processing program that uses the semi-Markov unreliability range evaluator (SURE) solution technique and the abstract semi-Markov specification interface to the SURE tool (ASSIST) modeling language. A failure modes-effects simulation is used by ASSURE. These tools were used to analyze a significant portion of a complex flight control system. The successful combination of the power of graphical representation, automated model generation, and parallel computation leads to the conclusion that distributed fault-tolerant system architectures can now be analyzed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
McGhee, J.M.; Roberts, R.M.; Morel, J.E.
1997-06-01
A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner formore » scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.« less
Efficient parallel architecture for highly coupled real-time linear system applications
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo
1988-01-01
A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.
Automatic mesh refinement and parallel load balancing for Fokker-Planck-DSMC algorithm
NASA Astrophysics Data System (ADS)
Küchlin, Stephan; Jenny, Patrick
2018-06-01
Recently, a parallel Fokker-Planck-DSMC algorithm for rarefied gas flow simulation in complex domains at all Knudsen numbers was developed by the authors. Fokker-Planck-DSMC (FP-DSMC) is an augmentation of the classical DSMC algorithm, which mitigates the near-continuum deficiencies in terms of computational cost of pure DSMC. At each time step, based on a local Knudsen number criterion, the discrete DSMC collision operator is dynamically switched to the Fokker-Planck operator, which is based on the integration of continuous stochastic processes in time, and has fixed computational cost per particle, rather than per collision. In this contribution, we present an extension of the previous implementation with automatic local mesh refinement and parallel load-balancing. In particular, we show how the properties of discrete approximations to space-filling curves enable an efficient implementation. Exemplary numerical studies highlight the capabilities of the new code.
Field-Programmable Gate Array Computer in Structural Analysis: An Initial Exploration
NASA Technical Reports Server (NTRS)
Singleterry, Robert C., Jr.; Sobieszczanski-Sobieski, Jaroslaw; Brown, Samuel
2002-01-01
This paper reports on an initial assessment of using a Field-Programmable Gate Array (FPGA) computational device as a new tool for solving structural mechanics problems. A FPGA is an assemblage of binary gates arranged in logical blocks that are interconnected via software in a manner dependent on the algorithm being implemented and can be reprogrammed thousands of times per second. In effect, this creates a computer specialized for the problem that automatically exploits all the potential for parallel computing intrinsic in an algorithm. This inherent parallelism is the most important feature of the FPGA computational environment. It is therefore important that if a problem offers a choice of different solution algorithms, an algorithm of a higher degree of inherent parallelism should be selected. It is found that in structural analysis, an 'analog computer' style of programming, which solves problems by direct simulation of the terms in the governing differential equations, yields a more favorable solution algorithm than current solution methods. This style of programming is facilitated by a 'drag-and-drop' graphic programming language that is supplied with the particular type of FPGA computer reported in this paper. Simple examples in structural dynamics and statics illustrate the solution approach used. The FPGA system also allows linear scalability in computing capability. As the problem grows, the number of FPGA chips can be increased with no loss of computing efficiency due to data flow or algorithmic latency that occurs when a single problem is distributed among many conventional processors that operate in parallel. This initial assessment finds the FPGA hardware and software to be in their infancy in regard to the user conveniences; however, they have enormous potential for shrinking the elapsed time of structural analysis solutions if programmed with algorithms that exhibit inherent parallelism and linear scalability. This potential warrants further development of FPGA-tailored algorithms for structural analysis.
NASA Astrophysics Data System (ADS)
Carter, Rachel; Huhman, Brett; Love, Corey T.; Zenyuk, Iryna V.
2018-03-01
X-ray computed tomography (X-ray CT) across multiple length scales is utilized for the first time to investigate the physical abuse of high C-rate pulsed discharge on cells wired individually and in parallel.. Manufactured lithium iron phosphate cells boasting high rate capability were pulse power tested in both wiring conditions with high discharge currents of 10C for a high number of cycles (up to 1200) until end of life (<80% of initial discharge capacity retained). The parallel assembly reached end of life more rapidly for reasons unknown prior to CT investigations. The investigation revealed evidence of overdischarge in the most degraded cell from the parallel assembly, compared to more traditional failure in the individual cell. The parallel-wired cell exhibited dissolution of copper from the anode current collector and subsequent deposition throughout the separator near the cathode of the cell. This overdischarge-induced copper deposition, notably impossible to confirm with other state of health (SOH) monitoring methods, is diagnosed using CT by rendering the interior current collector without harm or alteration to the active materials. Correlation of CT observations to the electrochemical pulse data from the parallel-wired cells reveals the risk of parallel wiring during high C-rate pulse discharge.
Settgast, Randolph R.; Fu, Pengcheng; Walsh, Stuart D. C.; ...
2016-09-18
This study describes a fully coupled finite element/finite volume approach for simulating field-scale hydraulically driven fractures in three dimensions, using massively parallel computing platforms. The proposed method is capable of capturing realistic representations of local heterogeneities, layering and natural fracture networks in a reservoir. A detailed description of the numerical implementation is provided, along with numerical studies comparing the model with both analytical solutions and experimental results. The results demonstrate the effectiveness of the proposed method for modeling large-scale problems involving hydraulically driven fractures in three dimensions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Settgast, Randolph R.; Fu, Pengcheng; Walsh, Stuart D. C.
This study describes a fully coupled finite element/finite volume approach for simulating field-scale hydraulically driven fractures in three dimensions, using massively parallel computing platforms. The proposed method is capable of capturing realistic representations of local heterogeneities, layering and natural fracture networks in a reservoir. A detailed description of the numerical implementation is provided, along with numerical studies comparing the model with both analytical solutions and experimental results. The results demonstrate the effectiveness of the proposed method for modeling large-scale problems involving hydraulically driven fractures in three dimensions.
NASA Astrophysics Data System (ADS)
Blume, H.; Alexandru, R.; Applegate, R.; Giordano, T.; Kamiya, K.; Kresina, R.
1986-06-01
In a digital diagnostic imaging department, the majority of operations for handling and processing of images can be grouped into a small set of basic operations, such as image data buffering and storage, image processing and analysis, image display, image data transmission and image data compression. These operations occur in almost all nodes of the diagnostic imaging communications network of the department. An image processor architecture was developed in which each of these functions has been mapped into hardware and software modules. The modular approach has advantages in terms of economics, service, expandability and upgradeability. The architectural design is based on the principles of hierarchical functionality, distributed and parallel processing and aims at real time response. Parallel processing and real time response is facilitated in part by a dual bus system: a VME control bus and a high speed image data bus, consisting of 8 independent parallel 16-bit busses, capable of handling combined up to 144 MBytes/sec. The presented image processor is versatile enough to meet the video rate processing needs of digital subtraction angiography, the large pixel matrix processing requirements of static projection radiography, or the broad range of manipulation and display needs of a multi-modality diagnostic work station. Several hardware modules are described in detail. For illustrating the capabilities of the image processor, processed 2000 x 2000 pixel computed radiographs are shown and estimated computation times for executing the processing opera-tions are presented.
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116
A real-time capable software-defined receiver using GPU for adaptive anti-jam GPS sensors.
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.
Neural network applications in telecommunications
NASA Technical Reports Server (NTRS)
Alspector, Joshua
1994-01-01
Neural network capabilities include automatic and organized handling of complex information, quick adaptation to continuously changing environments, nonlinear modeling, and parallel implementation. This viewgraph presentation presents Bellcore work on applications, learning chip computational function, learning system block diagram, neural network equalization, broadband access control, calling-card fraud detection, software reliability prediction, and conclusions.
Hypercube matrix computation task
NASA Technical Reports Server (NTRS)
Calalo, Ruel H.; Imbriale, William A.; Jacobi, Nathan; Liewer, Paulett C.; Lockhart, Thomas G.; Lyzenga, Gregory A.; Lyons, James R.; Manshadi, Farzin; Patterson, Jean E.
1988-01-01
A major objective of the Hypercube Matrix Computation effort at the Jet Propulsion Laboratory (JPL) is to investigate the applicability of a parallel computing architecture to the solution of large-scale electromagnetic scattering problems. Three scattering analysis codes are being implemented and assessed on a JPL/California Institute of Technology (Caltech) Mark 3 Hypercube. The codes, which utilize different underlying algorithms, give a means of evaluating the general applicability of this parallel architecture. The three analysis codes being implemented are a frequency domain method of moments code, a time domain finite difference code, and a frequency domain finite elements code. These analysis capabilities are being integrated into an electromagnetics interactive analysis workstation which can serve as a design tool for the construction of antennas and other radiating or scattering structures. The first two years of work on the Hypercube Matrix Computation effort is summarized. It includes both new developments and results as well as work previously reported in the Hypercube Matrix Computation Task: Final Report for 1986 to 1987 (JPL Publication 87-18).
Visualization of Pulsar Search Data
NASA Astrophysics Data System (ADS)
Foster, R. S.; Wolszczan, A.
1993-05-01
The search for periodic signals from rotating neutron stars or pulsars has been a computationally taxing problem to astronomers for more than twenty-five years. Over this time interval, increases in computational capability have allowed ever more sensitive searches, covering a larger parameter space. The volume of input data and the general presence of radio frequency interference typically produce numerous spurious signals. Visualization of the search output and enhanced real-time processing of significant candidate events allow the pulsar searcher to optimally processes and search for new radio pulsars. The pulsar search algorithm and visualization system presented in this paper currently runs on serial RISC based workstations, a traditional vector based super computer, and a massively parallel computer. A description of the serial software algorithm and its modifications for massively parallel computing are describe. The results of four successive searches for millisecond period radio pulsars using the Arecibo telescope at 430 MHz have resulted in the successful detection of new long-period and millisecond period radio pulsars.
Photochromic molecular implementations of universal computation.
Chaplin, Jack C; Krasnogor, Natalio; Russell, Noah A
2014-12-01
Unconventional computing is an area of research in which novel materials and paradigms are utilised to implement computation. Previously we have demonstrated how registers, logic gates and logic circuits can be implemented, unconventionally, with a biocompatible molecular switch, NitroBIPS, embedded in a polymer matrix. NitroBIPS and related molecules have been shown elsewhere to be capable of modifying many biological processes in a manner that is dependent on its molecular form. Thus, one possible application of this type of unconventional computing is to embed computational processes into biological systems. Here we expand on our earlier proof-of-principle work and demonstrate that universal computation can be implemented using NitroBIPS. We have previously shown that spatially localised computational elements, including registers and logic gates, can be produced. We explain how parallel registers can be implemented, then demonstrate an application of parallel registers in the form of Turing machine tapes, and demonstrate both parallel registers and logic circuits in the form of elementary cellular automata. The Turing machines and elementary cellular automata utilise the same samples and same hardware to implement their registers, logic gates and logic circuits; and both represent examples of universal computing paradigms. This shows that homogenous photochromic computational devices can be dynamically repurposed without invasive reconfiguration. The result represents an important, necessary step towards demonstrating the general feasibility of interfacial computation embedded in biological systems or other unconventional materials and environments. Copyright © 2014 The Authors. Published by Elsevier Ireland Ltd.. All rights reserved.
Karasick, Michael S.; Strip, David R.
1996-01-01
A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.
IOPA: I/O-aware parallelism adaption for parallel programs
Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei
2017-01-01
With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads. PMID:28278236
IOPA: I/O-aware parallelism adaption for parallel programs.
Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei
2017-01-01
With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads.
PARAMESH: A Parallel Adaptive Mesh Refinement Community Toolkit
NASA Technical Reports Server (NTRS)
MacNeice, Peter; Olson, Kevin M.; Mobarry, Clark; deFainchtein, Rosalinda; Packer, Charles
1999-01-01
In this paper, we describe a community toolkit which is designed to provide parallel support with adaptive mesh capability for a large and important class of computational models, those using structured, logically cartesian meshes. The package of Fortran 90 subroutines, called PARAMESH, is designed to provide an application developer with an easy route to extend an existing serial code which uses a logically cartesian structured mesh into a parallel code with adaptive mesh refinement. Alternatively, in its simplest use, and with minimal effort, it can operate as a domain decomposition tool for users who want to parallelize their serial codes, but who do not wish to use adaptivity. The package can provide them with an incremental evolutionary path for their code, converting it first to uniformly refined parallel code, and then later if they so desire, adding adaptivity.
BarraCUDA - a fast short read sequence aligner using graphics processing units
2012-01-01
Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU), extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computational-intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497
Parallelized modelling and solution scheme for hierarchically scaled simulations
NASA Technical Reports Server (NTRS)
Padovan, Joe
1995-01-01
This two-part paper presents the results of a benchmarked analytical-numerical investigation into the operational characteristics of a unified parallel processing strategy for implicit fluid mechanics formulations. This hierarchical poly tree (HPT) strategy is based on multilevel substructural decomposition. The Tree morphology is chosen to minimize memory, communications and computational effort. The methodology is general enough to apply to existing finite difference (FD), finite element (FEM), finite volume (FV) or spectral element (SE) based computer programs without an extensive rewrite of code. In addition to finding large reductions in memory, communications, and computational effort associated with a parallel computing environment, substantial reductions are generated in the sequential mode of application. Such improvements grow with increasing problem size. Along with a theoretical development of general 2-D and 3-D HPT, several techniques for expanding the problem size that the current generation of computers are capable of solving, are presented and discussed. Among these techniques are several interpolative reduction methods. It was found that by combining several of these techniques that a relatively small interpolative reduction resulted in substantial performance gains. Several other unique features/benefits are discussed in this paper. Along with Part 1's theoretical development, Part 2 presents a numerical approach to the HPT along with four prototype CFD applications. These demonstrate the potential of the HPT strategy.
Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daily, Jeffrey A.; Lewis, Robert R.
2011-11-30
Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement calledmore » Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Raboin, P J
1998-01-01
The Computational Mechanics thrust area is a vital and growing facet of the Mechanical Engineering Department at Lawrence Livermore National Laboratory (LLNL). This work supports the development of computational analysis tools in the areas of structural mechanics and heat transfer. Over 75 analysts depend on thrust area-supported software running on a variety of computing platforms to meet the demands of LLNL programs. Interactions with the Department of Defense (DOD) High Performance Computing and Modernization Program and the Defense Special Weapons Agency are of special importance as they support our ParaDyn project in its development of new parallel capabilities for DYNA3D.more » Working with DOD customers has been invaluable to driving this technology in directions mutually beneficial to the Department of Energy. Other projects associated with the Computational Mechanics thrust area include work with the Partnership for a New Generation Vehicle (PNGV) for ''Springback Predictability'' and with the Federal Aviation Administration (FAA) for the ''Development of Methodologies for Evaluating Containment and Mitigation of Uncontained Engine Debris.'' In this report for FY-97, there are five articles detailing three code development activities and two projects that synthesized new code capabilities with new analytic research in damage/failure and biomechanics. The article this year are: (1) Energy- and Momentum-Conserving Rigid-Body Contact for NIKE3D and DYNA3D; (2) Computational Modeling of Prosthetics: A New Approach to Implant Design; (3) Characterization of Laser-Induced Mechanical Failure Damage of Optical Components; (4) Parallel Algorithm Research for Solid Mechanics Applications Using Finite Element Analysis; and (5) An Accurate One-Step Elasto-Plasticity Algorithm for Shell Elements in DYNA3D.« less
Composite structural materials
NASA Technical Reports Server (NTRS)
Ansell, G. S.; Loewy, R. G.; Wiberley, S. E.
1981-01-01
The composite aircraft program component (CAPCOMP) is a graduate level project conducted in parallel with a composite structures program. The composite aircraft program glider (CAPGLIDE) is an undergraduate demonstration project which has as its objectives the design, fabrication, and testing of a foot launched ultralight glider using composite structures. The objective of the computer aided design (COMPAD) portion of the composites project is to provide computer tools for the analysis and design of composite structures. The major thrust of COMPAD is in the finite element area with effort directed at implementing finite element analysis capabilities and developing interactive graphics preprocessing and postprocessing capabilities. The criteria for selecting research projects to be conducted under the innovative and supporting research (INSURE) program are described.
Gpu Implementation of a Viscous Flow Solver on Unstructured Grids
NASA Astrophysics Data System (ADS)
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularities in scientific computing over past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculations, therefore a latest GPU card is preferable of which the peak computing performance and memory bandwidth are much better than a contemporary high-end CPU. We herein focus on the detailed implementation of our GPU targeting Reynolds-averaged Navier-Stokes equations solver based on finite-volume method. The solver employs a vertex-centered scheme on unstructured grids for the sake of being capable of handling complex topologies. Multiple optimizations are carried out to improve the memory accessing performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using explicit Runge-Kutta scheme. The solver with GPU acceleration in this paper is demonstrated to have competitive advantages over the CPU targeting one.
On the relationship between parallel computation and graph embedding
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gupta, A.K.
1989-01-01
The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sizedmore » binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.« less
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced infrastructure remains unexplored in this domain. Within this presentation, our prior and on-going initiatives will be summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs and MICs, to accelerate geocomputation in different applications.
Multiprocessing the Sieve of Eratosthenes
NASA Technical Reports Server (NTRS)
Bokhari, S.
1986-01-01
The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
Chen, Weiliang; De Schutter, Erik
2017-01-01
Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation. PMID:28239346
Chen, Weiliang; De Schutter, Erik
2017-01-01
Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation.
Graph Partitioning for Parallel Applications in Heterogeneous Grid Environments
NASA Technical Reports Server (NTRS)
Bisws, Rupak; Kumar, Shailendra; Das, Sajal K.; Biegel, Bryan (Technical Monitor)
2002-01-01
The problem of partitioning irregular graphs and meshes for parallel computations on homogeneous systems has been extensively studied. However, these partitioning schemes fail when the target system architecture exhibits heterogeneity in resource characteristics. With the emergence of technologies such as the Grid, it is imperative to study the partitioning problem taking into consideration the differing capabilities of such distributed heterogeneous systems. In our model, the heterogeneous system consists of processors with varying processing power and an underlying non-uniform communication network. We present in this paper a novel multilevel partitioning scheme for irregular graphs and meshes, that takes into account issues pertinent to Grid computing environments. Our partitioning algorithm, called MiniMax, generates and maps partitions onto a heterogeneous system with the objective of minimizing the maximum execution time of the parallel distributed application. For experimental performance study, we have considered both a realistic mesh problem from NASA as well as synthetic workloads. Simulation results demonstrate that MiniMax generates high quality partitions for various classes of applications targeted for parallel execution in a distributed heterogeneous environment.
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob
2003-01-01
The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the perfor- mance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.
Communication Studies of DMP and SMP Machines
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors are consistent with the size of messages. The SP-2 is sensitive to message size but yields a much higher communication overlapping because of the communication co-processor. The PowerCHALLENGEarray is not highly sensitive to message size and yields a low communication overlapping. Bitonic sorting yields lower performance compared to FFT due to a smaller computation-to-communication ratio.
Modeling and Validation of Power-split and P2 Parallel Hybrid Electric Vehicles SAE 2013-01-1470)
The Advanced Light-Duty Powertrain and Hybrid Analysis tool was created by EPA to evaluate the Greenhouse Gas (GHG) emissions of Light-Duty (LD) vehicles. It is a physics-based, forward-looking, full vehicle computer simulator capable of analyzing various vehicle types combined ...
Cooley, Philip C.; Turner, Charles F.; O'Reilly, James M.; Allen, Danny R.; Hamill, David N.; Paddock, Richard E.
2011-01-01
This article reviews a multimedia application in the area of survey measurement research: adding audio capabilities to a computer-assisted interviewing system. Hardware and software issues are discussed, and potential hardware devices that operate from DOS platforms are reviewed. Three types of hardware devices are considered: PCMCIA devices, parallel port attachments, and laptops with built-in sound. PMID:22096271
Design and Analysis of Turbomachinery for Space Applications
NASA Technical Reports Server (NTRS)
Dorney, D.; Garcia, Roberto (Technical Monitor)
2002-01-01
This presentation provides an overview of CORSAIR, a three dimensional computational fluid dynamics software code for the analysis of turbomachinery components available from NASA, and discusses its potential use in the design of these parts. Topics covered include: time-dependent equations of motion, grid topology, turbulence models, boundary conditions, parallel simulations and miscellaneous capabilities.
A Three Dimensional Parallel Time Accurate Turbopump Simulation Procedure Using Overset Grid Systems
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Chan, William; Kwak, Dochan
2001-01-01
The objective of the current effort is to provide a computational framework for design and analysis of the entire fuel supply system of a liquid rocket engine, including high-fidelity unsteady turbopump flow analysis. This capability is needed to support the design of pump sub-systems for advanced space transportation vehicles that are likely to involve liquid propulsion systems. To date, computational tools for design/analysis of turbopump flows are based on relatively lower fidelity methods. An unsteady, three-dimensional viscous flow analysis tool involving stationary and rotational components for the entire turbopump assembly has not been available for real-world engineering applications. The present effort provides developers with information such as transient flow phenomena at start up, and non-uniform inflows, and will eventually impact on system vibration and structures. In the proposed paper, the progress toward the capability of complete simulation of the turbo-pump for a liquid rocket engine is reported. The Space Shuttle Main Engine (SSME) turbo-pump is used as a test case for evaluation of the hybrid MPI/Open-MP and MLP versions of the INS3D code. CAD to solution auto-scripting capability is being developed for turbopump applications. The relative motion of the grid systems for the rotor-stator interaction was obtained using overset grid techniques. Unsteady computations for the SSME turbo-pump, which contains 114 zones with 34.5 million grid points, are carried out on Origin 3000 systems at NASA Ames Research Center. Results from these time-accurate simulations with moving boundary capability will be presented along with the performance of parallel versions of the code.
Zhang, Hong; Zapol, Peter; Dixon, David A.; ...
2015-11-17
The Shift-and-invert parallel spectral transformations (SIPs), a computational approach to solve sparse eigenvalue problems, is developed for massively parallel architectures with exceptional parallel scalability and robustness. The capabilities of SIPs are demonstrated by diagonalization of density-functional based tight-binding (DFTB) Hamiltonian and overlap matrices for single-wall metallic carbon nanotubes, diamond nanowires, and bulk diamond crystals. The largest (smallest) example studied is a 128,000 (2000) atom nanotube for which ~330,000 (~5600) eigenvalues and eigenfunctions are obtained in ~190 (~5) seconds when parallelized over 266,144 (16,384) Blue Gene/Q cores. Weak scaling and strong scaling of SIPs are analyzed and the performance of SIPsmore » is compared with other novel methods. Different matrix ordering methods are investigated to reduce the cost of the factorization step, which dominates the time-to-solution at the strong scaling limit. As a result, a parallel implementation of assembling the density matrix from the distributed eigenvectors is demonstrated.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Hong; Zapol, Peter; Dixon, David A.
The Shift-and-invert parallel spectral transformations (SIPs), a computational approach to solve sparse eigenvalue problems, is developed for massively parallel architectures with exceptional parallel scalability and robustness. The capabilities of SIPs are demonstrated by diagonalization of density-functional based tight-binding (DFTB) Hamiltonian and overlap matrices for single-wall metallic carbon nanotubes, diamond nanowires, and bulk diamond crystals. The largest (smallest) example studied is a 128,000 (2000) atom nanotube for which ~330,000 (~5600) eigenvalues and eigenfunctions are obtained in ~190 (~5) seconds when parallelized over 266,144 (16,384) Blue Gene/Q cores. Weak scaling and strong scaling of SIPs are analyzed and the performance of SIPsmore » is compared with other novel methods. Different matrix ordering methods are investigated to reduce the cost of the factorization step, which dominates the time-to-solution at the strong scaling limit. As a result, a parallel implementation of assembling the density matrix from the distributed eigenvectors is demonstrated.« less
NASA Technical Reports Server (NTRS)
Karpoukhin, Mikhii G.; Kogan, Boris Y.; Karplus, Walter J.
1995-01-01
The simulation of heart arrhythmia and fibrillation are very important and challenging tasks. The solution of these problems using sophisticated mathematical models is beyond the capabilities of modern super computers. To overcome these difficulties it is proposed to break the whole simulation problem into two tightly coupled stages: generation of the action potential using sophisticated models. and propagation of the action potential using simplified models. The well known simplified models are compared and modified to bring the rate of depolarization and action potential duration restitution closer to reality. The modified method of lines is used to parallelize the computational process. The conditions for the appearance of 2D spiral waves after the application of a premature beat and the subsequent traveling of the spiral wave inside the simulated tissue are studied.
Signal processing applications of massively parallel charge domain computing devices
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Barhen, Jacob (Inventor); Toomarian, Nikzad (Inventor)
1999-01-01
The present invention is embodied in a charge coupled device (CCD)/charge injection device (CID) architecture capable of performing a Fourier transform by simultaneous matrix vector multiplication (MVM) operations in respective plural CCD/CID arrays in parallel in O(1) steps. For example, in one embodiment, a first CCD/CID array stores charge packets representing a first matrix operator based upon permutations of a Hartley transform and computes the Fourier transform of an incoming vector. A second CCD/CID array stores charge packets representing a second matrix operator based upon different permutations of a Hartley transform and computes the Fourier transform of an incoming vector. The incoming vector is applied to the inputs of the two CCD/CID arrays simultaneously, and the real and imaginary parts of the Fourier transform are produced simultaneously in the time required to perform a single MVM operation in a CCD/CID array.
Parallel Proximity Detection for Computer Simulation
NASA Technical Reports Server (NTRS)
Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)
1997-01-01
The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are includes by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.
Parallel Proximity Detection for Computer Simulations
NASA Technical Reports Server (NTRS)
Steinman, Jeffrey S. (Inventor); Wieland, Frederick P. (Inventor)
1998-01-01
The present invention discloses a system for performing proximity detection in computer simulations on parallel processing architectures utilizing a distribution list which includes movers and sensor coverages which check in and out of grids. Each mover maintains a list of sensors that detect the mover's motion as the mover and sensor coverages check in and out of the grids. Fuzzy grids are included by fuzzy resolution parameters to allow movers and sensor coverages to check in and out of grids without computing exact grid crossings. The movers check in and out of grids while moving sensors periodically inform the grids of their coverage. In addition, a lookahead function is also included for providing a generalized capability without making any limiting assumptions about the particular application to which it is applied. The lookahead function is initiated so that risk-free synchronization strategies never roll back grid events. The lookahead function adds fixed delays as events are scheduled for objects on other nodes.
Gai, Jiading; Obeid, Nady; Holtrop, Joseph L.; Wu, Xiao-Long; Lam, Fan; Fu, Maojing; Haldar, Justin P.; Hwu, Wen-mei W.; Liang, Zhi-Pei; Sutton, Bradley P.
2013-01-01
Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we improve IMPATIENT by removing computational bottlenecks by using a gridding approach to accelerate the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. Through implementation of optimized gridding into our iterative reconstruction scheme, speed-ups of more than a factor of 200 are provided in the improved GPU implementation compared to the previous accelerated GPU code. PMID:23682203
Parallelization of the Coupled Earthquake Model
NASA Technical Reports Server (NTRS)
Block, Gary; Li, P. Peggy; Song, Yuhe T.
2007-01-01
This Web-based tsunami simulation system allows users to remotely run a model on JPL s supercomputers for a given undersea earthquake. At the time of this reporting, predicting tsunamis on the Internet has never happened before. This new code directly couples the earthquake model and the ocean model on parallel computers and improves simulation speed. Seismometers can only detect information from earthquakes; they cannot detect whether or not a tsunami may occur as a result of the earthquake. When earthquake-tsunami models are coupled with the improved computational speed of modern, high-performance computers and constrained by remotely sensed data, they are able to provide early warnings for those coastal regions at risk. The software is capable of testing NASA s satellite observations of tsunamis. It has been successfully tested for several historical tsunamis, has passed all alpha and beta testing, and is well documented for users.
Parallel Calculations in LS-DYNA
NASA Astrophysics Data System (ADS)
Vartanovich Mkrtychev, Oleg; Aleksandrovich Reshetov, Andrey
2017-11-01
Nowadays, structural mechanics exhibits a trend towards numeric solutions being found for increasingly extensive and detailed tasks, which requires that capacities of computing systems be enhanced. Such enhancement can be achieved by different means. E.g., in case a computing system is represented by a workstation, its components can be replaced and/or extended (CPU, memory etc.). In essence, such modification eventually entails replacement of the entire workstation, i.e. replacement of certain components necessitates exchange of others (faster CPUs and memory devices require buses with higher throughput etc.). Special consideration must be given to the capabilities of modern video cards. They constitute powerful computing systems capable of running data processing in parallel. Interestingly, the tools originally designed to render high-performance graphics can be applied for solving problems not immediately related to graphics (CUDA, OpenCL, Shaders etc.). However, not all software suites utilize video cards’ capacities. Another way to increase capacity of a computing system is to implement a cluster architecture: to add cluster nodes (workstations) and to increase the network communication speed between the nodes. The advantage of this approach is extensive growth due to which a quite powerful system can be obtained by combining not particularly powerful nodes. Moreover, separate nodes may possess different capacities. This paper considers the use of a clustered computing system for solving problems of structural mechanics with LS-DYNA software. To establish a range of dependencies a mere 2-node cluster has proven sufficient.
Exploring Asynchronous Many-Task Runtime Systems toward Extreme Scales
DOE Office of Scientific and Technical Information (OSTI.GOV)
Knight, Samuel; Baker, Gavin Matthew; Gamell, Marc
2015-10-01
Major exascale computing reports indicate a number of software challenges to meet the dramatic change of system architectures in near future. While several-orders-of-magnitude increase in parallelism is the most commonly cited of those, hurdles also include performance heterogeneity of compute nodes across the system, increased imbalance between computational capacity and I/O capabilities, frequent system interrupts, and complex hardware architectures. Asynchronous task-parallel programming models show a great promise in addressing these issues, but are not yet fully understood nor developed su ciently for computational science and engineering application codes. We address these knowledge gaps through quantitative and qualitative exploration of leadingmore » candidate solutions in the context of engineering applications at Sandia. In this poster, we evaluate MiniAero code ported to three leading candidate programming models (Charm++, Legion and UINTAH) to examine the feasibility of these models that permits insertion of new programming model elements into an existing code base.« less
Distributed multitasking ITS with PVM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fan, W.C.; Halbleib, J.A. Sr.
1995-12-31
Advances in computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable to or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generatedmore » in a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of the MCNP code on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electronphoton transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load-balancing schemes for homogeneous and heterogeneous networks.« less
Distributed multitasking ITS with PVM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fan, W.C.; Halbleib, J.A. Sr.
1995-02-01
Advances of computer hardware and communication software have made it possible to perform parallel-processing computing on a collection of desktop workstations. For many applications, multitasking on a cluster of high-performance workstations has achieved performance comparable or better than that on a traditional supercomputer. From the point of view of cost-effectiveness, it also allows users to exploit available but unused computational resources, and thus achieve a higher performance-to-cost ratio. Monte Carlo calculations are inherently parallelizable because the individual particle trajectories can be generated independently with minimum need for interprocessor communication. Furthermore, the number of particle histories that can be generated inmore » a given amount of wall-clock time is nearly proportional to the number of processors in the cluster. This is an important fact because the inherent statistical uncertainty in any Monte Carlo result decreases as the number of histories increases. For these reasons, researchers have expended considerable effort to take advantage of different parallel architectures for a variety of Monte Carlo radiation transport codes, often with excellent results. The initial interest in this work was sparked by the multitasking capability of MCNP on a cluster of workstations using the Parallel Virtual Machine (PVM) software. On a 16-machine IBM RS/6000 cluster, it has been demonstrated that MCNP runs ten times as fast as on a single-processor CRAY YMP. In this paper, we summarize the implementation of a similar multitasking capability for the coupled electron/photon transport code system, the Integrated TIGER Series (ITS), and the evaluation of two load balancing schemes for homogeneous and heterogeneous networks.« less
Design, development and test of a capillary pump loop heat pipe
NASA Technical Reports Server (NTRS)
Kroliczek, E. J.; Ku, J.; Ollendorf, S.
1984-01-01
The development of a capillary pump loop (CPL) heat pipe, including computer modeling and breadboard testing, is presented. The computer model is a SINDA-type thermal analyzer, combined with a pressure analyzer, which predicts the transients of the CPL heat pipe during operation. The breadboard is an aluminum/ammonia transport system which contains multiple parallel evaporator and condenser zones within a single loop. Test results have demonstrated the practicality and reliability of such a design, including heat load sharing among evaporators, liquid inventory/temperature control feature, and priming under load. Transport capability for this system is 65 KW-M with individual evaporator pumps managing up to 1.7 KW at a heat flux of 15 W/sq cm. The prediction of the computer model for heat transport capabilities is in good agreement with experimental results.
Wilson, J Adam; Williams, Justin C
2009-01-01
The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
Xyce Parallel Electronic Simulator Users' Guide Version 6.7.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one tomore » develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright c 2002-2017 Sandia Corporation. All rights reserved. Trademarks Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence Design Systems, Inc. Microsoft, Windows and Windows 7 are registered trademarks of Microsoft Corporation. Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation. Amtec and TecPlot are trademarks of Amtec Engineering, Inc. All other trademarks are property of their respective owners. Contacts World Wide Web http://xyce.sandia.gov https://info.sandia.gov/xyce (Sandia only) Email xyce@sandia.gov (outside Sandia) xyce-sandia@sandia.gov (Sandia only) Bug Reports (Sandia only) http://joseki-vm.sandia.gov/bugzilla http://morannon.sandia.gov/bugzilla« less
Advances in computational design and analysis of airbreathing propulsion systems
NASA Technical Reports Server (NTRS)
Klineberg, John M.
1989-01-01
The development of commercial and military aircraft depends, to a large extent, on engine manufacturers being able to achieve significant increases in propulsion capability through improved component aerodynamics, materials, and structures. The recent history of propulsion has been marked by efforts to develop computational techniques that can speed up the propulsion design process and produce superior designs. The availability of powerful supercomputers, such as the NASA Numerical Aerodynamic Simulator, and the potential for even higher performance offered by parallel computer architectures, have opened the door to the use of multi-dimensional simulations to study complex physical phenomena in propulsion systems that have previously defied analysis or experimental observation. An overview of several NASA Lewis research efforts is provided that are contributing toward the long-range goal of a numerical test-cell for the integrated, multidisciplinary design, analysis, and optimization of propulsion systems. Specific examples in Internal Computational Fluid Mechanics, Computational Structural Mechanics, Computational Materials Science, and High Performance Computing are cited and described in terms of current capabilities, technical challenges, and future research directions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jiang, Huaiguang; Zhang, Yingchen; Muljadi, Eduard
In this paper, a short-term load forecasting approach based network reconfiguration is proposed in a parallel manner. Specifically, a support vector regression (SVR) based short-term load forecasting approach is designed to provide an accurate load prediction and benefit the network reconfiguration. Because of the nonconvexity of the three-phase balanced optimal power flow, a second-order cone program (SOCP) based approach is used to relax the optimal power flow problem. Then, the alternating direction method of multipliers (ADMM) is used to compute the optimal power flow in distributed manner. Considering the limited number of the switches and the increasing computation capability, themore » proposed network reconfiguration is solved in a parallel way. The numerical results demonstrate the feasible and effectiveness of the proposed approach.« less
Optimal mapping of irregular finite element domains to parallel processors
NASA Technical Reports Server (NTRS)
Flower, J.; Otto, S.; Salama, M.
1987-01-01
Mapping the solution domain of n-finite elements into N-subdomains that may be processed in parallel by N-processors is an optimal one if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.
A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG
NASA Astrophysics Data System (ADS)
Griffiths, M. K.; Fedun, V.; Erdélyi, R.
2015-03-01
Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.
Speed challenge: a case for hardware implementation in soft-computing
NASA Technical Reports Server (NTRS)
Daud, T.; Stoica, A.; Duong, T.; Keymeulen, D.; Zebulum, R.; Thomas, T.; Thakoor, A.
2000-01-01
For over a decade, JPL has been actively involved in soft computing research on theory, architecture, applications, and electronics hardware. The driving force in all our research activities, in addition to the potential enabling technology promise, has been creation of a niche that imparts orders of magnitude speed advantage by implementation in parallel processing hardware with algorithms made especially suitable for hardware implementation. We review our work on neural networks, fuzzy logic, and evolvable hardware with selected application examples requiring real time response capabilities.
Red Storm usage model :Version 1.12.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jefferson, Karen L.; Sturtevant, Judith E.
Red Storm is an Advanced Simulation and Computing (ASC) funded massively parallel supercomputer located at Sandia National Laboratories (SNL). The Red Storm Usage Model (RSUM) documents the capabilities and the environment provided for the FY05 Tri-Lab Level II Limited Availability Red Storm User Environment Milestone and the FY05 SNL Level II Limited Availability Red Storm Platform Milestone. This document describes specific capabilities, tools, and procedures to support both local and remote users. The model is focused on the needs of the ASC user working in the secure computing environments at Los Alamos National Laboratory (LANL), Lawrence Livermore National Laboratory (LLNL),more » and SNL. Additionally, the Red Storm Usage Model maps the provided capabilities to the Tri-Lab ASC Computing Environment (ACE) requirements. The ACE requirements reflect the high performance computing requirements for the ASC community and have been updated in FY05 to reflect the community's needs. For each section of the RSUM, Appendix I maps the ACE requirements to the Limited Availability User Environment capabilities and includes a description of ACE requirements met and those requirements that are not met in that particular section. The Red Storm Usage Model, along with the ACE mappings, has been issued and vetted throughout the Tri-Lab community.« less
Silberstein, M.; Tzemach, A.; Dovgolevsky, N.; Fishelson, M.; Schuster, A.; Geiger, D.
2006-01-01
Computation of LOD scores is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called “SUPERLINK-ONLINE,” for the computation of multipoint LOD scores of large inbred pedigrees. It achieves high performance via the efficient parallelization of the algorithms in SUPERLINK, a state-of-the-art serial program for these tasks, and through the use of the idle cycles of thousands of personal computers. The main algorithmic challenge has been to efficiently split a large task for distributed execution in a highly dynamic, nondedicated running environment. Notably, the system is available online, which allows computationally intensive analyses to be performed with no need for either the installation of software or the maintenance of a complicated distributed environment. As the system was being developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which are presented in this article. PMID:16685644
CFD analysis of hypersonic, chemically reacting flow fields
NASA Technical Reports Server (NTRS)
Edwards, T. A.
1993-01-01
Design studies are underway for a variety of hypersonic flight vehicles. The National Aero-Space Plane will provide a reusable, single-stage-to-orbit capability for routine access to low earth orbit. Flight-capable satellites will dip into the atmosphere to maneuver to new orbits, while planetary probes will decelerate at their destination by atmospheric aerobraking. To supplement limited experimental capabilities in the hypersonic regime, computational fluid dynamics (CFD) is being used to analyze the flow about these configurations. The governing equations include fluid dynamic as well as chemical species equations, which are being solved with new, robust numerical algorithms. Examples of CFD applications to hypersonic vehicles suggest an important role this technology will play in the development of future aerospace systems. The computational resources needed to obtain solutions are large, but solution adaptive grids, convergence acceleration, and parallel processing may make run times manageable.
The Growth of m-Learning and the Growth of Mobile Computing: Parallel Developments
ERIC Educational Resources Information Center
Caudill, Jason G.
2007-01-01
m-Learning is made possible by the existence and application of mobile hardware and networking technology. By exploring the capabilities of these technologies, it is possible to construct a picture of how different components of m-Learning can be implemented. This paper will explore the major technologies currently in use: portable digital…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hasenkamp, Daren; Sim, Alexander; Wehner, Michael
Extensive computing power has been used to tackle issues such as climate changes, fusion energy, and other pressing scientific challenges. These computations produce a tremendous amount of data; however, many of the data analysis programs currently only run a single processor. In this work, we explore the possibility of using the emerging cloud computing platform to parallelize such sequential data analysis tasks. As a proof of concept, we wrap a program for analyzing trends of tropical cyclones in a set of virtual machines (VMs). This approach allows the user to keep their familiar data analysis environment in the VMs, whilemore » we provide the coordination and data transfer services to ensure the necessary input and output are directed to the desired locations. This work extensively exercises the networking capability of the cloud computing systems and has revealed a number of weaknesses in the current cloud system software. In our tests, we are able to scale the parallel data analysis job to a modest number of VMs and achieve a speedup that is comparable to running the same analysis task using MPI. However, compared to MPI based parallelization, the cloud-based approach has a number of advantages. The cloud-based approach is more flexible because the VMs can capture arbitrary software dependencies without requiring the user to rewrite their programs. The cloud-based approach is also more resilient to failure; as long as a single VM is running, it can make progress while as soon as one MPI node fails the whole analysis job fails. In short, this initial work demonstrates that a cloud computing system is a viable platform for distributed scientific data analyses traditionally conducted on dedicated supercomputing systems.« less
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS
Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.
2014-01-01
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
DOE Office of Scientific and Technical Information (OSTI.GOV)
Preece, D.S.; Knudsen, S.D.
The spherical element computer code DMC (Distinct Motion Code) used to model rock motion resulting from blasting has been enhanced to allow routine computer simulations of bench blasting. The enhancements required for bench blast simulation include: (1) modifying the gas flow portion of DMC, (2) adding a new explosive gas equation of state capability, (3) modifying the porosity calculation, and (4) accounting for blastwell spacing parallel to the face. A parametric study performed with DMC shows logical variation of the face velocity as burden, spacing, blastwell diameter and explosive type are varied. These additions represent a significant advance in themore » capability of DMC which will not only aid in understanding the physics involved in blasting but will also become a blast design tool. 8 refs., 7 figs., 1 tab.« less
Programmable DNA-Mediated Multitasking Processor.
Shu, Jian-Jun; Wang, Qi-Wen; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin
2015-04-30
Because of DNA appealing features as perfect material, including minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm, which employs DNAs as information storing and processing substrates to tackle the computational problems. The massive parallelism of DNA hybridization exhibits transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over the conventional electronic processors with stepwise signal cascade. As an example of multitasking capability, we present an in vitro programmable DNA-mediated optimal route planning processor as a functional unit embedded in contemporary navigation systems. The novel programmable DNA-mediated processor has several advantages over the existing silicon-mediated methods, such as conducting massive data storage and simultaneous processing via much fewer materials than conventional silicon devices.
Parallel Processing Systems for Passive Ranging During Helicopter Flight
NASA Technical Reports Server (NTRS)
Sridhar, Bavavar; Suorsa, Raymond E.; Showman, Robert D. (Technical Monitor)
1994-01-01
The complexity of rotorcraft missions involving operations close to the ground result in high pilot workload. In order to allow a pilot time to perform mission-oriented tasks, sensor-aiding and automation of some of the guidance and control functions are highly desirable. Images from an electro-optical sensor provide a covert way of detecting objects in the flight path of a low-flying helicopter. Passive ranging consists of processing a sequence of images using techniques based on optical low computation and recursive estimation. The passive ranging algorithm has to extract obstacle information from imagery at rates varying from five to thirty or more frames per second depending on the helicopter speed. We have implemented and tested the passive ranging algorithm off-line using helicopter-collected images. However, the real-time data and computation requirements of the algorithm are beyond the capability of any off-the-shelf microprocessor or digital signal processor. This paper describes the computational requirements of the algorithm and uses parallel processing technology to meet these requirements. Various issues in the selection of a parallel processing architecture are discussed and four different computer architectures are evaluated regarding their suitability to process the algorithm in real-time. Based on this evaluation, we conclude that real-time passive ranging is a realistic goal and can be achieved with a short time.
Karasick, M.S.; Strip, D.R.
1996-01-30
A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.
Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing
NASA Technical Reports Server (NTRS)
Fricker, David M.
1997-01-01
The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.
A parallel implementation of a multisensor feature-based range-estimation method
NASA Technical Reports Server (NTRS)
Suorsa, Raymond E.; Sridhar, Banavar
1993-01-01
There are many proposed vision based methods to perform obstacle detection and avoidance for autonomous or semi-autonomous vehicles. All methods, however, will require very high processing rates to achieve real time performance. A system capable of supporting autonomous helicopter navigation will need to extract obstacle information from imagery at rates varying from ten frames per second to thirty or more frames per second depending on the vehicle speed. Such a system will need to sustain billions of operations per second. To reach such high processing rates using current technology, a parallel implementation of the obstacle detection/ranging method is required. This paper describes an efficient and flexible parallel implementation of a multisensor feature-based range-estimation algorithm, targeted for helicopter flight, realized on both a distributed-memory and shared-memory parallel computer.
Emerging Nanophotonic Applications Explored with Advanced Scientific Parallel Computing
NASA Astrophysics Data System (ADS)
Meng, Xiang
The domain of nanoscale optical science and technology is a combination of the classical world of electromagnetics and the quantum mechanical regime of atoms and molecules. Recent advancements in fabrication technology allows the optical structures to be scaled down to nanoscale size or even to the atomic level, which are far smaller than the wavelength they are designed for. These nanostructures can have unique, controllable, and tunable optical properties and their interactions with quantum materials can have important near-field and far-field optical response. Undoubtedly, these optical properties can have many important applications, ranging from the efficient and tunable light sources, detectors, filters, modulators, high-speed all-optical switches; to the next-generation classical and quantum computation, and biophotonic medical sensors. This emerging research of nanoscience, known as nanophotonics, is a highly interdisciplinary field requiring expertise in materials science, physics, electrical engineering, and scientific computing, modeling and simulation. It has also become an important research field for investigating the science and engineering of light-matter interactions that take place on wavelength and subwavelength scales where the nature of the nanostructured matter controls the interactions. In addition, the fast advancements in the computing capabilities, such as parallel computing, also become as a critical element for investigating advanced nanophotonic devices. This role has taken on even greater urgency with the scale-down of device dimensions, and the design for these devices require extensive memory and extremely long core hours. Thus distributed computing platforms associated with parallel computing are required for faster designs processes. Scientific parallel computing constructs mathematical models and quantitative analysis techniques, and uses the computing machines to analyze and solve otherwise intractable scientific challenges. In particular, parallel computing are forms of computation operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. In this dissertation, we report a series of new nanophotonic developments using the advanced parallel computing techniques. The applications include the structure optimizations at the nanoscale to control both the electromagnetic response of materials, and to manipulate nanoscale structures for enhanced field concentration, which enable breakthroughs in imaging, sensing systems (chapter 3 and 4) and improve the spatial-temporal resolutions of spectroscopies (chapter 5). We also report the investigations on the confinement study of optical-matter interactions at the quantum mechanical regime, where the size-dependent novel properties enhanced a wide range of technologies from the tunable and efficient light sources, detectors, to other nanophotonic elements with enhanced functionality (chapter 6 and 7).
Big Computing in Astronomy: Perspectives and Challenges
NASA Astrophysics Data System (ADS)
Pankratius, Victor
2014-06-01
Hardware progress in recent years has led to astronomical instruments gathering large volumes of data. In radio astronomy for instance, the current generation of antenna arrays produces data at Tbits per second, and forthcoming instruments will expand these rates much further. As instruments are increasingly becoming software-based, astronomers will get more exposed to computer science. This talk therefore outlines key challenges that arise at the intersection of computer science and astronomy and presents perspectives on how both communities can collaborate to overcome these challenges.Major problems are emerging due to increases in data rates that are much larger than in storage and transmission capacity, as well as humans being cognitively overwhelmed when attempting to opportunistically scan through Big Data. As a consequence, the generation of scientific insight will become more dependent on automation and algorithmic instrument control. Intelligent data reduction will have to be considered across the entire acquisition pipeline. In this context, the presentation will outline the enabling role of machine learning and parallel computing.BioVictor Pankratius is a computer scientist who joined MIT Haystack Observatory following his passion for astronomy. He is currently leading efforts to advance astronomy through cutting-edge computer science and parallel computing. Victor is also involved in projects such as ALMA Phasing to enhance the ALMA Observatory with Very-Long Baseline Interferometry capabilities, the Event Horizon Telescope, as well as in the Radio Array of Portable Interferometric Detectors (RAPID) to create an analysis environment using parallel computing in the cloud. He has an extensive track record of research in parallel multicore systems and software engineering, with contributions to auto-tuning, debugging, and empirical experiments studying programmers. Victor has worked with major industry partners such as Intel, Sun Labs, and Oracle. He holds a distinguished doctorate and a Habilitation degree in Computer Science from the University of Karlsruhe. Contact him at pankrat@mit.edu, victorpankratius.com, or Twitter @vpankratius.
NASA Astrophysics Data System (ADS)
Lawry, B. J.; Encarnacao, A.; Hipp, J. R.; Chang, M.; Young, C. J.
2011-12-01
With the rapid growth of multi-core computing hardware, it is now possible for scientific researchers to run complex, computationally intensive software on affordable, in-house commodity hardware. Multi-core CPUs (Central Processing Unit) and GPUs (Graphics Processing Unit) are now commonplace in desktops and servers. Developers today have access to extremely powerful hardware that enables the execution of software that could previously only be run on expensive, massively-parallel systems. It is no longer cost-prohibitive for an institution to build a parallel computing cluster consisting of commodity multi-core servers. In recent years, our research team has developed a distributed, multi-core computing system and used it to construct global 3D earth models using seismic tomography. Traditionally, computational limitations forced certain assumptions and shortcuts in the calculation of tomographic models; however, with the recent rapid growth in computational hardware including faster CPU's, increased RAM, and the development of multi-core computers, we are now able to perform seismic tomography, 3D ray tracing and seismic event location using distributed parallel algorithms running on commodity hardware, thereby eliminating the need for many of these shortcuts. We describe Node Resource Manager (NRM), a system we developed that leverages the capabilities of a parallel computing cluster. NRM is a software-based parallel computing management framework that works in tandem with the Java Parallel Processing Framework (JPPF, http://www.jppf.org/), a third party library that provides a flexible and innovative way to take advantage of modern multi-core hardware. NRM enables multiple applications to use and share a common set of networked computers, regardless of their hardware platform or operating system. Using NRM, algorithms can be parallelized to run on multiple processing cores of a distributed computing cluster of servers and desktops, which results in a dramatic speedup in execution time. NRM is sufficiently generic to support applications in any domain, as long as the application is parallelizable (i.e., can be subdivided into multiple individual processing tasks). At present, NRM has been effective in decreasing the overall runtime of several algorithms: 1) the generation of a global 3D model of the compressional velocity distribution in the Earth using tomographic inversion, 2) the calculation of the model resolution matrix, model covariance matrix, and travel time uncertainty for the aforementioned velocity model, and 3) the correlation of waveforms with archival data on a massive scale for seismic event detection. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
NASA Astrophysics Data System (ADS)
Zeitler, T.; Kirchner, T. B.; Hammond, G. E.; Park, H.
2014-12-01
The Waste Isolation Pilot Plant (WIPP) has been developed by the U.S. Department of Energy (DOE) for the geologic (deep underground) disposal of transuranic (TRU) waste. Containment of TRU waste at the WIPP is regulated by the U.S. Environmental Protection Agency (EPA). The DOE demonstrates compliance with the containment requirements by means of performance assessment (PA) calculations. WIPP PA calculations estimate the probability and consequence of potential radionuclide releases from the repository to the accessible environment for a regulatory period of 10,000 years after facility closure. The long-term performance of the repository is assessed using a suite of sophisticated computational codes. In a broad modernization effort, the DOE has overseen the transfer of these codes to modern hardware and software platforms. Additionally, there is a current effort to establish new performance assessment capabilities through the further development of the PFLOTRAN software, a state-of-the-art massively parallel subsurface flow and reactive transport code. Improvements to the current computational environment will result in greater detail in the final models due to the parallelization afforded by the modern code. Parallelization will allow for relatively faster calculations, as well as a move from a two-dimensional calculation grid to a three-dimensional grid. The result of the modernization effort will be a state-of-the-art subsurface flow and transport capability that will serve WIPP PA into the future. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. This research is funded by WIPP programs administered by the Office of Environmental Management (EM) of the U.S Department of Energy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan
This report presents a specification for the Portals 4 networ k programming interface. Portals 4 is intended to allow scalable, high-performance network communication betwee n nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded syste ms. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platfor ms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is tarmore » geted to the next generation of machines employing advanced network interface architectures that support enh anced offload capabilities.« less
Parallel 3D Finite Element Numerical Modelling of DC Electron Guns
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prudencio, E.; Candel, A.; Ge, L.
2008-02-04
In this paper we present Gun3P, a parallel 3D finite element application that the Advanced Computations Department at the Stanford Linear Accelerator Center is developing for the analysis of beam formation in DC guns and beam transport in klystrons. Gun3P is targeted specially to complex geometries that cannot be described by 2D models and cannot be easily handled by finite difference discretizations. Its parallel capability allows simulations with more accuracy and less processing time than packages currently available. We present simulation results for the L-band Sheet Beam Klystron DC gun, in which case Gun3P is able to reduce simulation timemore » from days to some hours.« less
The Portals 4.0 network programming interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin
2012-11-01
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generationmore » of machines employing advanced network interface architectures that support enhanced offload capabilities.« less
A parallel data management system for large-scale NASA datasets
NASA Technical Reports Server (NTRS)
Srivastava, Jaideep
1993-01-01
The past decade has experienced a phenomenal growth in the amount of data and resultant information generated by NASA's operations and research projects. A key application is the reprocessing problem which has been identified to require data management capabilities beyond those available today (PRAT93). The Intelligent Information Fusion (IIF) system (ROEL91) is an ongoing NASA project which has similar requirements. Deriving our understanding of NASA's future data management needs based on the above, this paper describes an approach to using parallel computer systems (processor and I/O architectures) to develop an efficient parallel database management system to address the needs. Specifically, we propose to investigate issues in low-level record organizations and management, complex query processing, and query compilation and scheduling.
Collectively loading an application in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.
Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units
NASA Astrophysics Data System (ADS)
Corral, Roque; Gisbert, Fernando; Pueblas, Jesus
2017-02-01
The implementation of an edge-based three-dimensional Reynolds Average Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processor unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passage Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
NASA Astrophysics Data System (ADS)
Li, J.; Zhang, T.; Huang, Q.; Liu, Q.
2014-12-01
Today's climate datasets are featured with large volume, high degree of spatiotemporal complexity and evolving fast overtime. As visualizing large volume distributed climate datasets is computationally intensive, traditional desktop based visualization applications fail to handle the computational intensity. Recently, scientists have developed remote visualization techniques to address the computational issue. Remote visualization techniques usually leverage server-side parallel computing capabilities to perform visualization tasks and deliver visualization results to clients through network. In this research, we aim to build a remote parallel visualization platform for visualizing and analyzing massive climate data. Our visualization platform was built based on Paraview, which is one of the most popular open source remote visualization and analysis applications. To further enhance the scalability and stability of the platform, we have employed cloud computing techniques to support the deployment of the platform. In this platform, all climate datasets are regular grid data which are stored in NetCDF format. Three types of data access methods are supported in the platform: accessing remote datasets provided by OpenDAP servers, accessing datasets hosted on the web visualization server and accessing local datasets. Despite different data access methods, all visualization tasks are completed at the server side to reduce the workload of clients. As a proof of concept, we have implemented a set of scientific visualization methods to show the feasibility of the platform. Preliminary results indicate that the framework can address the computation limitation of desktop based visualization applications.
A case study for cloud based high throughput analysis of NGS data using the globus genomics system
Bhuvaneshwar, Krithika; Sulakhe, Dinanath; Gauba, Robinder; ...
2015-01-01
Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the “Globus Genomics” system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomicsmore » system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel and it also helps meet the scale-out analysis needs of modern translational genomics research.« less
A case study for cloud based high throughput analysis of NGS data using the globus genomics system
Bhuvaneshwar, Krithika; Sulakhe, Dinanath; Gauba, Robinder; Rodriguez, Alex; Madduri, Ravi; Dave, Utpal; Lacinski, Lukasz; Foster, Ian; Gusev, Yuriy; Madhavan, Subha
2014-01-01
Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the “Globus Genomics” system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-endNGS analysis requirements. The Globus Genomics system is built on Amazon 's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel and it also helps meet the scale-out analysis needs of modern translational genomics research. PMID:26925205
An efficient dynamic load balancing algorithm
NASA Astrophysics Data System (ADS)
Lagaros, Nikos D.
2014-01-01
In engineering problems, randomness and uncertainties are inherent. Robust design procedures, formulated in the framework of multi-objective optimization, have been proposed in order to take into account sources of randomness and uncertainty. These design procedures require orders of magnitude more computational effort than conventional analysis or optimum design processes since a very large number of finite element analyses is required to be dealt. It is therefore an imperative need to exploit the capabilities of computing resources in order to deal with this kind of problems. In particular, parallel computing can be implemented at the level of metaheuristic optimization, by exploiting the physical parallelization feature of the nondominated sorting evolution strategies method, as well as at the level of repeated structural analyses required for assessing the behavioural constraints and for calculating the objective functions. In this study an efficient dynamic load balancing algorithm for optimum exploitation of available computing resources is proposed and, without loss of generality, is applied for computing the desired Pareto front. In such problems the computation of the complete Pareto front with feasible designs only, constitutes a very challenging task. The proposed algorithm achieves linear speedup factors and almost 100% speedup factor values with reference to the sequential procedure.
Jiang, Hanyu; Ganesan, Narayan
2016-02-27
HMMER software suite is widely used for analysis of homologous protein and nucleotide sequences with high sensitivity. The latest version of hmmsearch in HMMER 3.x, utilizes heuristic-pipeline which consists of MSV/SSV (Multiple/Single ungapped Segment Viterbi) stage, P7Viterbi stage and the Forward scoring stage to accelerate homology detection. Since the latest version is highly optimized for performance on modern multi-core CPUs with SSE capabilities, only a few acceleration attempts report speedup. However, the most compute intensive tasks within the pipeline (viz., MSV/SSV and P7Viterbi stages) still stand to benefit from the computational capabilities of massively parallel processors. A Multi-Tiered Parallel Framework (CUDAMPF) implemented on CUDA-enabled GPUs presented here, offers a finer-grained parallelism for MSV/SSV and Viterbi algorithms. We couple SIMT (Single Instruction Multiple Threads) mechanism with SIMD (Single Instructions Multiple Data) video instructions with warp-synchronism to achieve high-throughput processing and eliminate thread idling. We also propose a hardware-aware optimal allocation scheme of scarce resources like on-chip memory and caches in order to boost performance and scalability of CUDAMPF. In addition, runtime compilation via NVRTC available with CUDA 7.0 is incorporated into the presented framework that not only helps unroll innermost loop to yield upto 2 to 3-fold speedup than static compilation but also enables dynamic loading and switching of kernels depending on the query model size, in order to achieve optimal performance. CUDAMPF is designed as a hardware-aware parallel framework for accelerating computational hotspots within the hmmsearch pipeline as well as other sequence alignment applications. It achieves significant speedup by exploiting hierarchical parallelism on single GPU and takes full advantage of limited resources based on their own performance features. In addition to exceeding performance of other acceleration attempts, comprehensive evaluations against high-end CPUs (Intel i5, i7 and Xeon) shows that CUDAMPF yields upto 440 GCUPS for SSV, 277 GCUPS for MSV and 14.3 GCUPS for P7Viterbi all with 100 % accuracy, which translates to a maximum speedup of 37.5, 23.1 and 11.6-fold for MSV, SSV and P7Viterbi respectively. The source code is available at https://github.com/Super-Hippo/CUDAMPF.
Cellular automata with object-oriented features for parallel molecular network modeling.
Zhu, Hao; Wu, Yinghui; Huang, Sui; Sun, Yan; Dhar, Pawan
2005-06-01
Cellular automata are an important modeling paradigm for studying the dynamics of large, parallel systems composed of multiple, interacting components. However, to model biological systems, cellular automata need to be extended beyond the large-scale parallelism and intensive communication in order to capture two fundamental properties characteristic of complex biological systems: hierarchy and heterogeneity. This paper proposes extensions to a cellular automata language, Cellang, to meet this purpose. The extended language, with object-oriented features, can be used to describe the structure and activity of parallel molecular networks within cells. Capabilities of this new programming language include object structure to define molecular programs within a cell, floating-point data type and mathematical functions to perform quantitative computation, message passing capability to describe molecular interactions, as well as new operators, statements, and built-in functions. We discuss relevant programming issues of these features, including the object-oriented description of molecular interactions with molecule encapsulation, message passing, and the description of heterogeneity and anisotropy at the cell and molecule levels. By enabling the integration of modeling at the molecular level with system behavior at cell, tissue, organ, or even organism levels, the program will help improve our understanding of how complex and dynamic biological activities are generated and controlled by parallel functioning of molecular networks. Index Terms-Cellular automata, modeling, molecular network, object-oriented.
Development of Reduced-Order Models for Aeroelastic and Flutter Prediction Using the CFL3Dv6.0 Code
NASA Technical Reports Server (NTRS)
Silva, Walter A.; Bartels, Robert E.
2002-01-01
A reduced-order model (ROM) is developed for aeroelastic analysis using the CFL3D version 6.0 computational fluid dynamics (CFD) code, recently developed at the NASA Langley Research Center. This latest version of the flow solver includes a deforming mesh capability, a modal structural definition for nonlinear aeroelastic analyses, and a parallelization capability that provides a significant increase in computational efficiency. Flutter results for the AGARD 445.6 Wing computed using CFL3D v6.0 are presented, including discussion of associated computational costs. Modal impulse responses of the unsteady aerodynamic system are then computed using the CFL3Dv6 code and transformed into state-space form. Important numerical issues associated with the computation of the impulse responses are presented. The unsteady aerodynamic state-space ROM is then combined with a state-space model of the structure to create an aeroelastic simulation using the MATLAB/SIMULINK environment. The MATLAB/SIMULINK ROM is used to rapidly compute aeroelastic transients including flutter. The ROM shows excellent agreement with the aeroelastic analyses computed using the CFL3Dv6.0 code directly.
Life and dynamic capacity modeling for aircraft transmissions
NASA Technical Reports Server (NTRS)
Savage, Michael
1991-01-01
A computer program to simulate the dynamic capacity and life of parallel shaft aircraft transmissions is presented. Five basic configurations can be analyzed: single mesh, compound, parallel, reverted, and single plane reductions. In execution, the program prompts the user for the data file prefix name, takes input from a ASCII file, and writes its output to a second ASCII file with the same prefix name. The input data file includes the transmission configuration, the input shaft torque and speed, and descriptions of the transmission geometry and the component gears and bearings. The program output file describes the transmission, its components, their capabilities, locations, and loads. It also lists the dynamic capability, ninety percent reliability, and mean life of each component and the transmission as a system. Here, the program, its input and output files, and the theory behind the operation of the program are described.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ilsche, Thomas; Schuchart, Joseph; Cope, Joseph
Event tracing is an important tool for understanding the performance of parallel applications. As concurrency increases in leadership-class computing systems, the quantity of performance log data can overload the parallel file system, perturbing the application being observed. In this work we present a solution for event tracing at leadership scales. We enhance the I/O forwarding system software to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system for this type of traffic. Furthermore, we augment the I/O forwarding system with a write buffering capability to limit the impactmore » of artificial perturbations from log data accesses on traced applications. To validate the approach, we modify the Vampir tracing tool to take advantage of this new capability and show that the approach increases the maximum traced application size by a factor of 5x to more than 200,000 processors.« less
Wilson, Justin; Dai, Manhong; Jakupovic, Elvis; Watson, Stanley; Meng, Fan
2007-01-01
Modern video cards and game consoles typically have much better performance to price ratios than that of general purpose CPUs. The parallel processing capabilities of game hardware are well-suited for high throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.
Novel wavelength diversity technique for high-speed atmospheric turbulence compensation
NASA Astrophysics Data System (ADS)
Arrasmith, William W.; Sullivan, Sean F.
2010-04-01
The defense, intelligence, and homeland security communities are driving a need for software dominant, real-time or near-real time atmospheric turbulence compensated imagery. The development of parallel processing capabilities are finding application in diverse areas including image processing, target tracking, pattern recognition, and image fusion to name a few. A novel approach to the computationally intensive case of software dominant optical and near infrared imaging through atmospheric turbulence is addressed in this paper. Previously, the somewhat conventional wavelength diversity method has been used to compensate for atmospheric turbulence with great success. We apply a new correlation based approach to the wavelength diversity methodology using a parallel processing architecture enabling high speed atmospheric turbulence compensation. Methods for optical imaging through distributed turbulence are discussed, simulation results are presented, and computational and performance assessments are provided.
Evaluating local indirect addressing in SIMD proc essors
NASA Technical Reports Server (NTRS)
Middleton, David; Tomboulian, Sherryl
1989-01-01
In the design of parallel computers, there exists a tradeoff between the number and power of individual processors. The single instruction stream, multiple data stream (SIMD) model of parallel computers lies at one extreme of the resulting spectrum. The available hardware resources are devoted to creating the largest possible number of processors, and consequently each individual processor must use the fewest possible resources. Disagreement exists as to whether SIMD processors should be able to generate addresses individually into their local data memory, or all processors should access the same address. The tradeoff is examined between the increased capability and the reduced number of processors that occurs in this single instruction stream, multiple, locally addressed, data (SIMLAD) model. Factors are assembled that affect this design choice, and the SIMLAD model is compared with the bare SIMD and the MIMD models.
Global MHD simulation of magnetosphere using HPF
NASA Astrophysics Data System (ADS)
Ogino, T.
We have translated a 3-dimensional magnetohydrodynamic (MHD) simulation code of the Earth's magnetosphere from VPP Fortran to HPF/JA on the Fujitsu VPP5000/56 vector-parallel supercomputer and the MHD code was fully vectorized and fully parallelized in VPP Fortran. The entire performance and capability of the HPF MHD code could be shown to be almost comparable to that of VPP Fortran. A 3-dimensional global MHD simulation of the earth's magnetosphere was performed at a speed of over 400 Gflops with an efficiency of 76.5% using 56 PEs of Fujitsu VPP5000/56 in vector and parallel computation that permitted comparison with catalog values. We have concluded that fluid and MHD codes that are fully vectorized and fully parallelized in VPP Fortran can be translated with relative ease to HPF/JA, and a code in HPF/JA may be expected to perform comparably to the same code written in VPP Fortran.
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1990-01-01
Practical engineering application can often be formulated in the form of a constrained optimization problem. There are several solution algorithms for solving a constrained optimization problem. One approach is to convert a constrained problem into a series of unconstrained problems. Furthermore, unconstrained solution algorithms can be used as part of the constrained solution algorithms. Structural optimization is an iterative process where one starts with an initial design, a finite element structure analysis is then performed to calculate the response of the system (such as displacements, stresses, eigenvalues, etc.). Based upon the sensitivity information on the objective and constraint functions, an optimizer such as ADS or IDESIGN, can be used to find the new, improved design. For the structural analysis phase, the equation solver for the system of simultaneous, linear equations plays a key role since it is needed for either static, or eigenvalue, or dynamic analysis. For practical, large-scale structural analysis-synthesis applications, computational time can be excessively large. Thus, it is necessary to have a new structural analysis-synthesis code which employs new solution algorithms to exploit both parallel and vector capabilities offered by modern, high performance computers such as the Convex, Cray-2 and Cray-YMP computers. The objective of this research project is, therefore, to incorporate the latest development in the parallel-vector equation solver, PVSOLVE into the widely popular finite-element production code, such as the SAP-4. Furthermore, several nonlinear unconstrained optimization subroutines have also been developed and tested under a parallel computer environment. The unconstrained optimization subroutines are not only useful in their own right, but they can also be incorporated into a more popular constrained optimization code, such as ADS.
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
Design of a bounded wave EMP (Electromagnetic Pulse) simulator
NASA Astrophysics Data System (ADS)
Sevat, P. A. A.
1989-06-01
Electromagnetic Pulse (EMP) simulators are used to simulate the EMP generated by a nuclear weapon and to harden equipment against the effects of EMP. At present, DREO has a 1 m EMP simulator for testing computer terminal size equipment. To develop the R and D capability for testing larger objects, such as a helicopter, a much bigger threat level facility is required. This report concerns the design of a bounded wave EMP simulator suitable for testing large size equipment. Different types of simulators are described and their pros and cons are discussed. A bounded wave parallel plate type simulator is chosen for it's efficiency and the least environmental impact. Detailed designs are given for 6 m and 10 m parallel plate type wire grid simulators. Electromagnetic fields inside and outside the simulators are computed. Preliminary specifications for a pulse generator required for the simulator are also given. Finally, the electromagnetic fields radiated from the simulator are computed and discussed.
NASA Technical Reports Server (NTRS)
Swisshelm, Julie M.
1989-01-01
An explicit flow solver, applicable to the hierarchy of model equations ranging from Euler to full Navier-Stokes, is combined with several techniques designed to reduce computational expense. The computational domain consists of local grid refinements embedded in a global coarse mesh, where the locations of these refinements are defined by the physics of the flow. Flow characteristics are also used to determine which set of model equations is appropriate for solution in each region, thereby reducing not only the number of grid points at which the solution must be obtained, but also the computational effort required to get that solution. Acceleration to steady-state is achieved by applying multigrid on each of the subgrids, regardless of the particular model equations being solved. Since each of these components is explicit, advantage can readily be taken of the vector- and parallel-processing capabilities of machines such as the Cray X-MP and Cray-2.
Konstantinidis, Evdokimos I; Frantzidis, Christos A; Pappas, Costas; Bamidis, Panagiotis D
2012-07-01
In this paper the feasibility of adopting Graphic Processor Units towards real-time emotion aware computing is investigated for boosting the time consuming computations employed in such applications. The proposed methodology was employed in analysis of encephalographic and electrodermal data gathered when participants passively viewed emotional evocative stimuli. The GPU effectiveness when processing electroencephalographic and electrodermal recordings is demonstrated by comparing the execution time of chaos/complexity analysis through nonlinear dynamics (multi-channel correlation dimension/D2) and signal processing algorithms (computation of skin conductance level/SCL) into various popular programming environments. Apart from the beneficial role of parallel programming, the adoption of special design techniques regarding memory management may further enhance the time minimization which approximates a factor of 30 in comparison with ANSI C language (single-core sequential execution). Therefore, the use of GPU parallel capabilities offers a reliable and robust solution for real-time sensing the user's affective state. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Afshar, Yaser; Sbalzarini, Ivo F.
2016-01-01
Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 1010 pixels), but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments. PMID:27046144
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas
The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning for balancing computational work in pushing particlesmore » and in grid related work, scalable and accurate discretization algorithms for non-linear Coulomb collisions, and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership class computer.« less
Afshar, Yaser; Sbalzarini, Ivo F
2016-01-01
Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 10(10) pixels), but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments.
A real-time spike sorting method based on the embedded GPU.
Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng
2017-07-01
Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. Graphic processing unit (GPU) with high parallel computing capability might provide an alternative solution for increasing real-time computational demands on spike sorting. This study reported a method of real-time spike sorting through computing unified device architecture (CUDA) which was implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on the principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized in the thread memory model of GPU. Our results showed that the GPU-based classifier on TK1 is 37.92 times faster than the MATLAB-based classifier on PC while their accuracies were the same with each other. The high-performance computing features of embedded GPU demonstrated in our studies suggested that the embedded GPU provide a promising platform for the real-time neural signal processing.
Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor
NASA Technical Reports Server (NTRS)
Field, E. I.
1975-01-01
The ILLIAC IV, a fourth generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation, makes the ILLIAC well suited for performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics are summarized of the ILLIAC and the ARPANET, a telecommunications network which spans the continent making the ILLIAC accessible to nearly all major industrial centers in the United States. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications are presented for performing that task with manpower estimates and schedules to correspond.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seefeldt, Ben; Sondak, David; Hensinger, David M.
Drekar is an application code that solves partial differential equations for fluids that can be optionally coupled to electromagnetics. Drekar solves low-mach compressible and incompressible computational fluid dynamics (CFD), compressible and incompressible resistive magnetohydrodynamics (MHD), and multiple species plasmas interacting with electromagnetic fields. Drekar discretization technology includes continuous and discontinuous finite element formulations, stabilized finite element formulations, mixed integration finite element bases (nodal, edge, face, volume) and an initial arbitrary Lagrangian Eulerian (ALE) capability. Drekar contains the implementation of the discretized physics and leverages the open source Trilinos project for both parallel solver capabilities and general finite element discretization tools.more » The code will be released open source under a BSD license. The code is used for fundamental research for simulation of fluids and plasmas on high performance computing environments.« less
Neuronal integration of dynamic sources: Bayesian learning and Bayesian inference.
Siegelmann, Hava T; Holzman, Lars E
2010-09-01
One of the brain's most basic functions is integrating sensory data from diverse sources. This ability causes us to question whether the neural system is computationally capable of intelligently integrating data, not only when sources have known, fixed relative dependencies but also when it must determine such relative weightings based on dynamic conditions, and then use these learned weightings to accurately infer information about the world. We suggest that the brain is, in fact, fully capable of computing this parallel task in a single network and describe a neural inspired circuit with this property. Our implementation suggests the possibility that evidence learning requires a more complex organization of the network than was previously assumed, where neurons have different specialties, whose emergence brings the desired adaptivity seen in human online inference.
NASA Astrophysics Data System (ADS)
Yan, Hui; Wang, K. G.; Jones, Jim E.
2016-06-01
A parallel algorithm for large-scale three-dimensional phase-field simulations of phase coarsening is developed and implemented on high-performance architectures. From the large-scale simulations, a new kinetics in phase coarsening in the region of ultrahigh volume fraction is found. The parallel implementation is capable of harnessing the greater computer power available from high-performance architectures. The parallelized code enables increase in three-dimensional simulation system size up to a 5123 grid cube. Through the parallelized code, practical runtime can be achieved for three-dimensional large-scale simulations, and the statistical significance of the results from these high resolution parallel simulations are greatly improved over those obtainable from serial simulations. A detailed performance analysis on speed-up and scalability is presented, showing good scalability which improves with increasing problem size. In addition, a model for prediction of runtime is developed, which shows a good agreement with actual run time from numerical tests.
GPU COMPUTING FOR PARTICLE TRACKING
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) bring massive parallel computing capabilities to numerical calculation. However, the unique architecture of GPU requires a comprehensive understanding of the hardware and programming model to be able to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculationmore » of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU which consists of 14 multi-processois (MP) with 32 cores on each MP, therefore a total of 448 cores, to host thousands ot threads dynamically. Thread is a logical execution unit of the program on GPU. In the GPU programming model, threads are grouped into a collection of blocks Within each block, multiple threads share the same code, and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.« less
Havery Mudd 2014-2015 Computer Science Conduit Clinic Final Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aspesi, G; Bai, J; Deese, R
2015-05-12
Conduit, a new open-source library developed at Lawrence Livermore National Laboratories, provides a C++ application programming interface (API) to describe and access scientific data. Conduit’s primary use is for inmemory data exchange in high performance computing (HPC) applications. Our team tested and improved Conduit to make it more appealing to potential adopters in the HPC community. We extended Conduit’s capabilities by prototyping four libraries: one for parallel communication using MPI, one for I/O functionality, one for aggregating performance data, and one for data visualization.
NASA Technical Reports Server (NTRS)
Nosenchuck, D. M.; Littman, M. G.
1986-01-01
The Navier-Stokes computer (NSC) has been developed for solving problems in fluid mechanics involving complex flow simulations that require more speed and capacity than provided by current and proposed Class VI supercomputers. The machine is a parallel processing supercomputer with several new architectural elements which can be programmed to address a wide range of problems meeting the following criteria: (1) the problem is numerically intensive, and (2) the code makes use of long vectors. A simulation of two-dimensional nonsteady viscous flows is presented to illustrate the architecture, programming, and some of the capabilities of the NSC.
Computation of Reacting Flows in Combustion Processes
NASA Technical Reports Server (NTRS)
Keith, Theo G., Jr.; Chen, Kuo-Huey
1997-01-01
The main objective of this research was to develop an efficient three-dimensional computer code for chemically reacting flows. The main computer code developed is ALLSPD-3D. The ALLSPD-3D computer program is developed for the calculation of three-dimensional, chemically reacting flows with sprays. The ALL-SPD code employs a coupled, strongly implicit solution procedure for turbulent spray combustion flows. A stochastic droplet model and an efficient method for treatment of the spray source terms in the gas-phase equations are used to calculate the evaporating liquid sprays. The chemistry treatment in the code is general enough that an arbitrary number of reaction and species can be defined by the users. Also, it is written in generalized curvilinear coordinates with both multi-block and flexible internal blockage capabilities to handle complex geometries. In addition, for general industrial combustion applications, the code provides both dilution and transpiration cooling capabilities. The ALLSPD algorithm, which employs the preconditioning and eigenvalue rescaling techniques, is capable of providing efficient solution for flows with a wide range of Mach numbers. Although written for three-dimensional flows in general, the code can be used for two-dimensional and axisymmetric flow computations as well. The code is written in such a way that it can be run in various computer platforms (supercomputers, workstations and parallel processors) and the GUI (Graphical User Interface) should provide a user-friendly tool in setting up and running the code.
The portals 4.0.1 network programming interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin
2013-04-01
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generationmore » of machines employing advanced network interface architectures that support enhanced offload capabilities. 3« less
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Aggregating job exit statuses of a plurality of compute nodes executing a parallel application, including: identifying a subset of compute nodes in the parallel computer to execute the parallel application; selecting one compute node in the subset of compute nodes in the parallel computer as a job leader compute node; initiating execution of the parallel application on the subset of compute nodes; receiving an exit status from each compute node in the subset of compute nodes, where the exit status for each compute node includes information describing execution of some portion of the parallel application by the compute node; aggregatingmore » each exit status from each compute node in the subset of compute nodes; and sending an aggregated exit status for the subset of compute nodes in the parallel computer.« less
Using Agent Base Models to Optimize Large Scale Network for Large System Inventories
NASA Technical Reports Server (NTRS)
Shameldin, Ramez Ahmed; Bowling, Shannon R.
2010-01-01
The aim of this paper is to use Agent Base Models (ABM) to optimize large scale network handling capabilities for large system inventories and to implement strategies for the purpose of reducing capital expenses. The models used in this paper either use computational algorithms or procedure implementations developed by Matlab to simulate agent based models in a principal programming language and mathematical theory using clusters, these clusters work as a high performance computational performance to run the program in parallel computational. In both cases, a model is defined as compilation of a set of structures and processes assumed to underlie the behavior of a network system.
Lewis Structures Technology, 1988. Volume 2: Structural Mechanics
NASA Technical Reports Server (NTRS)
1988-01-01
Lewis Structures Div. performs and disseminates results of research conducted in support of aerospace engine structures. These results have a wide range of applicability to practitioners of structural engineering mechanics beyond the aerospace arena. The engineering community was familiarized with the depth and range of research performed by the division and its academic and industrial partners. Sessions covered vibration control, fracture mechanics, ceramic component reliability, parallel computing, nondestructive evaluation, constitutive models and experimental capabilities, dynamic systems, fatigue and damage, wind turbines, hot section technology (HOST), aeroelasticity, structural mechanics codes, computational methods for dynamics, structural optimization, and applications of structural dynamics, and structural mechanics computer codes.
Igarashi, Jun; Shouno, Osamu; Fukai, Tomoki; Tsujino, Hiroshi
2011-11-01
Real-time simulation of a biologically realistic spiking neural network is necessary for evaluation of its capacity to interact with real environments. However, the real-time simulation of such a neural network is difficult due to its high computational costs that arise from two factors: (1) vast network size and (2) the complicated dynamics of biologically realistic neurons. In order to address these problems, mainly the latter, we chose to use general purpose computing on graphics processing units (GPGPUs) for simulation of such a neural network, taking advantage of the powerful computational capability of a graphics processing unit (GPU). As a target for real-time simulation, we used a model of the basal ganglia that has been developed according to electrophysiological and anatomical knowledge. The model consists of heterogeneous populations of 370 spiking model neurons, including computationally heavy conductance-based models, connected by 11,002 synapses. Simulation of the model has not yet been performed in real-time using a general computing server. By parallelization of the model on the NVIDIA Geforce GTX 280 GPU in data-parallel and task-parallel fashion, faster-than-real-time simulation was robustly realized with only one-third of the GPU's total computational resources. Furthermore, we used the GPU's full computational resources to perform faster-than-real-time simulation of three instances of the basal ganglia model; these instances consisted of 1100 neurons and 33,006 synapses and were synchronized at each calculation step. Finally, we developed software for simultaneous visualization of faster-than-real-time simulation output. These results suggest the potential power of GPGPU techniques in real-time simulation of realistic neural networks. Copyright © 2011 Elsevier Ltd. All rights reserved.
Recent advances and future prospects for Monte Carlo
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, Forrest B
2010-01-01
The history of Monte Carlo methods is closely linked to that of computers: The first known Monte Carlo program was written in 1947 for the ENIAC; a pre-release of the first Fortran compiler was used for Monte Carlo In 1957; Monte Carlo codes were adapted to vector computers in the 1980s, clusters and parallel computers in the 1990s, and teraflop systems in the 2000s. Recent advances include hierarchical parallelism, combining threaded calculations on multicore processors with message-passing among different nodes. With the advances In computmg, Monte Carlo codes have evolved with new capabilities and new ways of use. Production codesmore » such as MCNP, MVP, MONK, TRIPOLI and SCALE are now 20-30 years old (or more) and are very rich in advanced featUres. The former 'method of last resort' has now become the first choice for many applications. Calculations are now routinely performed on office computers, not just on supercomputers. Current research and development efforts are investigating the use of Monte Carlo methods on FPGAs. GPUs, and many-core processors. Other far-reaching research is exploring ways to adapt Monte Carlo methods to future exaflop systems that may have 1M or more concurrent computational processes.« less
NASA Technical Reports Server (NTRS)
Watson, Willie R.; Nark, Douglas M.; Nguyen, Duc T.; Tungkahotara, Siroj
2006-01-01
A finite element solution to the convected Helmholtz equation in a nonuniform flow is used to model the noise field within 3-D acoustically treated aero-engine nacelles. Options to select linear or cubic Hermite polynomial basis functions and isoparametric elements are included. However, the key feature of the method is a domain decomposition procedure that is based upon the inter-mixing of an iterative and a direct solve strategy for solving the discrete finite element equations. This procedure is optimized to take full advantage of sparsity and exploit the increased memory and parallel processing capability of modern computer architectures. Example computations are presented for the Langley Flow Impedance Test facility and a rectangular mapping of a full scale, generic aero-engine nacelle. The accuracy and parallel performance of this new solver are tested on both model problems using a supercomputer that contains hundreds of central processing units. Results show that the method gives extremely accurate attenuation predictions, achieves super-linear speedup over hundreds of CPUs, and solves upward of 25 million complex equations in a quarter of an hour.
NASA Wrangler: Automated Cloud-Based Data Assembly in the RECOVER Wildfire Decision Support System
NASA Technical Reports Server (NTRS)
Schnase, John; Carroll, Mark; Gill, Roger; Wooten, Margaret; Weber, Keith; Blair, Kindra; May, Jeffrey; Toombs, William
2017-01-01
NASA Wrangler is a loosely-coupled, event driven, highly parallel data aggregation service designed to take advantageof the elastic resource capabilities of cloud computing. Wrangler automatically collects Earth observational data, climate model outputs, derived remote sensing data products, and historic biophysical data for pre-, active-, and post-wildfire decision making. It is a core service of the RECOVER decision support system, which is providing rapid-response GIS analytic capabilities to state and local government agencies. Wrangler reduces to minutes the time needed to assemble and deliver crucial wildfire-related data.
Staging memory for massively parallel processor
NASA Technical Reports Server (NTRS)
Batcher, Kenneth E. (Inventor)
1988-01-01
The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.
Parallel processors and nonlinear structural dynamics algorithms and software
NASA Technical Reports Server (NTRS)
Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.
1989-01-01
The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.
Banta, Edward R.; Poeter, Eileen P.; Doherty, John E.; Hill, Mary C.
2006-01-01
he Joint Universal Parameter IdenTification and Evaluation of Reliability Application Programming Interface (JUPITER API) improves the computer programming resources available to those developing applications (computer programs) for model analysis.The JUPITER API consists of eleven Fortran-90 modules that provide for encapsulation of data and operations on that data. Each module contains one or more entities: data, data types, subroutines, functions, and generic interfaces. The modules do not constitute computer programs themselves; instead, they are used to construct computer programs. Such computer programs are called applications of the API. The API provides common modeling operations for use by a variety of computer applications.The models being analyzed are referred to here as process models, and may, for example, represent the physics, chemistry, and(or) biology of a field or laboratory system. Process models commonly are constructed using published models such as MODFLOW (Harbaugh et al., 2000; Harbaugh, 2005), MT3DMS (Zheng and Wang, 1996), HSPF (Bicknell et al., 1997), PRMS (Leavesley and Stannard, 1995), and many others. The process model may be accessed by a JUPITER API application as an external program, or it may be implemented as a subroutine within a JUPITER API application . In either case, execution of the model takes place in a framework designed by the application programmer. This framework can be designed to take advantage of any parallel processing capabilities possessed by the process model, as well as the parallel-processing capabilities of the JUPITER API.Model analyses for which the JUPITER API could be useful include, for example: Compare model results to observed values to determine how well the model reproduces system processes and characteristics.Use sensitivity analysis to determine the information provided by observations to parameters and predictions of interest.Determine the additional data needed to improve selected model predictions.Use calibration methods to modify parameter values and other aspects of the model.Compare predictions to regulatory limits.Quantify the uncertainty of predictions based on the results of one or many simulations using inferential or Monte Carlo methods.Determine how to manage the system to achieve stated objectives.The capabilities provided by the JUPITER API include, for example, communication with process models, parallel computations, compressed storage of matrices, and flexible input capabilities. The input capabilities use input blocks suitable for lists or arrays of data. The input blocks needed for one application can be included within one data file or distributed among many files. Data exchange between different JUPITER API applications or between applications and other programs is supported by data-exchange files.The JUPITER API has already been used to construct a number of applications. Three simple example applications are presented in this report. More complicated applications include the universal inverse code UCODE_2005 (Poeter et al., 2005), the multi-model analysis MMA (Eileen P. Poeter, Mary C. Hill, E.R. Banta, S.W. Mehl, and Steen Christensen, written commun., 2006), and a code named OPR_PPR (Matthew J. Tonkin, Claire R. Tiedeman, Mary C. Hill, and D. Matthew Ely, written communication, 2006).This report describes a set of underlying organizational concepts and complete specifics about the JUPITER API. While understanding the organizational concept presented is useful to understanding the modules, other organizational concepts can be used in applications constructed using the JUPITER API.
Multidimensional bioseparation with modular microfluidics
Chirica, Gabriela S.; Renzi, Ronald F.
2013-08-27
A multidimensional chemical separation and analysis system is described including a prototyping platform and modular microfluidic components capable of rapid and convenient assembly, alteration and disassembly of numerous candidate separation systems. Partial or total computer control of the separation system is possible. Single or multiple alternative processing trains can be tested, optimized and/or run in parallel. Examples related to the separation and analysis of human bodily fluids are given.
Application of CHAD hydrodynamics to shock-wave problems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Trease, H.E.; O`Rourke, P.J.; Sahota, M.S.
1997-12-31
CHAD is the latest in a sequence of continually evolving computer codes written to effectively utilize massively parallel computer architectures and the latest grid generators for unstructured meshes. Its applications range from automotive design issues such as in-cylinder and manifold flows of internal combustion engines, vehicle aerodynamics, underhood cooling and passenger compartment heating, ventilation, and air conditioning to shock hydrodynamics and materials modeling. CHAD solves the full unsteady Navier-Stoke equations with the k-epsilon turbulence model in three space dimensions. The code has four major features that distinguish it from the earlier KIVA code, also developed at Los Alamos. First, itmore » is based on a node-centered, finite-volume method in which, like finite element methods, all fluid variables are located at computational nodes. The computational mesh efficiently and accurately handles all element shapes ranging from tetrahedra to hexahedra. Second, it is written in standard Fortran 90 and relies on automatic domain decomposition and a universal communication library written in standard C and MPI for unstructured grids to effectively exploit distributed-memory parallel architectures. Thus the code is fully portable to a variety of computing platforms such as uniprocessor workstations, symmetric multiprocessors, clusters of workstations, and massively parallel platforms. Third, CHAD utilizes a variable explicit/implicit upwind method for convection that improves computational efficiency in flows that have large velocity Courant number variations due to velocity of mesh size variations. Fourth, CHAD is designed to also simulate shock hydrodynamics involving multimaterial anisotropic behavior under high shear. The authors will discuss CHAD capabilities and show several sample calculations showing the strengths and weaknesses of CHAD.« less
Three-dimensional laser microvision.
Shimotahira, H; Iizuka, K; Chu, S C; Wah, C; Costen, F; Yoshikuni, Y
2001-04-10
A three-dimensional (3-D) optical imaging system offering high resolution in all three dimensions, requiring minimum manipulation and capable of real-time operation, is presented. The system derives its capabilities from use of the superstructure grating laser source in the implementation of a laser step frequency radar for depth information acquisition. A synthetic aperture radar technique was also used to further enhance its lateral resolution as well as extend the depth of focus. High-speed operation was made possible by a dual computer system consisting of a host and a remote microcomputer supported by a dual-channel Small Computer System Interface parallel data transfer system. The system is capable of operating near real time. The 3-D display of a tunneling diode, a microwave integrated circuit, and a see-through image taken by the system operating near real time are included. The depth resolution is 40 mum; lateral resolution with a synthetic aperture approach is a fraction of a micrometer and that without it is approximately 10 mum.
Group implicit concurrent algorithms in nonlinear structural dynamics
NASA Technical Reports Server (NTRS)
Ortiz, M.; Sotelino, E. D.
1989-01-01
During the 70's and 80's, considerable effort was devoted to developing efficient and reliable time stepping procedures for transient structural analysis. Mathematically, the equations governing this type of problems are generally stiff, i.e., they exhibit a wide spectrum in the linear range. The algorithms best suited to this type of applications are those which accurately integrate the low frequency content of the response without necessitating the resolution of the high frequency modes. This means that the algorithms must be unconditionally stable, which in turn rules out explicit integration. The most exciting possibility in the algorithms development area in recent years has been the advent of parallel computers with multiprocessing capabilities. So, this work is mainly concerned with the development of parallel algorithms in the area of structural dynamics. A primary objective is to devise unconditionally stable and accurate time stepping procedures which lend themselves to an efficient implementation in concurrent machines. Some features of the new computer architecture are summarized. A brief survey of current efforts in the area is presented. A new class of concurrent procedures, or Group Implicit algorithms is introduced and analyzed. The numerical simulation shows that GI algorithms hold considerable promise for application in coarse grain as well as medium grain parallel computers.
High-performance parallel analysis of coupled problems for aircraft propulsion
NASA Technical Reports Server (NTRS)
Felippa, C. A.; Farhat, C.; Lanteri, S.; Maman, N.; Piperno, S.; Gumaste, U.
1994-01-01
This research program deals with the application of high-performance computing methods for the analysis of complete jet engines. We have entitled this program by applying the two dimensional parallel aeroelastic codes to the interior gas flow problem of a bypass jet engine. The fluid mesh generation, domain decomposition, and solution capabilities were successfully tested. We then focused attention on methodology for the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion that results from these structural displacements. This is treated by a new arbitrary Lagrangian-Eulerian (ALE) technique that models the fluid mesh motion as that of a fictitious mass-spring network. New partitioned analysis procedures to treat this coupled three-component problem are developed. These procedures involved delayed corrections and subcycling. Preliminary results on the stability, accuracy, and MPP computational efficiency are reported.
Dispatching packets on a global combining network of a parallel computer
Almasi, Gheorghe [Ardsley, NY; Archer, Charles J [Rochester, MN
2011-07-19
Methods, apparatus, and products are disclosed for dispatching packets on a global combining network of a parallel computer comprising a plurality of nodes connected for data communications using the network capable of performing collective operations and point to point operations that include: receiving, by an origin system messaging module on an origin node from an origin application messaging module on the origin node, a storage identifier and an operation identifier, the storage identifier specifying storage containing an application message for transmission to a target node, and the operation identifier specifying a message passing operation; packetizing, by the origin system messaging module, the application message into network packets for transmission to the target node, each network packet specifying the operation identifier and an operation type for the message passing operation specified by the operation identifier; and transmitting, by the origin system messaging module, the network packets to the target node.
CFD research, parallel computation and aerodynamic optimization
NASA Technical Reports Server (NTRS)
Ryan, James S.
1995-01-01
Over five years of research in Computational Fluid Dynamics and its applications are covered in this report. Using CFD as an established tool, aerodynamic optimization on parallel architectures is explored. The objective of this work is to provide better tools to vehicle designers. Submarine design requires accurate force and moment calculations in flow with thick boundary layers and large separated vortices. Low noise production is critical, so flow into the propulsor region must be predicted accurately. The High Speed Civil Transport (HSCT) has been the subject of recent work. This vehicle is to be a passenger vehicle with the capability of cutting overseas flight times by more than half. A successful design must surpass the performance of comparable planes. Fuel economy, other operational costs, environmental impact, and range must all be improved substantially. For all these reasons, improved design tools are required, and these tools must eventually integrate optimization, external aerodynamics, propulsion, structures, heat transfer and other disciplines.
Automated target recognition and tracking using an optical pattern recognition neural network
NASA Technical Reports Server (NTRS)
Chao, Tien-Hsin
1991-01-01
The on-going development of an automatic target recognition and tracking system at the Jet Propulsion Laboratory is presented. This system is an optical pattern recognition neural network (OPRNN) that is an integration of an innovative optical parallel processor and a feature extraction based neural net training algorithm. The parallel optical processor provides high speed and vast parallelism as well as full shift invariance. The neural network algorithm enables simultaneous discrimination of multiple noisy targets in spite of their scales, rotations, perspectives, and various deformations. This fully developed OPRNN system can be effectively utilized for the automated spacecraft recognition and tracking that will lead to success in the Automated Rendezvous and Capture (AR&C) of the unmanned Cargo Transfer Vehicle (CTV). One of the most powerful optical parallel processors for automatic target recognition is the multichannel correlator. With the inherent advantages of parallel processing capability and shift invariance, multiple objects can be simultaneously recognized and tracked using this multichannel correlator. This target tracking capability can be greatly enhanced by utilizing a powerful feature extraction based neural network training algorithm such as the neocognitron. The OPRNN, currently under investigation at JPL, is constructed with an optical multichannel correlator where holographic filters have been prepared using the neocognitron training algorithm. The computation speed of the neocognitron-type OPRNN is up to 10(exp 14) analog connections/sec that enabling the OPRNN to outperform its state-of-the-art electronics counterpart by at least two orders of magnitude.
Akuna: An Open Source User Environment for Managing Subsurface Simulation Workflows
NASA Astrophysics Data System (ADS)
Freedman, V. L.; Agarwal, D.; Bensema, K.; Finsterle, S.; Gable, C. W.; Keating, E. H.; Krishnan, H.; Lansing, C.; Moeglein, W.; Pau, G. S. H.; Porter, E.; Scheibe, T. D.
2014-12-01
The U.S. Department of Energy (DOE) is investing in development of a numerical modeling toolset called ASCEM (Advanced Simulation Capability for Environmental Management) to support modeling analyses at legacy waste sites. ASCEM is an open source and modular computing framework that incorporates new advances and tools for predicting contaminant fate and transport in natural and engineered systems. The ASCEM toolset includes both a Platform with Integrated Toolsets (called Akuna) and a High-Performance Computing multi-process simulator (called Amanzi). The focus of this presentation is on Akuna, an open-source user environment that manages subsurface simulation workflows and associated data and metadata. In this presentation, key elements of Akuna are demonstrated, which includes toolsets for model setup, database management, sensitivity analysis, parameter estimation, uncertainty quantification, and visualization of both model setup and simulation results. A key component of the workflow is in the automated job launching and monitoring capabilities, which allow a user to submit and monitor simulation runs on high-performance, parallel computers. Visualization of large outputs can also be performed without moving data back to local resources. These capabilities make high-performance computing accessible to the users who might not be familiar with batch queue systems and usage protocols on different supercomputers and clusters.
High performance computing and communications: Advancing the frontiers of information technology
DOE Office of Scientific and Technical Information (OSTI.GOV)
NONE
1997-12-31
This report, which supplements the President`s Fiscal Year 1997 Budget, describes the interagency High Performance Computing and Communications (HPCC) Program. The HPCC Program will celebrate its fifth anniversary in October 1996 with an impressive array of accomplishments to its credit. Over its five-year history, the HPCC Program has focused on developing high performance computing and communications technologies that can be applied to computation-intensive applications. Major highlights for FY 1996: (1) High performance computing systems enable practical solutions to complex problems with accuracies not possible five years ago; (2) HPCC-funded research in very large scale networking techniques has been instrumental inmore » the evolution of the Internet, which continues exponential growth in size, speed, and availability of information; (3) The combination of hardware capability measured in gigaflop/s, networking technology measured in gigabit/s, and new computational science techniques for modeling phenomena has demonstrated that very large scale accurate scientific calculations can be executed across heterogeneous parallel processing systems located thousands of miles apart; (4) Federal investments in HPCC software R and D support researchers who pioneered the development of parallel languages and compilers, high performance mathematical, engineering, and scientific libraries, and software tools--technologies that allow scientists to use powerful parallel systems to focus on Federal agency mission applications; and (5) HPCC support for virtual environments has enabled the development of immersive technologies, where researchers can explore and manipulate multi-dimensional scientific and engineering problems. Educational programs fostered by the HPCC Program have brought into classrooms new science and engineering curricula designed to teach computational science. This document contains a small sample of the significant HPCC Program accomplishments in FY 1996.« less
NASA Astrophysics Data System (ADS)
Sun, Rui; Xiao, Heng
2016-04-01
With the growth of available computational resource, CFD-DEM (computational fluid dynamics-discrete element method) becomes an increasingly promising and feasible approach for the study of sediment transport. Several existing CFD-DEM solvers are applied in chemical engineering and mining industry. However, a robust CFD-DEM solver for the simulation of sediment transport is still desirable. In this work, the development of a three-dimensional, massively parallel, and open-source CFD-DEM solver SediFoam is detailed. This solver is built based on open-source solvers OpenFOAM and LAMMPS. OpenFOAM is a CFD toolbox that can perform three-dimensional fluid flow simulations on unstructured meshes; LAMMPS is a massively parallel DEM solver for molecular dynamics. Several validation tests of SediFoam are performed using cases of a wide range of complexities. The results obtained in the present simulations are consistent with those in the literature, which demonstrates the capability of SediFoam for sediment transport applications. In addition to the validation test, the parallel efficiency of SediFoam is studied to test the performance of the code for large-scale and complex simulations. The parallel efficiency tests show that the scalability of SediFoam is satisfactory in the simulations using up to O(107) particles.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carlson, Neil; Jibben, Zechariah; Brady, Peter
2017-06-28
Pececillo is a proxy-app for the open source Truchas metal processing code (LA-CC-15-097). It implements many of the physics models used in Truchas: free-surface, incompressible Navier-Stokes fluid dynamics (e.g., water waves); heat transport, material phase change, view factor thermal radiation; species advection-diffusion; quasi-static, elastic/plastic solid mechanics with contact; electomagnetics (Maxwell's equations). The models are simplified versions that retain the fundamental computational complexity of the Truchas models while omitting many non-essential features and modeling capabilities. The purpose is to expose Truchas algorithms in a greatly simplified context where computer science problems related to parallel performance on advanced architectures can be moremore » easily investigated. While Pececillo is capable of performing simulations representative of typical Truchas metal casting, welding, and additive manufacturing simulations, it lacks many of the modeling capabilites needed for real applications.« less
Near-Body Grid Adaption for Overset Grids
NASA Technical Reports Server (NTRS)
Buning, Pieter G.; Pulliam, Thomas H.
2016-01-01
A solution adaption capability for curvilinear near-body grids has been implemented in the OVERFLOW overset grid computational fluid dynamics code. The approach follows closely that used for the Cartesian off-body grids, but inserts refined grids in the computational space of original near-body grids. Refined curvilinear grids are generated using parametric cubic interpolation, with one-sided biasing based on curvature and stretching ratio of the original grid. Sensor functions, grid marking, and solution interpolation tasks are implemented in the same fashion as for off-body grids. A goal-oriented procedure, based on largest error first, is included for controlling growth rate and maximum size of the adapted grid system. The adaption process is almost entirely parallelized using MPI, resulting in a capability suitable for viscous, moving body simulations. Two- and three-dimensional examples are presented.
Performance Models for the Spike Banded Linear System Solver
Manguoglu, Murat; Saied, Faisal; Sameh, Ahmed; ...
2011-01-01
With availability of large-scale parallel platforms comprised of tens-of-thousands of processors and beyond, there is significant impetus for the development of scalable parallel sparse linear system solvers and preconditioners. An integral part of this design process is the development of performance models capable of predicting performance and providing accurate cost models for the solvers and preconditioners. There has been some work in the past on characterizing performance of the iterative solvers themselves. In this paper, we investigate the problem of characterizing performance and scalability of banded preconditioners. Recent work has demonstrated the superior convergence properties and robustness of banded preconditioners,more » compared to state-of-the-art ILU family of preconditioners as well as algebraic multigrid preconditioners. Furthermore, when used in conjunction with efficient banded solvers, banded preconditioners are capable of significantly faster time-to-solution. Our banded solver, the Truncated Spike algorithm is specifically designed for parallel performance and tolerance to deep memory hierarchies. Its regular structure is also highly amenable to accurate performance characterization. Using these characteristics, we derive the following results in this paper: (i) we develop parallel formulations of the Truncated Spike solver, (ii) we develop a highly accurate pseudo-analytical parallel performance model for our solver, (iii) we show excellent predication capabilities of our model – based on which we argue the high scalability of our solver. Our pseudo-analytical performance model is based on analytical performance characterization of each phase of our solver. These analytical models are then parameterized using actual runtime information on target platforms. An important consequence of our performance models is that they reveal underlying performance bottlenecks in both serial and parallel formulations. All of our results are validated on diverse heterogeneous multiclusters – platforms for which performance prediction is particularly challenging. Finally, we provide predict the scalability of the Spike algorithm using up to 65,536 cores with our model. In this paper we extend the results presented in the Ninth International Symposium on Parallel and Distributed Computing.« less
A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing.
Li, Zhixin; Su, Dandan; Zhu, Haijiang; Li, Wei; Zhang, Fan; Li, Ruirui
2017-01-08
Synthetic Aperture Radar (SAR) raw data simulation is a fundamental problem in radar system design and imaging algorithm research. The growth of surveying swath and resolution results in a significant increase in data volume and simulation period, which can be considered to be a comprehensive data intensive and computing intensive issue. Although several high performance computing (HPC) methods have demonstrated their potential for accelerating simulation, the input/output (I/O) bottleneck of huge raw data has not been eased. In this paper, we propose a cloud computing based SAR raw data simulation algorithm, which employs the MapReduce model to accelerate the raw data computing and the Hadoop distributed file system (HDFS) for fast I/O access. The MapReduce model is designed for the irregular parallel accumulation of raw data simulation, which greatly reduces the parallel efficiency of graphics processing unit (GPU) based simulation methods. In addition, three kinds of optimization strategies are put forward from the aspects of programming model, HDFS configuration and scheduling. The experimental results show that the cloud computing based algorithm achieves 4_ speedup over the baseline serial approach in an 8-node cloud environment, and each optimization strategy can improve about 20%. This work proves that the proposed cloud algorithm is capable of solving the computing intensive and data intensive issues in SAR raw data simulation, and is easily extended to large scale computing to achieve higher acceleration.
Computational Fluid Dynamics of Whole-Body Aircraft
NASA Astrophysics Data System (ADS)
Agarwal, Ramesh
1999-01-01
The current state of the art in computational aerodynamics for whole-body aircraft flowfield simulations is described. Recent advances in geometry modeling, surface and volume grid generation, and flow simulation algorithms have led to accurate flowfield predictions for increasingly complex and realistic configurations. As a result, computational aerodynamics has emerged as a crucial enabling technology for the design and development of flight vehicles. Examples illustrating the current capability for the prediction of transport and fighter aircraft flowfields are presented. Unfortunately, accurate modeling of turbulence remains a major difficulty in the analysis of viscosity-dominated flows. In the future, inverse design methods, multidisciplinary design optimization methods, artificial intelligence technology, and massively parallel computer technology will be incorporated into computational aerodynamics, opening up greater opportunities for improved product design at substantially reduced costs.
ICASE semiannual report, April 1 - September 30, 1989
NASA Technical Reports Server (NTRS)
1990-01-01
The Institute conducts unclassified basic research in applied mathematics, numerical analysis, and computer science in order to extend and improve problem-solving capabilities in science and engineering, particularly in aeronautics and space. The major categories of the current Institute for Computer Applications in Science and Engineering (ICASE) research program are: (1) numerical methods, with particular emphasis on the development and analysis of basic numerical algorithms; (2) control and parameter identification problems, with emphasis on effective numerical methods; (3) computational problems in engineering and the physical sciences, particularly fluid dynamics, acoustics, and structural analysis; and (4) computer systems and software, especially vector and parallel computers. ICASE reports are considered to be primarily preprints of manuscripts that have been submitted to appropriate research journals or that are to appear in conference proceedings.
NASA Technical Reports Server (NTRS)
Harper, Richard
1989-01-01
In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.
NASA Astrophysics Data System (ADS)
Fiandrotti, Attilio; Fosson, Sophie M.; Ravazzi, Chiara; Magli, Enrico
2018-04-01
Compressive sensing promises to enable bandwidth-efficient on-board compression of astronomical data by lifting the encoding complexity from the source to the receiver. The signal is recovered off-line, exploiting GPUs parallel computation capabilities to speedup the reconstruction process. However, inherent GPU hardware constraints limit the size of the recoverable signal and the speedup practically achievable. In this work, we design parallel algorithms that exploit the properties of circulant matrices for efficient GPU-accelerated sparse signals recovery. Our approach reduces the memory requirements, allowing us to recover very large signals with limited memory. In addition, it achieves a tenfold signal recovery speedup thanks to ad-hoc parallelization of matrix-vector multiplications and matrix inversions. Finally, we practically demonstrate our algorithms in a typical application of circulant matrices: deblurring a sparse astronomical image in the compressed domain.
A Computer Simulation of the System-Wide Effects of Parallel-Offset Route Maneuvers
NASA Technical Reports Server (NTRS)
Lauderdale, Todd A.; Santiago, Confesor; Pankok, Carl
2010-01-01
Most aircraft managed by air-traffic controllers in the National Airspace System are capable of flying parallel-offset routes. This paper presents the results of two related studies on the effects of increased use of offset routes as a conflict resolution maneuver. The first study analyzes offset routes in the context of all standard resolution types which air-traffic controllers currently use. This study shows that by utilizing parallel-offset route maneuvers, significant system-wide savings in delay due to conflict resolution of up to 30% are possible. It also shows that most offset resolutions replace horizontal-vectoring resolutions. The second study builds on the results of the first and directly compares offset resolutions and standard horizontal-vectoring maneuvers to determine that in-trail conflicts are often more efficiently resolved by offset maneuvers.
Nuclide Depletion Capabilities in the Shift Monte Carlo Code
Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...
2017-12-21
A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.
PARVMEC: An Efficient, Scalable Implementation of the Variational Moments Equilibrium Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seal, Sudip K; Hirshman, Steven Paul; Wingen, Andreas
The ability to sustain magnetically confined plasma in a state of stable equilibrium is crucial for optimal and cost-effective operations of fusion devices like tokamaks and stellarators. The Variational Moments Equilibrium Code (VMEC) is the de-facto serial application used by fusion scientists to compute magnetohydrodynamics (MHD) equilibria and study the physics of three dimensional plasmas in confined configurations. Modern fusion energy experiments have larger system scales with more interactive experimental workflows, both demanding faster analysis turnaround times on computational workloads that are stressing the capabilities of sequential VMEC. In this paper, we present PARVMEC, an efficient, parallel version of itsmore » sequential counterpart, capable of scaling to thousands of processors on distributed memory machines. PARVMEC is a non-linear code, with multiple numerical physics modules, each with its own computational complexity. A detailed speedup analysis supported by scaling results on 1,024 cores of a Cray XC30 supercomputer is presented. Depending on the mode of PARVMEC execution, speedup improvements of one to two orders of magnitude are reported. PARVMEC equips fusion scientists for the first time with a state-of-theart capability for rapid, high fidelity analyses of magnetically confined plasmas at unprecedented scales.« less
Benchmarking Memory Performance with the Data Cube Operator
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Shabanov, Leonid V.
2004-01-01
Data movement across a computer memory hierarchy and across computational grids is known to be a limiting factor for applications processing large data sets. We use the Data Cube Operator on an Arithmetic Data Set, called ADC, to benchmark capabilities of computers and of computational grids to handle large distributed data sets. We present a prototype implementation of a parallel algorithm for computation of the operatol: The algorithm follows a known approach for computing views from the smallest parent. The ADC stresses all levels of grid memory and storage by producing some of 2d views of an Arithmetic Data Set of d-tuples described by a small number of integers. We control data intensity of the ADC by selecting the tuple parameters, the sizes of the views, and the number of realized views. Benchmarking results of memory performance of a number of computer architectures and of a small computational grid are presented.
Accelerating Wright–Fisher Forward Simulations on the Graphics Processing Unit
Lawrie, David S.
2017-01-01
Forward Wright–Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the Central Processor Unit (CPU), thus limiting their usefulness. However, the single-locus Wright–Fisher forward algorithm is exceedingly parallelizable, with many steps that are so-called “embarrassingly parallel,” consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors have allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright–Fisher simulation, or “GO Fish” for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data, all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/. PMID:28768689
Verification and Validation: High Charge and Energy (HZE) Transport Codes and Future Development
NASA Technical Reports Server (NTRS)
Wilson, John W.; Tripathi, Ram K.; Mertens, Christopher J.; Blattnig, Steve R.; Clowdsley, Martha S.; Cucinotta, Francis A.; Tweed, John; Heinbockel, John H.; Walker, Steven A.; Nealy, John E.
2005-01-01
In the present paper, we give the formalism for further developing a fully three-dimensional HZETRN code using marching procedures but also development of a new Green's function code is discussed. The final Green's function code is capable of not only validation in the space environment but also in ground based laboratories with directed beams of ions of specific energy and characterized with detailed diagnostic particle spectrometer devices. Special emphasis is given to verification of the computational procedures and validation of the resultant computational model using laboratory and spaceflight measurements. Due to historical requirements, two parallel development paths for computational model implementation using marching procedures and Green s function techniques are followed. A new version of the HZETRN code capable of simulating HZE ions with either laboratory or space boundary conditions is under development. Validation of computational models at this time is particularly important for President Bush s Initiative to develop infrastructure for human exploration with first target demonstration of the Crew Exploration Vehicle (CEV) in low Earth orbit in 2008.
Development of a Robust and Efficient Parallel Solver for Unsteady Turbomachinery Flows
NASA Technical Reports Server (NTRS)
West, Jeff; Wright, Jeffrey; Thakur, Siddharth; Luke, Ed; Grinstead, Nathan
2012-01-01
The traditional design and analysis practice for advanced propulsion systems relies heavily on expensive full-scale prototype development and testing. Over the past decade, use of high-fidelity analysis and design tools such as CFD early in the product development cycle has been identified as one way to alleviate testing costs and to develop these devices better, faster and cheaper. In the design of advanced propulsion systems, CFD plays a major role in defining the required performance over the entire flight regime, as well as in testing the sensitivity of the design to the different modes of operation. Increased emphasis is being placed on developing and applying CFD models to simulate the flow field environments and performance of advanced propulsion systems. This necessitates the development of next generation computational tools which can be used effectively and reliably in a design environment. The turbomachinery simulation capability presented here is being developed in a computational tool called Loci-STREAM [1]. It integrates proven numerical methods for generalized grids and state-of-the-art physical models in a novel rule-based programming framework called Loci [2] which allows: (a) seamless integration of multidisciplinary physics in a unified manner, and (b) automatic handling of massively parallel computing. The objective is to be able to routinely simulate problems involving complex geometries requiring large unstructured grids and complex multidisciplinary physics. An immediate application of interest is simulation of unsteady flows in rocket turbopumps, particularly in cryogenic liquid rocket engines. The key components of the overall methodology presented in this paper are the following: (a) high fidelity unsteady simulation capability based on Detached Eddy Simulation (DES) in conjunction with second-order temporal discretization, (b) compliance with Geometric Conservation Law (GCL) in order to maintain conservative property on moving meshes for second-order time-stepping scheme, (c) a novel cloud-of-points interpolation method (based on a fast parallel kd-tree search algorithm) for interfaces between turbomachinery components in relative motion which is demonstrated to be highly scalable, and (d) demonstrated accuracy and parallel scalability on large grids (approx 250 million cells) in full turbomachinery geometries.
Single Sided Messaging v. 0.6.6
DOE Office of Scientific and Technical Information (OSTI.GOV)
Curry, Matthew Leon; Farmer, Matthew Shane; Hassani, Amin
Single-Sided Messaging (SSM) is a portable, multitransport networking library that enables applications to leverage potential one-sided capabilities of underlying network transports. It also provides desirable semantics that services for highperformance, massively parallel computers can leverage, such as an explicit cancel operation for pending transmissions, as well as enhanced matching semantics favoring large numbers of buffers attached to a single match entry. This release supports TCP/IP, shared memory, and Infiniband.
Cloud-Based Perception and Control of Sensor Nets and Robot Swarms
2016-04-01
distributed stream processing framework provides the necessary API and infrastructure to develop and execute such applications in a cluster of computation...streaming DDDAS applications based on challenges they present to the backend Cloud control system. Figure 2 Parallel SLAM Application 3 1) Set of...the art deep learning- based object detectors can recognize among hundreds of object classes and this capability would be very useful for mobile
High performance, low cost, self-contained, multipurpose PC based ground systems
NASA Technical Reports Server (NTRS)
Forman, Michael; Nickum, William; Troendly, Gregory
1993-01-01
The use of embedded processors greatly enhances the capabilities of personal computers when used for telemetry processing and command control center functions. Parallel architectures based on the use of transputers are shown to be very versatile and reusable, and the synergism between the PC and the embedded processor with transputers results in single unit, low cost workstations of 20 less than MIPS less than or equal to 1000.
DOE Office of Scientific and Technical Information (OSTI.GOV)
2014-01-17
This library is an implementation of the Sparse Approximate Matrix Multiplication (SpAMM) algorithm introduced. It provides a matrix data type, and an approximate matrix product, which exhibits linear scaling computational complexity for matrices with decay. The product error and the performance of the multiply can be tuned by choosing an appropriate tolerance. The library can be compiled for serial execution or parallel execution on shared memory systems with an OpenMP capable compiler
Distributed Finite Element Analysis Using a Transputer Network
NASA Technical Reports Server (NTRS)
Watson, James; Favenesi, James; Danial, Albert; Tombrello, Joseph; Yang, Dabby; Reynolds, Brian; Turrentine, Ronald; Shephard, Mark; Baehmann, Peggy
1989-01-01
The principal objective of this research effort was to demonstrate the extraordinarily cost effective acceleration of finite element structural analysis problems using a transputer-based parallel processing network. This objective was accomplished in the form of a commercially viable parallel processing workstation. The workstation is a desktop size, low-maintenance computing unit capable of supercomputer performance yet costs two orders of magnitude less. To achieve the principal research objective, a transputer based structural analysis workstation termed XPFEM was implemented with linear static structural analysis capabilities resembling commercially available NASTRAN. Finite element model files, generated using the on-line preprocessing module or external preprocessing packages, are downloaded to a network of 32 transputers for accelerated solution. The system currently executes at about one third Cray X-MP24 speed but additional acceleration appears likely. For the NASA selected demonstration problem of a Space Shuttle main engine turbine blade model with about 1500 nodes and 4500 independent degrees of freedom, the Cray X-MP24 required 23.9 seconds to obtain a solution while the transputer network, operated from an IBM PC-AT compatible host computer, required 71.7 seconds. Consequently, the $80,000 transputer network demonstrated a cost-performance ratio about 60 times better than the $15,000,000 Cray X-MP24 system.
Overview of the NASA Glenn Flux Reconstruction Based High-Order Unstructured Grid Code
NASA Technical Reports Server (NTRS)
Spiegel, Seth C.; DeBonis, James R.; Huynh, H. T.
2016-01-01
A computational fluid dynamics code based on the flux reconstruction (FR) method is currently being developed at NASA Glenn Research Center to ultimately provide a large- eddy simulation capability that is both accurate and efficient for complex aeropropulsion flows. The FR approach offers a simple and efficient method that is easy to implement and accurate to an arbitrary order on common grid cell geometries. The governing compressible Navier-Stokes equations are discretized in time using various explicit Runge-Kutta schemes, with the default being the 3-stage/3rd-order strong stability preserving scheme. The code is written in modern Fortran (i.e., Fortran 2008) and parallelization is attained through MPI for execution on distributed-memory high-performance computing systems. An h- refinement study of the isentropic Euler vortex problem is able to empirically demonstrate the capability of the FR method to achieve super-accuracy for inviscid flows. Additionally, the code is applied to the Taylor-Green vortex problem, performing numerous implicit large-eddy simulations across a range of grid resolutions and solution orders. The solution found by a pseudo-spectral code is commonly used as a reference solution to this problem, and the FR code is able to reproduce this solution using approximately the same grid resolution. Finally, an examination of the code's performance demonstrates good parallel scaling, as well as an implementation of the FR method with a computational cost/degree- of-freedom/time-step that is essentially independent of the solution order of accuracy for structured geometries.
Performance upgrades to the MCNP6 burnup capability for large scale depletion calculations
Fensin, M. L.; Galloway, J. D.; James, M. R.
2015-04-11
The first MCNP based inline Monte Carlo depletion capability was officially released from the Radiation Safety Information and Computational Center as MCNPX 2.6.0. With the merger of MCNPX and MCNP5, MCNP6 combined the capability of both simulation tools, as well as providing new advanced technology, in a single radiation transport code. The new MCNP6 depletion capability was first showcased at the International Congress for Advancements in Nuclear Power Plants (ICAPP) meeting in 2012. At that conference the new capabilities addressed included the combined distributive and shared memory parallel architecture for the burnup capability, improved memory management, physics enhancements, and newmore » predictability as compared to the H.B Robinson Benchmark. At Los Alamos National Laboratory, a special purpose cluster named “tebow,” was constructed such to maximize available RAM per CPU, as well as leveraging swap space with solid state hard drives, to allow larger scale depletion calculations (allowing for significantly more burnable regions than previously examined). As the MCNP6 burnup capability was scaled to larger numbers of burnable regions, a noticeable slowdown was realized.This paper details two specific computational performance strategies for improving calculation speedup: (1) retrieving cross sections during transport; and (2) tallying mechanisms specific to burnup in MCNP. To combat this slowdown new performance upgrades were developed and integrated into MCNP6 1.2.« less
Parallel processing for scientific computations
NASA Technical Reports Server (NTRS)
Alkhatib, Hasan S.
1995-01-01
The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.
Heterogeneous computing architecture for fast detection of SNP-SNP interactions.
Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros
2014-06-25
The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems.
Heterogeneous computing architecture for fast detection of SNP-SNP interactions
2014-01-01
Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems. PMID:24964802
NASA Technical Reports Server (NTRS)
Reinsch, K. G. (Editor); Schmidt, W. (Editor); Ecer, A. (Editor); Haeuser, Jochem (Editor); Periaux, J. (Editor)
1992-01-01
A conference was held on parallel computational fluid dynamics and produced related papers. Topics discussed in these papers include: parallel implicit and explicit solvers for compressible flow, parallel computational techniques for Euler and Navier-Stokes equations, grid generation techniques for parallel computers, and aerodynamic simulation om massively parallel systems.
Exploiting Parallel R in the Cloud with SPRINT
Piotrowski, M.; McGilvary, G.A.; Sloan, T. M.; Mewissen, M.; Lloyd, A.D.; Forster, T.; Mitchell, L.; Ghazal, P.; Hill, J.
2012-01-01
Background Advances in DNA Microarray devices and next-generation massively parallel DNA sequencing platforms have led to an exponential growth in data availability but the arising opportunities require adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need. Objectives Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates: setting up and running SPRINT-enabled genomic analyses on Amazon’s Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world and, if resource underutilization can improve application performance. Methods The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various size on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences. Results It is possible to obtain good, scalable performance but the level of improvement is dependent upon the nature of algorithm. Resource underutilization can further improve the time to result. End-user’s location impacts on costs due to factors such as local taxation. Conclusions: Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provides an interesting alternative and provides new possibilities for smaller organisations with limited funds. PMID:23223611
Exploiting parallel R in the cloud with SPRINT.
Piotrowski, M; McGilvary, G A; Sloan, T M; Mewissen, M; Lloyd, A D; Forster, T; Mitchell, L; Ghazal, P; Hill, J
2013-01-01
Advances in DNA Microarray devices and next-generation massively parallel DNA sequencing platforms have led to an exponential growth in data availability but the arising opportunities require adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need. Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates: setting up and running SPRINT-enabled genomic analyses on Amazon's Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world and, if resource underutilization can improve application performance. The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various size on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences. It is possible to obtain good, scalable performance but the level of improvement is dependent upon the nature of the algorithm. Resource underutilization can further improve the time to result. End-user's location impacts on costs due to factors such as local taxation. Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provides an interesting alternative and provides new possibilities for smaller organisations with limited funds.
An Object Oriented Extensible Architecture for Affordable Aerospace Propulsion Systems
NASA Technical Reports Server (NTRS)
Follen, Gregory J.; Lytle, John K. (Technical Monitor)
2002-01-01
Driven by a need to explore and develop propulsion systems that exceeded current computing capabilities, NASA Glenn embarked on a novel strategy leading to the development of an architecture that enables propulsion simulations never thought possible before. Full engine 3 Dimensional Computational Fluid Dynamic propulsion system simulations were deemed impossible due to the impracticality of the hardware and software computing systems required. However, with a software paradigm shift and an embracing of parallel and distributed processing, an architecture was designed to meet the needs of future propulsion system modeling. The author suggests that the architecture designed at the NASA Glenn Research Center for propulsion system modeling has potential for impacting the direction of development of affordable weapons systems currently under consideration by the Applied Vehicle Technology Panel (AVT). This paper discusses the salient features of the NPSS Architecture including its interface layer, object layer, implementation for accessing legacy codes, numerical zooming infrastructure and its computing layer. The computing layer focuses on the use and deployment of these propulsion simulations on parallel and distributed computing platforms which has been the focus of NASA Ames. Additional features of the object oriented architecture that support MultiDisciplinary (MD) Coupling, computer aided design (CAD) access and MD coupling objects will be discussed. Included will be a discussion of the successes, challenges and benefits of implementing this architecture.
Synergia: an accelerator modeling tool with 3-D space charge
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amundson, James F.; Spentzouris, P.; /Fermilab
2004-07-01
High precision modeling of space-charge effects, together with accurate treatment of single-particle dynamics, is essential for designing future accelerators as well as optimizing the performance of existing machines. We describe Synergia, a high-fidelity parallel beam dynamics simulation package with fully three dimensional space-charge capabilities and a higher order optics implementation. We describe the computational techniques, the advanced human interface, and the parallel performance obtained using large numbers of macroparticles. We also perform code benchmarks comparing to semi-analytic results and other codes. Finally, we present initial results on particle tune spread, beam halo creation, and emittance growth in the Fermilab boostermore » accelerator.« less
Ando, S; Sekine, S; Mita, M; Katsuo, S
1989-12-15
An architecture and the algorithms for matrix multiplication using optical flip-flops (OFFs) in optical processors are proposed based on residue arithmetic. The proposed system is capable of processing all elements of matrices in parallel utilizing the information retrieving ability of optical Fourier processors. The employment of OFFs enables bidirectional data flow leading to a simpler architecture and the burden of residue-to-decimal (or residue-to-binary) conversion to operation time can be largely reduced by processing all elements in parallel. The calculated characteristics of operation time suggest a promising use of the system in a real time 2-D linear transform.
2015-01-01
Implementing parallel and multivalued logic operations at the molecular scale has the potential to improve the miniaturization and efficiency of a new generation of nanoscale computing devices. Two-dimensional photon-echo spectroscopy is capable of resolving dynamical pathways on electronic and vibrational molecular states. We experimentally demonstrate the implementation of molecular decision trees, logic operations where all possible values of inputs are processed in parallel and the outputs are read simultaneously, by probing the laser-induced dynamics of populations and coherences in a rhodamine dye mounted on a short DNA duplex. The inputs are provided by the bilinear interactions between the molecule and the laser pulses, and the output values are read from the two-dimensional molecular response at specific frequencies. Our results highlights how ultrafast dynamics between multiple molecular states induced by light–matter interactions can be used as an advantage for performing complex logic operations in parallel, operations that are faster than electrical switching. PMID:25984269
Real-time electron dynamics for massively parallel excited-state simulations
NASA Astrophysics Data System (ADS)
Andrade, Xavier
The simulation of the real-time dynamics of electrons, based on time dependent density functional theory (TDDFT), is a powerful approach to study electronic excited states in molecular and crystalline systems. What makes the method attractive is its flexibility to simulate different kinds of phenomena beyond the linear-response regime, including strongly-perturbed electronic systems and non-adiabatic electron-ion dynamics. Electron-dynamics simulations are also attractive from a computational point of view. They can run efficiently on massively parallel architectures due to the low communication requirements. Our implementations of electron dynamics, based on the codes Octopus (real-space) and Qball (plane-waves), allow us to simulate systems composed of thousands of atoms and to obtain good parallel scaling up to 1.6 million processor cores. Due to the versatility of real-time electron dynamics and its parallel performance, we expect it to become the method of choice to apply the capabilities of exascale supercomputers for the simulation of electronic excited states.
Modeling high-temperature superconductors and metallic alloys on the Intel IPSC/860
NASA Astrophysics Data System (ADS)
Geist, G. A.; Peyton, B. W.; Shelton, W. A.; Stocks, G. M.
Oak Ridge National Laboratory has embarked on several computational Grand Challenges, which require the close cooperation of physicists, mathematicians, and computer scientists. One of these projects is the determination of the material properties of alloys from first principles and, in particular, the electronic structure of high-temperature superconductors. While the present focus of the project is on superconductivity, the approach is general enough to permit study of other properties of metallic alloys such as strength and magnetic properties. This paper describes the progress to date on this project. We include a description of a self-consistent KKR-CPA method, parallelization of the model, and the incorporation of a dynamic load balancing scheme into the algorithm. We also describe the development and performance of a consolidated KKR-CPA code capable of running on CRAYs, workstations, and several parallel computers without source code modification. Performance of this code on the Intel iPSC/860 is also compared to a CRAY 2, CRAY YMP, and several workstations. Finally, some density of state calculations of two perovskite superconductors are given.
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
NASA Astrophysics Data System (ADS)
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; Stuehn, Torsten
2017-11-01
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach, the theoretical modeling and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. These two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.
JANUS: A Compilation System for Balancing Parallelism and Performance in OpenVX
NASA Astrophysics Data System (ADS)
Omidian, Hossein; Lemieux, Guy G. F.
2018-04-01
Embedded systems typically do not have enough on-chip memory for entire an image buffer. Programming systems like OpenCV operate on entire image frames at each step, making them use excessive memory bandwidth and power. In contrast, the paradigm used by OpenVX is much more efficient; it uses image tiling, and the compilation system is allowed to analyze and optimize the operation sequence, specified as a compute graph, before doing any pixel processing. In this work, we are building a compilation system for OpenVX that can analyze and optimize the compute graph to take advantage of parallel resources in many-core systems or FPGAs. Using a database of prewritten OpenVX kernels, it automatically adjusts the image tile size as well as using kernel duplication and coalescing to meet a defined area (resource) target, or to meet a specified throughput target. This allows a single compute graph to target implementations with a wide range of performance needs or capabilities, e.g. from handheld to datacenter, that use minimal resources and power to reach the performance target.
Electrically controlled polymeric gel actuators
Adolf, Douglas B.; Shahinpoor, Mohsen; Segalman, Daniel J.; Witkowski, Walter R.
1993-01-01
Electrically controlled polymeric gel actuators or synthetic muscles capable of undergoing substantial expansion and contraction when subjected to changing pH environments, temperature, or solvent. The actuators employ compliant containers for the gels and their solvents. The gels employed may be cylindrical electromechanical gel fibers such as polyacrylamide fibers or a mixture of poly vinyl alcohol-polyacrylic acid arranged in a parallel aggregate and contained in an electrolytic solvent bath such as salt water. The invention includes smart, electrically activated devices exploiting this phenomenon. These devices are capable of being manipulated via active computer control as large displacement actuators for use in adaptive structure such as robots.
Electrically controlled polymeric gel actuators
Adolf, D.B.; Shahinpoor, M.; Segalman, D.J.; Witkowski, W.R.
1993-10-05
Electrically controlled polymeric gel actuators or synthetic muscles are described capable of undergoing substantial expansion and contraction when subjected to changing pH environments, temperature, or solvent. The actuators employ compliant containers for the gels and their solvents. The gels employed may be cylindrical electromechanical gel fibers such as polyacrylamide fibers or a mixture of poly vinyl alcohol-polyacrylic acid arranged in a parallel aggregate and contained in an electrolytic solvent bath such as salt water. The invention includes smart, electrically activated devices exploiting this phenomenon. These devices are capable of being manipulated via active computer control as large displacement actuators for use in adaptive structure such as robots. 11 figures.
NASA Technical Reports Server (NTRS)
Armstrong, Wilbur C.
1992-01-01
The piping in a liquid rocket can assume complex configurations due to multiple tanks, multiple engines, and structures that must be piped around. The capability to handle some of these complex configurations have been incorporated into the ADMIT code. The capability to modify the input on line has been implemented. The configurations allowed include multiple tanks, multiple engines, the splitting of a pipe into unequal segments going to different (or the same) engines. This program will handle the following type elements: straight pipes, bends, inline accumulators, tuned stub accumulators, Helmholtz resonators, parallel resonators, pumps, split pipes, multiple tanks, and multiple engines.
NASA Technical Reports Server (NTRS)
Armstrong, Wilbur C.
1992-01-01
The piping in a liquid rocket can assume complex configurations due to multiple tanks, multiple engines, and structures that must be piped around. The capability to handle some of these complex configurations have been incorporated into the SSFREQ code. The capability to modify the input on line has been implemented. The configurations allowed include multiple tanks, multiple engines, the splitting of a pipe into equal segments going to different (or the same) engines. This program will handle the following type elements: straight pipes, bends, inline accumulators, tuned stub accumulators, Helmholtz resonators, parallel resonators, pumps, split pipes, multiple tanks, and multiple engines.
NASA Astrophysics Data System (ADS)
Lambrecht, L.; Lamert, A.; Friederich, W.; Möller, T.; Boxberg, M. S.
2018-03-01
A nodal discontinuous Galerkin (NDG) approach is developed and implemented for the computation of viscoelastic wavefields in complex geological media. The NDG approach combines unstructured tetrahedral meshes with an element-wise, high-order spatial interpolation of the wavefield based on Lagrange polynomials. Numerical fluxes are computed from an exact solution of the heterogeneous Riemann problem. Our implementation offers capabilities for modelling viscoelastic wave propagation in 1-D, 2-D and 3-D settings of very different spatial scale with little logistical overhead. It allows the import of external tetrahedral meshes provided by independent meshing software and can be run in a parallel computing environment. Computation of adjoint wavefields and an interface for the computation of waveform sensitivity kernels are offered. The method is validated in 2-D and 3-D by comparison to analytical solutions and results from a spectral element method. The capabilities of the NDG method are demonstrated through a 3-D example case taken from tunnel seismics which considers high-frequency elastic wave propagation around a curved underground tunnel cutting through inclined and faulted sedimentary strata. The NDG method was coded into the open-source software package NEXD and is available from GitHub.
2nd Generation QUATARA Flight Computer Project
NASA Technical Reports Server (NTRS)
Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven
2015-01-01
Single core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators and transmitting/receiving from/to a ground station. In addition, this flight computer will be designed to be fault tolerant by creating both a robust physical hardware connection and by using a software voting scheme to determine the processor's performance. This voting scheme will leverage on the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.
The history of MR imaging as seen through the pages of radiology.
Edelman, Robert R
2014-11-01
The first reports in Radiology pertaining to magnetic resonance (MR) imaging were published in 1980, 7 years after Paul Lauterbur pioneered the first MR images and 9 years after the first human computed tomographic images were obtained. Historical advances in the research and clinical applications of MR imaging very much parallel the remarkable advances in MR imaging technology. These advances can be roughly classified into hardware (eg, magnets, gradients, radiofrequency [RF] coils, RF transmitter and receiver, MR imaging-compatible biopsy devices) and imaging techniques (eg, pulse sequences, parallel imaging, and so forth). Image quality has been dramatically improved with the introduction of high-field-strength superconducting magnets, digital RF systems, and phased-array coils. Hybrid systems, such as MR/positron emission tomography (PET), combine the superb anatomic and functional imaging capabilities of MR imaging with the unsurpassed capability of PET to demonstrate tissue metabolism. Supported by the improvements in hardware, advances in pulse sequence design and image reconstruction techniques have spurred dramatic improvements in imaging speed and the capability for studying tissue function. In this historical review, the history of MR imaging technology and developing research and clinical applications, as seen through the pages of Radiology, will be considered.
Parallel computing in enterprise modeling.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.
2008-08-01
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priorimore » ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.« less
NASA Astrophysics Data System (ADS)
Larour, Eric; Utke, Jean; Bovin, Anton; Morlighem, Mathieu; Perez, Gilberto
2016-11-01
Within the framework of sea-level rise projections, there is a strong need for hindcast validation of the evolution of polar ice sheets in a way that tightly matches observational records (from radar, gravity, and altimetry observations mainly). However, the computational requirements for making hindcast reconstructions possible are severe and rely mainly on the evaluation of the adjoint state of transient ice-flow models. Here, we look at the computation of adjoints in the context of the NASA/JPL/UCI Ice Sheet System Model (ISSM), written in C++ and designed for parallel execution with MPI. We present the adaptations required in the way the software is designed and written, but also generic adaptations in the tools facilitating the adjoint computations. We concentrate on the use of operator overloading coupled with the AdjoinableMPI library to achieve the adjoint computation of the ISSM. We present a comprehensive approach to (1) carry out type changing through the ISSM, hence facilitating operator overloading, (2) bind to external solvers such as MUMPS and GSL-LU, and (3) handle MPI-based parallelism to scale the capability. We demonstrate the success of the approach by computing sensitivities of hindcast metrics such as the misfit to observed records of surface altimetry on the northeastern Greenland Ice Stream, or the misfit to observed records of surface velocities on Upernavik Glacier, central West Greenland. We also provide metrics for the scalability of the approach, and the expected performance. This approach has the potential to enable a new generation of hindcast-validated projections that make full use of the wealth of datasets currently being collected, or already collected, in Greenland and Antarctica.
NASA Astrophysics Data System (ADS)
Perez, G. L.; Larour, E. Y.; Morlighem, M.
2016-12-01
Within the framework of sea-level rise projections, there is a strong need for hindcast validation of the evolution of polar ice sheets in a way that tightly matches observational records (from radar and altimetry observations mainly). However, the computational requirements for making hindcast reconstructions possible are severe and rely mainly on the evaluation of the adjoint state of transient ice-flow models. Here, we look at the computation of adjoints in the context of the NASA/JPL/UCI Ice Sheet System Model, written in C++ and designed for parallel execution with MPI. We present the adaptations required in the way the software is designed and written but also generic adaptations in the tools facilitating the adjoint computations. We concentrate on the use of operator overloading coupled with the AdjoinableMPI library to achieve the adjoint computation of ISSM. We present a comprehensive approach to 1) carry out type changing through ISSM, hence facilitating operator overloading, 2) bind to external solvers such as MUMPS and GSL-LU and 3) handle MPI-based parallelism to scale the capability. We demonstrate the success of the approach by computing sensitivities of hindcast metrics such as the misfit to observed records of surface altimetry on the North-East Greenland Ice Stream, or the misfit to observed records of surface velocities on Upernavik Glacier, Central West Greenland. We also provide metrics for the scalability of the approach, and the expected performance. This approach has the potential of enabling a new generation of hindcast-validated projections that make full use of the wealth of datasets currently being collected, or alreay collected in Greenland and Antarctica, such as surface altimetry, surface velocities, and/or gravity measurements.
Parallel computing for automated model calibration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Burke, John S.; Danielson, Gary R.; Schulz, Douglas A.
2002-07-29
Natural resources model calibration is a significant burden on computing and staff resources in modeling efforts. Most assessments must consider multiple calibration objectives (for example magnitude and timing of stream flow peak). An automated calibration process that allows real time updating of data/models, allowing scientists to focus effort on improving models is needed. We are in the process of building a fully featured multi objective calibration tool capable of processing multiple models cheaply and efficiently using null cycle computing. Our parallel processing and calibration software routines have been generically, but our focus has been on natural resources model calibration. Somore » far, the natural resources models have been friendly to parallel calibration efforts in that they require no inter-process communication, only need a small amount of input data and only output a small amount of statistical information for each calibration run. A typical auto calibration run might involve running a model 10,000 times with a variety of input parameters and summary statistical output. In the past model calibration has been done against individual models for each data set. The individual model runs are relatively fast, ranging from seconds to minutes. The process was run on a single computer using a simple iterative process. We have completed two Auto Calibration prototypes and are currently designing a more feature rich tool. Our prototypes have focused on running the calibration in a distributed computing cross platform environment. They allow incorporation of?smart? calibration parameter generation (using artificial intelligence processing techniques). Null cycle computing similar to SETI@Home has also been a focus of our efforts. This paper details the design of the latest prototype and discusses our plans for the next revision of the software.« less
The Research of the Parallel Computing Development from the Angle of Cloud Computing
NASA Astrophysics Data System (ADS)
Peng, Zhensheng; Gong, Qingge; Duan, Yanyu; Wang, Yun
2017-10-01
Cloud computing is the development of parallel computing, distributed computing and grid computing. The development of cloud computing makes parallel computing come into people’s lives. Firstly, this paper expounds the concept of cloud computing and introduces two several traditional parallel programming model. Secondly, it analyzes and studies the principles, advantages and disadvantages of OpenMP, MPI and Map Reduce respectively. Finally, it takes MPI, OpenMP models compared to Map Reduce from the angle of cloud computing. The results of this paper are intended to provide a reference for the development of parallel computing.
Spatial data analytics on heterogeneous multi- and many-core parallel architectures using python
Laura, Jason R.; Rey, Sergio J.
2017-01-01
Parallel vector spatial analysis concerns the application of parallel computational methods to facilitate vector-based spatial analysis. The history of parallel computation in spatial analysis is reviewed, and this work is placed into the broader context of high-performance computing (HPC) and parallelization research. The rise of cyber infrastructure and its manifestation in spatial analysis as CyberGIScience is seen as a main driver of renewed interest in parallel computation in the spatial sciences. Key problems in spatial analysis that have been the focus of parallel computing are covered. Chief among these are spatial optimization problems, computational geometric problems including polygonization and spatial contiguity detection, the use of Monte Carlo Markov chain simulation in spatial statistics, and parallel implementations of spatial econometric methods. Future directions for research on parallelization in computational spatial analysis are outlined.
NASA Technical Reports Server (NTRS)
Goodwin, Sabine A.; Raj, P.
1999-01-01
Progress to date towards the development and validation of a fast, accurate and cost-effective aeroelastic method for advanced parallel computing platforms such as the IBM SP2 and the SGI Origin 2000 is presented in this paper. The ENSAERO code, developed at the NASA-Ames Research Center has been selected for this effort. The code allows for the computation of aeroelastic responses by simultaneously integrating the Euler or Navier-Stokes equations and the modal structural equations of motion. To assess the computational performance and accuracy of the ENSAERO code, this paper reports the results of the Navier-Stokes simulations of the transonic flow over a flexible aeroelastic wing body configuration. In addition, a forced harmonic oscillation analysis in the frequency domain and an analysis in the time domain are done on a wing undergoing a rigid pitch and plunge motion. Finally, to demonstrate the ENSAERO flutter-analysis capability, aeroelastic Euler and Navier-Stokes computations on an L-1011 wind tunnel model including pylon, nacelle and empennage are underway. All computational solutions are compared with experimental data to assess the level of accuracy of ENSAERO. As the computations described above are performed, a meticulous log of computational performance in terms of wall clock time, execution speed, memory and disk storage is kept. Code scalability is also demonstrated by studying the impact of varying the number of processors on computational performance on the IBM SP2 and the Origin 2000 systems.
GASPRNG: GPU accelerated scalable parallel random number generator library
NASA Astrophysics Data System (ADS)
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB˜ 732 MB (main memory on host CPU, depending on the data type of random numbers.) / 512 MB (GPU global memory) Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generators library to allow a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
Algorithm and code development for unsteady three-dimensional Navier-Stokes equations
NASA Technical Reports Server (NTRS)
Obayashi, Shigeru
1994-01-01
Aeroelastic tests require extensive cost and risk. An aeroelastic wind-tunnel experiment is an order of magnitude more expensive than a parallel experiment involving only aerodynamics. By complementing the wind-tunnel experiments with numerical simulations, the overall cost of the development of aircraft can be considerably reduced. In order to accurately compute aeroelastic phenomenon it is necessary to solve the unsteady Euler/Navier-Stokes equations simultaneously with the structural equations of motion. These equations accurately describe the flow phenomena for aeroelastic applications. At ARC a code, ENSAERO, is being developed for computing the unsteady aerodynamics and aeroelasticity of aircraft, and it solves the Euler/Navier-Stokes equations. The purpose of this cooperative agreement was to enhance ENSAERO in both algorithm and geometric capabilities. During the last five years, the algorithms of the code have been enhanced extensively by using high-resolution upwind algorithms and efficient implicit solvers. The zonal capability of the code has been extended from a one-to-one grid interface to a mismatching unsteady zonal interface. The geometric capability of the code has been extended from a single oscillating wing case to a full-span wing-body configuration with oscillating control surfaces. Each time a new capability was added, a proper validation case was simulated, and the capability of the code was demonstrated.
Issues in human/computer control of dexterous remote hands
NASA Technical Reports Server (NTRS)
Salisbury, K.
1987-01-01
Much research on dexterous robot hands has been aimed at the design and control problems associated with their autonomous operation, while relatively little research has addressed the problem of direct human control. It is likely that these two modes can be combined in a complementary manner yielding more capability than either alone could provide. While many of the issues in mixed computer/human control of dexterous hands parallel those found in supervisory control of traditional remote manipulators, the unique geometry and capabilities of dexterous hands pose many new problems. Among these are the control of redundant degrees of freedom, grasp stabilization and specification of non-anthropomorphic behavior. An overview is given of progress made at the MIT AI Laboratory in control of the Salisbury 3 finger hand, including experiments in grasp planning and manipulation via controlled slip. It is also suggested how we might introduce human control into the process at a variety of functional levels.
Introducing Argonne’s Theta Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
None
Theta, the Argonne Leadership Computing Facility’s (ALCF) new Intel-Cray supercomputer, is officially open to the research community. Theta’s massively parallel, many-core architecture puts the ALCF on the path to Aurora, the facility’s future Intel-Cray system. Capable of nearly 10 quadrillion calculations per second, Theta enables researchers to break new ground in scientific investigations that range from modeling the inner workings of the brain to developing new materials for renewable energy applications.
Visualization Software for VisIT Java Client
DOE Office of Scientific and Technical Information (OSTI.GOV)
Billings, Jay Jay; Smith, Robert W
The VisIT Java Client (JVC) library is a lightweight thin client that is designed and written purely in the native language of Java (the Python & JavaScript versions of the library use the same concept) and communicates with any new unmodified standalone version of VisIT, a high performance computing parallel visualization toolkit, over traditional or web sockets and dynamically determines capabilities of the running VisIT instance whether local or remote.
Fast Computation of the Two-Point Correlation Function in the Age of Big Data
NASA Astrophysics Data System (ADS)
Pellegrino, Andrew; Timlin, John
2018-01-01
We present a new code which quickly computes the two-point correlation function for large sets of astronomical data. This code combines the ease of use of Python with the speed of parallel shared libraries written in C. We include the capability to compute the auto- and cross-correlation statistics, and allow the user to calculate the three-dimensional and angular correlation functions. Additionally, the code automatically divides the user-provided sky masks into contiguous subsamples of similar size, using the HEALPix pixelization scheme, for the purpose of resampling. Errors are computed using jackknife and bootstrap resampling in a way that adds negligible extra runtime, even with many subsamples. We demonstrate comparable speed with other clustering codes, and code accuracy compared to known and analytic results.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing
NASA Astrophysics Data System (ADS)
Johl, John T.; Baker, Nick C.
1988-10-01
The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
Xia, Yidong; Lou, Jialin; Luo, Hong; ...
2015-02-09
Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementationmore » of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.« less
Implementing Molecular Dynamics for Hybrid High Performance Computers - 1. Short Range Forces
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Wang, Peng; Plimpton, Steven J
The use of accelerators such as general-purpose graphics processing units (GPGPUs) have become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines - 1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory,more » 2) minimizing the amount of code that must be ported for efficient acceleration, 3) utilizing the available processing power from both many-core CPUs and accelerators, and 4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS. We describe algorithms for efficient short range force calculation on hybrid high performance machines. We describe a new approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPGPUs and 180 CPU cores.« less
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J
2004-09-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
Broadcasting collective operation contributions throughout a parallel computer
Faraj, Ahmad [Rochester, MN
2012-02-21
Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
Application Portable Parallel Library
NASA Technical Reports Server (NTRS)
Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott
1995-01-01
Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.
Lanczos eigensolution method for high-performance computers
NASA Technical Reports Server (NTRS)
Bostic, Susan W.
1991-01-01
The theory, computational analysis, and applications are presented of a Lanczos algorithm on high performance computers. The computationally intensive steps of the algorithm are identified as: the matrix factorization, the forward/backward equation solution, and the matrix vector multiples. These computational steps are optimized to exploit the vector and parallel capabilities of high performance computers. The savings in computational time from applying optimization techniques such as: variable band and sparse data storage and access, loop unrolling, use of local memory, and compiler directives are presented. Two large scale structural analysis applications are described: the buckling of a composite blade stiffened panel with a cutout, and the vibration analysis of a high speed civil transport. The sequential computational time for the panel problem executed on a CONVEX computer of 181.6 seconds was decreased to 14.1 seconds with the optimized vector algorithm. The best computational time of 23 seconds for the transport problem with 17,000 degs of freedom was on the the Cray-YMP using an average of 3.63 processors.
High-performance computing in image registration
NASA Astrophysics Data System (ADS)
Zanin, Michele; Remondino, Fabio; Dalla Mura, Mauro
2012-10-01
Thanks to the recent technological advances, a large variety of image data is at our disposal with variable geometric, radiometric and temporal resolution. In many applications the processing of such images needs high performance computing techniques in order to deliver timely responses e.g. for rapid decisions or real-time actions. Thus, parallel or distributed computing methods, Digital Signal Processor (DSP) architectures, Graphical Processing Unit (GPU) programming and Field-Programmable Gate Array (FPGA) devices have become essential tools for the challenging issue of processing large amount of geo-data. The article focuses on the processing and registration of large datasets of terrestrial and aerial images for 3D reconstruction, diagnostic purposes and monitoring of the environment. For the image alignment procedure, sets of corresponding feature points need to be automatically extracted in order to successively compute the geometric transformation that aligns the data. The feature extraction and matching are ones of the most computationally demanding operations in the processing chain thus, a great degree of automation and speed is mandatory. The details of the implemented operations (named LARES) exploiting parallel architectures and GPU are thus presented. The innovative aspects of the implementation are (i) the effectiveness on a large variety of unorganized and complex datasets, (ii) capability to work with high-resolution images and (iii) the speed of the computations. Examples and comparisons with standard CPU processing are also reported and commented.
A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing
Li, Zhixin; Su, Dandan; Zhu, Haijiang; Li, Wei; Zhang, Fan; Li, Ruirui
2017-01-01
Synthetic Aperture Radar (SAR) raw data simulation is a fundamental problem in radar system design and imaging algorithm research. The growth of surveying swath and resolution results in a significant increase in data volume and simulation period, which can be considered to be a comprehensive data intensive and computing intensive issue. Although several high performance computing (HPC) methods have demonstrated their potential for accelerating simulation, the input/output (I/O) bottleneck of huge raw data has not been eased. In this paper, we propose a cloud computing based SAR raw data simulation algorithm, which employs the MapReduce model to accelerate the raw data computing and the Hadoop distributed file system (HDFS) for fast I/O access. The MapReduce model is designed for the irregular parallel accumulation of raw data simulation, which greatly reduces the parallel efficiency of graphics processing unit (GPU) based simulation methods. In addition, three kinds of optimization strategies are put forward from the aspects of programming model, HDFS configuration and scheduling. The experimental results show that the cloud computing based algorithm achieves 4× speedup over the baseline serial approach in an 8-node cloud environment, and each optimization strategy can improve about 20%. This work proves that the proposed cloud algorithm is capable of solving the computing intensive and data intensive issues in SAR raw data simulation, and is easily extended to large scale computing to achieve higher acceleration. PMID:28075343
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations
NASA Astrophysics Data System (ADS)
Darmawan, J. B. B.; Mungkasi, S.
2017-01-01
In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are usually implemented in engineering science. Solving these equations fast and efficiently is desired. Therefore, we propose the use of parallel computation. Our parallel computation uses CUDA of the NVIDIA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.
ParCAT: A Parallel Climate Analysis Toolkit
NASA Astrophysics Data System (ADS)
Haugen, B.; Smith, B.; Steed, C.; Ricciuto, D. M.; Thornton, P. E.; Shipman, G.
2012-12-01
Climate science has employed increasingly complex models and simulations to analyze the past and predict the future of our climate. The size and dimensionality of climate simulation data has been growing with the complexity of the models. This growth in data is creating a widening gap between the data being produced and the tools necessary to analyze large, high dimensional data sets. With single run data sets increasing into 10's, 100's and even 1000's of gigabytes, parallel computing tools are becoming a necessity in order to analyze and compare climate simulation data. The Parallel Climate Analysis Toolkit (ParCAT) provides basic tools that efficiently use parallel computing techniques to narrow the gap between data set size and analysis tools. ParCAT was created as a collaborative effort between climate scientists and computer scientists in order to provide efficient parallel implementations of the computing tools that are of use to climate scientists. Some of the basic functionalities included in the toolkit are the ability to compute spatio-temporal means and variances, differences between two runs and histograms of the values in a data set. ParCAT is designed to facilitate the "heavy lifting" that is required for large, multidimensional data sets. The toolkit does not focus on performing the final visualizations and presentation of results but rather, reducing large data sets to smaller, more manageable summaries. The output from ParCAT is provided in commonly used file formats (NetCDF, CSV, ASCII) to allow for simple integration with other tools. The toolkit is currently implemented as a command line utility, but will likely also provide a C library for developers interested in tighter software integration. Elements of the toolkit are already being incorporated into projects such as UV-CDAT and CMDX. There is also an effort underway to implement portions of the CCSM Land Model Diagnostics package using ParCAT in conjunction with Python and gnuplot. ParCAT is implemented in C to provide efficient file IO. The file IO operations in the toolkit use the parallel-netcdf library; this enables the code to use the parallel IO capabilities of modern HPC systems. Analysis that currently requires an estimated 12+ hours with the traditional CCSM Land Model Diagnostics Package can now be performed in as little as 30 minutes on a single desktop workstation and a few minutes for relatively small jobs completed on modern HPC systems such as ORNL's Jaguar.
Increasing processor utilization during parallel computation rundown
NASA Technical Reports Server (NTRS)
Jones, W. H.
1986-01-01
Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.
Broadcasting a message in a parallel computer
Berg, Jeremy E [Rochester, MN; Faraj, Ahmad A [Rochester, MN
2011-08-02
Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.
Force user's manual: A portable, parallel FORTRAN
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.
1990-01-01
The use of Force, a parallel, portable FORTRAN on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, Alliant computers on which it is installed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jalving, Jordan; Abhyankar, Shrirang; Kim, Kibaek
Here, we present a computational framework that facilitates the construction, instantiation, and analysis of large-scale optimization and simulation applications of coupled energy networks. The framework integrates the optimization modeling package PLASMO and the simulation package DMNetwork (built around PETSc). These tools use a common graphbased abstraction that enables us to achieve compatibility between data structures and to build applications that use network models of different physical fidelity. We also describe how to embed these tools within complex computational workflows using SWIFT, which is a tool that facilitates parallel execution of multiple simulation runs and management of input and output data.more » We discuss how to use these capabilities to target coupled natural gas and electricity systems.« less
Jalving, Jordan; Abhyankar, Shrirang; Kim, Kibaek; ...
2017-04-24
Here, we present a computational framework that facilitates the construction, instantiation, and analysis of large-scale optimization and simulation applications of coupled energy networks. The framework integrates the optimization modeling package PLASMO and the simulation package DMNetwork (built around PETSc). These tools use a common graphbased abstraction that enables us to achieve compatibility between data structures and to build applications that use network models of different physical fidelity. We also describe how to embed these tools within complex computational workflows using SWIFT, which is a tool that facilitates parallel execution of multiple simulation runs and management of input and output data.more » We discuss how to use these capabilities to target coupled natural gas and electricity systems.« less
NASA Technical Reports Server (NTRS)
1988-01-01
The charter of the Structures Division is to perform and disseminate results of research conducted in support of aerospace engine structures. These results have a wide range of applicability to practioners of structural engineering mechanics beyond the aerospace arena. The specific purpose of the symposium was to familiarize the engineering structures community with the depth and range of research performed by the division and its academic and industrial partners. Sessions covered vibration control, fracture mechanics, ceramic component reliability, parallel computing, nondestructive evaluation, constitutive models and experimental capabilities, dynamic systems, fatigue and damage, wind turbines, hot section technology (HOST), aeroelasticity, structural mechanics codes, computational methods for dynamics, structural optimization, and applications of structural dynamics, and structural mechanics computer codes.
On the Use of CAD and Cartesian Methods for Aerodynamic Optimization
NASA Technical Reports Server (NTRS)
Nemec, M.; Aftosmis, M. J.; Pulliam, T. H.
2004-01-01
The objective for this paper is to present the development of an optimization capability for Curt3D, a Cartesian inviscid-flow analysis package. We present the construction of a new optimization framework and we focus on the following issues: 1) Component-based geometry parameterization approach using parametric-CAD models and CAPRI. A novel geometry server is introduced that addresses the issue of parallel efficiency while only sparingly consuming CAD resources; 2) The use of genetic and gradient-based algorithms for three-dimensional aerodynamic design problems. The influence of noise on the optimization methods is studied. Our goal is to create a responsive and automated framework that efficiently identifies design modifications that result in substantial performance improvements. In addition, we examine the architectural issues associated with the deployment of a CAD-based approach in a heterogeneous parallel computing environment that contains both CAD workstations and dedicated compute engines. We demonstrate the effectiveness of the framework for a design problem that features topology changes and complex geometry.
High Performance Geostatistical Modeling of Biospheric Resources
NASA Astrophysics Data System (ADS)
Pedelty, J. A.; Morisette, J. T.; Smith, J. A.; Schnase, J. L.; Crosier, C. S.; Stohlgren, T. J.
2004-12-01
We are using parallel geostatistical codes to study spatial relationships among biospheric resources in several study areas. For example, spatial statistical models based on large- and small-scale variability have been used to predict species richness of both native and exotic plants (hot spots of diversity) and patterns of exotic plant invasion. However, broader use of geostastics in natural resource modeling, especially at regional and national scales, has been limited due to the large computing requirements of these applications. To address this problem, we implemented parallel versions of the kriging spatial interpolation algorithm. The first uses the Message Passing Interface (MPI) in a master/slave paradigm on an open source Linux Beowulf cluster, while the second is implemented with the new proprietary Xgrid distributed processing system on an Xserve G5 cluster from Apple Computer, Inc. These techniques are proving effective and provide the basis for a national decision support capability for invasive species management that is being jointly developed by NASA and the US Geological Survey.
High Performance Radiation Transport Simulations on TITAN
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Christopher G; Davidson, Gregory G; Evans, Thomas M
2012-01-01
In this paper we describe the Denovo code system. Denovo solves the six-dimensional, steady-state, linear Boltzmann transport equation, of central importance to nuclear technology applications such as reactor core analysis (neutronics), radiation shielding, nuclear forensics and radiation detection. The code features multiple spatial differencing schemes, state-of-the-art linear solvers, the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm for inverting the transport operator, a new multilevel energy decomposition method scaling to hundreds of thousands of processing cores, and a modern, novel code architecture that supports straightforward integration of new features. In this paper we discuss the performance of Denovo on the 10--20 petaflop ORNLmore » GPU-based system, Titan. We describe algorithms and techniques used to exploit the capabilities of Titan's heterogeneous compute node architecture and the challenges of obtaining good parallel performance for this sparse hyperbolic PDE solver containing inherently sequential computations. Numerical results demonstrating Denovo performance on early Titan hardware are presented.« less
MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.
Gonzalez-Dominguez, Jorge; Martin, Maria J
2017-10-10
In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expenses of relatively long runtimes for large scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
NASA Technical Reports Server (NTRS)
Schumann, H. H. (Principal Investigator)
1972-01-01
The author has identified the following significant results. Preliminary analysis of DCS data from the USGS Verde River stream flow measuring site indicates the DCS system is furnishing high quality data more frequently than had been expected. During the 43-day period between Nov. 3, and Dec. 15, 1972, 552 DCS transmissions were received during 193 data passes. The amount of data received far exceeded the single high quality transmission per 12-hour period expected from the DCS system. The digital-parallel ERTS-1 data has furnished sufficient to accurately compute mean daily gage heights. These in turn, are used to compute average daily streamflow rates during periods of stable or slowly changing flow conditions. The digital-parallel data has also furnished useful information during peak flow periods. However, the serial-digital DCS capability, currently under development for transmitting streamflow data, should provide data of greater utility for determining times of flood peaks.
NASA Astrophysics Data System (ADS)
Laracuente, Nicholas; Grossman, Carl
2013-03-01
We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics pci card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College
NASA Astrophysics Data System (ADS)
Kavehei, Omid; Linn, Eike; Nielen, Lutz; Tappertzhofen, Stefan; Skafidas, Efstratios; Valov, Ilia; Waser, Rainer
2013-05-01
We report on the implementation of an Associative Capacitive Network (ACN) based on the nondestructive capacitive readout of two Complementary Resistive Switches (2-CRSs). ACNs are capable of performing a fully parallel search for Hamming distances (i.e. similarity) between input and stored templates. Unlike conventional associative memories where charge retention is a key function and hence, they require frequent refresh cycles, in ACNs, information is retained in a nonvolatile resistive state and normal tasks are carried out through capacitive coupling between input and output nodes. Each device consists of two CRS cells and no selective element is needed, therefore, CMOS circuitry is only required in the periphery, for addressing and read-out. Highly parallel processing, nonvolatility, wide interconnectivity and low-energy consumption are significant advantages of ACNs over conventional and emerging associative memories. These characteristics make ACNs one of the promising candidates for applications in memory-intensive and cognitive computing, switches and routers as binary and ternary Content Addressable Memories (CAMs) and intelligent data processing.
Architecture Adaptive Computing Environment
NASA Technical Reports Server (NTRS)
Dorband, John E.
2006-01-01
Architecture Adaptive Computing Environment (aCe) is a software system that includes a language, compiler, and run-time library for parallel computing. aCe was developed to enable programmers to write programs, more easily than was previously possible, for a variety of parallel computing architectures. Heretofore, it has been perceived to be difficult to write parallel programs for parallel computers and more difficult to port the programs to different parallel computing architectures. In contrast, aCe is supportable on all high-performance computing architectures. Currently, it is supported on LINUX clusters. aCe uses parallel programming constructs that facilitate writing of parallel programs. Such constructs were used in single-instruction/multiple-data (SIMD) programming languages of the 1980s, including Parallel Pascal, Parallel Forth, C*, *LISP, and MasPar MPL. In aCe, these constructs are extended and implemented for both SIMD and multiple- instruction/multiple-data (MIMD) architectures. Two new constructs incorporated in aCe are those of (1) scalar and virtual variables and (2) pre-computed paths. The scalar-and-virtual-variables construct increases flexibility in optimizing memory utilization in various architectures. The pre-computed-paths construct enables the compiler to pre-compute part of a communication operation once, rather than computing it every time the communication operation is performed.
Quaglio, Pietro; Yegenoglu, Alper; Torre, Emiliano; Endres, Dominik M; Grün, Sonja
2017-01-01
Repeated, precise sequences of spikes are largely considered a signature of activation of cell assemblies. These repeated sequences are commonly known under the name of spatio-temporal patterns (STPs). STPs are hypothesized to play a role in the communication of information in the computational process operated by the cerebral cortex. A variety of statistical methods for the detection of STPs have been developed and applied to electrophysiological recordings, but such methods scale poorly with the current size of available parallel spike train recordings (more than 100 neurons). In this work, we introduce a novel method capable of overcoming the computational and statistical limits of existing analysis techniques in detecting repeating STPs within massively parallel spike trains (MPST). We employ advanced data mining techniques to efficiently extract repeating sequences of spikes from the data. Then, we introduce and compare two alternative approaches to distinguish statistically significant patterns from chance sequences. The first approach uses a measure known as conceptual stability, of which we investigate a computationally cheap approximation for applications to such large data sets. The second approach is based on the evaluation of pattern statistical significance. In particular, we provide an extension to STPs of a method we recently introduced for the evaluation of statistical significance of synchronous spike patterns. The performance of the two approaches is evaluated in terms of computational load and statistical power on a variety of artificial data sets that replicate specific features of experimental data. Both methods provide an effective and robust procedure for detection of STPs in MPST data. The method based on significance evaluation shows the best overall performance, although at a higher computational cost. We name the novel procedure the spatio-temporal Spike PAttern Detection and Evaluation (SPADE) analysis.
Quaglio, Pietro; Yegenoglu, Alper; Torre, Emiliano; Endres, Dominik M.; Grün, Sonja
2017-01-01
Repeated, precise sequences of spikes are largely considered a signature of activation of cell assemblies. These repeated sequences are commonly known under the name of spatio-temporal patterns (STPs). STPs are hypothesized to play a role in the communication of information in the computational process operated by the cerebral cortex. A variety of statistical methods for the detection of STPs have been developed and applied to electrophysiological recordings, but such methods scale poorly with the current size of available parallel spike train recordings (more than 100 neurons). In this work, we introduce a novel method capable of overcoming the computational and statistical limits of existing analysis techniques in detecting repeating STPs within massively parallel spike trains (MPST). We employ advanced data mining techniques to efficiently extract repeating sequences of spikes from the data. Then, we introduce and compare two alternative approaches to distinguish statistically significant patterns from chance sequences. The first approach uses a measure known as conceptual stability, of which we investigate a computationally cheap approximation for applications to such large data sets. The second approach is based on the evaluation of pattern statistical significance. In particular, we provide an extension to STPs of a method we recently introduced for the evaluation of statistical significance of synchronous spike patterns. The performance of the two approaches is evaluated in terms of computational load and statistical power on a variety of artificial data sets that replicate specific features of experimental data. Both methods provide an effective and robust procedure for detection of STPs in MPST data. The method based on significance evaluation shows the best overall performance, although at a higher computational cost. We name the novel procedure the spatio-temporal Spike PAttern Detection and Evaluation (SPADE) analysis. PMID:28596729
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of manymore » computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.« less
Distributing an executable job load file to compute nodes in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gooding, Thomas M.
Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications linkmore » over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.« less
Laboratory on Legs: An Architecture for Adjustable Morphology with Legged Robots
2012-04-01
fit within the body of the robot. Additional capabilities will largely depend upon a given activity, and should be easily reconfigurable to maximize...mobile robots, the essential units of actuation, computation, and sensing must be designed to fit within the body of the robot. Additional...PackBot,36 among others. Two parallel rails, 40 cm long and spaced at a center-to-center distance of 14 cm, span the length of the each robot’s body
Numerical Prediction of Non-Reacting and Reacting Flow in a Model Gas Turbine Combustor
NASA Technical Reports Server (NTRS)
Davoudzadeh, Farhad; Liu, Nan-Suey
2005-01-01
The three-dimensional, viscous, turbulent, reacting and non-reacting flow characteristics of a model gas turbine combustor operating on air/methane are simulated via an unstructured and massively parallel Reynolds-Averaged Navier-Stokes (RANS) code. This serves to demonstrate the capabilities of the code for design and analysis of real combustor engines. The effects of some design features of combustors are examined. In addition, the computed results are validated against experimental data.
NASA Astrophysics Data System (ADS)
White, Christopher Joseph
We describe the implementation of sophisticated numerical techniques for general-relativistic magnetohydrodynamics simulations in the Athena++ code framework. Improvements over many existing codes include the use of advanced Riemann solvers and of staggered-mesh constrained transport. Combined with considerations for computational performance and parallel scalability, these allow us to investigate black hole accretion flows with unprecedented accuracy. The capability of the code is demonstrated by exploring magnetically arrested disks.
Glaser, I
1982-04-01
By combining a lenslet array with masks it is possible to obtain a noncoherent optical processor capable of computing in parallel generalized 2-D discrete linear transformations. We present here an analysis of such lenslet array processors (LAP). The effect of several errors, including optical aberrations, diffraction, vignetting, and geometrical and mask errors, are calculated, and guidelines to optical design of LAP are derived. Using these results, both ultimate and practical performances of LAP are compared with those of competing techniques.
Time-Dependent Simulations of Incompressible Flow in a Turbopump Using Overset Grid Approach
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Kwak, Dochan
2001-01-01
This viewgraph presentation provides information on mathematical modelling of the SSME (space shuttle main engine). The unsteady SSME-rig1 start-up procedure from the pump at rest has been initiated by using 34.3 million grid points. The computational model for the SSME-rig1 has been completed. Moving boundary capability is obtained by using DCF module in OVERFLOW-D. MPI (Message Passing Interface)/OpenMP hybrid parallel code has been benchmarked.
Wang, Ke; Nirmalathas, Ampalavanapillai; Lim, Christina; Skafidas, Efstratios; Alameh, Kamal
2013-07-01
In this paper, we propose and experimentally demonstrate a free-space based high-speed reconfigurable card-to-card optical interconnect architecture with broadcast capability, which is required for control functionalities and efficient parallel computing applications. Experimental results show that 10 Gb/s data can be broadcast to all receiving channels for up to 30 cm with a worst-case receiver sensitivity better than -12.20 dBm. In addition, arbitrary multicasting with the same architecture is also investigated. 10 Gb/s reconfigurable point-to-point link and multicast channels are simultaneously demonstrated with a measured receiver sensitivity power penalty of ~1.3 dB due to crosstalk.
Automatic Data Traffic Control on DSM Architecture
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry; Kwak, Dochan (Technical Monitor)
2000-01-01
We study data traffic on distributed shared memory machines and conclude that data placement and grouping improve performance of scientific codes. We present several methods which user can employ to improve data traffic in his code. We report on implementation of a tool which detects the code fragments causing data congestions and advises user on improvements of data routing in these fragments. The capabilities of the tool include deduction of data alignment and affinity from the source code; detection of the code constructs having abnormally high cache or TLB misses; generation of data placement constructs. We demonstrate the capabilities of the tool on experiments with NAS parallel benchmarks and with a simple computational fluid dynamics application ARC3D.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Loikith, P.; Lee, H.; McGibbney, L. J.; Whitehall, K. D.
2014-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. The goals of SciSpark are to: (1) Decrease the time to compute comparison statistics and plots from minutes to seconds; (2) Allow for interactive exploration of time-series properties over seasons and years; (3) Decrease the time for satellite data ingestion into RCMES to hours; (4) Allow for Level-2 comparisons with higher-order statistics or PDF's in minutes to hours; and (5) Move RCMES into a near real time decision-making platform. We will report on: the architecture and design of SciSpark, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
CSM research: Methods and application studies
NASA Technical Reports Server (NTRS)
Knight, Norman F., Jr.
1989-01-01
Computational mechanics is that discipline of applied science and engineering devoted to the study of physical phenomena by means of computational methods based on mathematical modeling and simulation, utilizing digital computers. The discipline combines theoretical and applied mechanics, approximation theory, numerical analysis, and computer science. Computational mechanics has had a major impact on engineering analysis and design. When applied to structural mechanics, the discipline is referred to herein as computational structural mechanics. Complex structures being considered by NASA for the 1990's include composite primary aircraft structures and the space station. These structures will be much more difficult to analyze than today's structures and necessitate a major upgrade in computerized structural analysis technology. NASA has initiated a research activity in structural analysis called Computational Structural Mechanics (CSM). The broad objective of the CSM activity is to develop advanced structural analysis technology that will exploit modern and emerging computers, such as those with vector and/or parallel processing capabilities. Here, the current research directions for the Methods and Application Studies Team of the Langley CSM activity are described.
Design for Run-Time Monitor on Cloud Computing
NASA Astrophysics Data System (ADS)
Kang, Mikyung; Kang, Dong-In; Yun, Mira; Park, Gyung-Leen; Lee, Junghoon
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is the type of a parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring the system status change, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize resources on cloud computing. RTM monitors application software through library instrumentation as well as underlying hardware through performance counter optimizing its computing configuration based on the analyzed data.
Large-Scale NASA Science Applications on the Columbia Supercluster
NASA Technical Reports Server (NTRS)
Brooks, Walter
2005-01-01
Columbia, NASA's newest 61 teraflops supercomputer that became operational late last year, is a highly integrated Altix cluster of 10,240 processors, and was named to honor the crew of the Space Shuttle lost in early 2003. Constructed in just four months, Columbia increased NASA's computing capability ten-fold, and revitalized the Agency's high-end computing efforts. Significant cutting-edge science and engineering simulations in the areas of space and Earth sciences, as well as aeronautics and space operations, are already occurring on this largest operational Linux supercomputer, demonstrating its capacity and capability to accelerate NASA's space exploration vision. The presentation will describe how an integrated environment consisting not only of next-generation systems, but also modeling and simulation, high-speed networking, parallel performance optimization, and advanced data analysis and visualization, is being used to reduce design cycle time, accelerate scientific discovery, conduct parametric analysis of multiple scenarios, and enhance safety during the life cycle of NASA missions. The talk will conclude by discussing how NAS partnered with various NASA centers, other government agencies, computer industry, and academia, to create a national resource in large-scale modeling and simulation.
Two-dimensional nonsteady viscous flow simulation on the Navier-Stokes computer miniNode
NASA Technical Reports Server (NTRS)
Nosenchuck, Daniel M.; Littman, Michael G.; Flannery, William
1986-01-01
The needs of large-scale scientific computation are outpacing the growth in performance of mainframe supercomputers. In particular, problems in fluid mechanics involving complex flow simulations require far more speed and capacity than that provided by current and proposed Class VI supercomputers. To address this concern, the Navier-Stokes Computer (NSC) was developed. The NSC is a parallel-processing machine, comprised of individual Nodes, each comparable in performance to current supercomputers. The global architecture is that of a hypercube, and a 128-Node NSC has been designed. New architectural features, such as a reconfigurable many-function ALU pipeline and a multifunction memory-ALU switch, have provided the capability to efficiently implement a wide range of algorithms. Efficient algorithms typically involve numerically intensive tasks, which often include conditional operations. These operations may be efficiently implemented on the NSC without, in general, sacrificing vector-processing speed. To illustrate the architecture, programming, and several of the capabilities of the NSC, the simulation of two-dimensional, nonsteady viscous flows on a prototype Node, called the miniNode, is presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Küchlin, Stephan, E-mail: kuechlin@ifd.mavt.ethz.ch; Jenny, Patrick
2017-01-01
A major challenge for the conventional Direct Simulation Monte Carlo (DSMC) technique lies in the fact that its computational cost becomes prohibitive in the near continuum regime, where the Knudsen number (Kn)—characterizing the degree of rarefaction—becomes small. In contrast, the Fokker–Planck (FP) based particle Monte Carlo scheme allows for computationally efficient simulations of rarefied gas flows in the low and intermediate Kn regime. The Fokker–Planck collision operator—instead of performing binary collisions employed by the DSMC method—integrates continuous stochastic processes for the phase space evolution in time. This allows for time step and grid cell sizes larger than the respective collisionalmore » scales required by DSMC. Dynamically switching between the FP and the DSMC collision operators in each computational cell is the basis of the combined FP-DSMC method, which has been proven successful in simulating flows covering the whole Kn range. Until recently, this algorithm had only been applied to two-dimensional test cases. In this contribution, we present the first general purpose implementation of the combined FP-DSMC method. Utilizing both shared- and distributed-memory parallelization, this implementation provides the capability for simulations involving many particles and complex geometries by exploiting state of the art computer cluster technologies.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Kohlmeyer, Axel; Plimpton, Steven J
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. In this paper, we present a continuation of previous work implementing algorithms for using accelerators into the LAMMPS molecular dynamics software for distributed memory parallel hybrid machines. In our previous work, we focused on acceleration for short-range models with anmore » approach intended to harness the processing power of both the accelerator and (multi-core) CPUs. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle-particle particle-mesh method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster. We provide a performance comparison of the same kernels compiled with both CUDA and OpenCL. We discuss limitations to parallel efficiency and future directions for improving performance on hybrid or heterogeneous computers.« less
Matching pursuit parallel decomposition of seismic data
NASA Astrophysics Data System (ADS)
Li, Chuanhui; Zhang, Fanchang
2017-07-01
In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.
NASA Astrophysics Data System (ADS)
Macomber, B.; Woollands, R. M.; Probe, A.; Younes, A.; Bai, X.; Junkins, J.
2013-09-01
Modified Chebyshev Picard Iteration (MCPI) is an iterative numerical method for approximating solutions of linear or non-linear Ordinary Differential Equations (ODEs) to obtain time histories of system state trajectories. Unlike other step-by-step differential equation solvers, the Runge-Kutta family of numerical integrators for example, MCPI approximates long arcs of the state trajectory with an iterative path approximation approach, and is ideally suited to parallel computation. Orthogonal Chebyshev Polynomials are used as basis functions during each path iteration; the integrations of the Picard iteration are then done analytically. Due to the orthogonality of the Chebyshev basis functions, the least square approximations are computed without matrix inversion; the coefficients are computed robustly from discrete inner products. As a consequence of discrete sampling and weighting adopted for the inner product definition, Runge phenomena errors are minimized near the ends of the approximation intervals. The MCPI algorithm utilizes a vector-matrix framework for computational efficiency. Additionally, all Chebyshev coefficients and integrand function evaluations are independent, meaning they can be simultaneously computed in parallel for further decreased computational cost. Over an order of magnitude speedup from traditional methods is achieved in serial processing, and an additional order of magnitude is achievable in parallel architectures. This paper presents a new MCPI library, a modular toolset designed to allow MCPI to be easily applied to a wide variety of ODE systems. Library users will not have to concern themselves with the underlying mathematics behind the MCPI method. Inputs are the boundary conditions of the dynamical system, the integrand function governing system behavior, and the desired time interval of integration, and the output is a time history of the system states over the interval of interest. Examples from the field of astrodynamics are presented to compare the output from the MCPI library to current state-of-practice numerical integration methods. It is shown that MCPI is capable of out-performing the state-of-practice in terms of computational cost and accuracy.
NASA Technical Reports Server (NTRS)
Kim, B. F.; Moorjani, K.; Phillips, T. E.; Adrian, F. J.; Bohandy, J.; Dolecek, Q. E.
1993-01-01
A method for characterization of granular superconducting thin films has been developed which encompasses both the morphological state of the sample and its fabrication process parameters. The broad scope of this technique is due to the synergism between experimental measurements and their interpretation using numerical simulation. Two novel technologies form the substance of this system: the magnetically modulated resistance method for characterizing superconductors; and a powerful new computer peripheral, the Parallel Information Processor card, which provides enhanced computing capability for PC computers. This enhancement allows PC computers to operate at speeds approaching that of supercomputers. This makes atomic scale simulations possible on low cost machines. The present development of this system involves the integration of these two technologies using mesoscale simulations of thin film growth. A future stage of development will incorporate atomic scale modeling.
Parallel reduced-instruction-set-computer architecture for real-time symbolic pattern matching
NASA Astrophysics Data System (ADS)
Parson, Dale E.
1991-03-01
This report discusses ongoing work on a parallel reduced-instruction- set-computer (RISC) architecture for automatic production matching. The PRIOPS compiler takes advantage of the memoryless character of automatic processing by translating a program's collection of automatic production tests into an equivalent combinational circuit-a digital circuit without memory, whose outputs are immediate functions of its inputs. The circuit provides a highly parallel, fine-grain model of automatic matching. The compiler then maps the combinational circuit onto RISC hardware. The heart of the processor is an array of comparators capable of testing production conditions in parallel, Each comparator attaches to private memory that contains virtual circuit nodes-records of the current state of nodes and busses in the combinational circuit. All comparator memories hold identical information, allowing simultaneous update for a single changing circuit node and simultaneous retrieval of different circuit nodes by different comparators. Along with the comparator-based logic unit is a sequencer that determines the current combination of production-derived comparisons to try, based on the combined success and failure of previous combinations of comparisons. The memoryless nature of automatic matching allows the compiler to designate invariant memory addresses for virtual circuit nodes, and to generate the most effective sequences of comparison test combinations. The result is maximal utilization of parallel hardware, indicating speed increases and scalability beyond that found for course-grain, multiprocessor approaches to concurrent Rete matching. Future work will consider application of this RISC architecture to the standard (controlled) Rete algorithm, where search through memory dominates portions of matching.
Computer hardware fault administration
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-09-14
Computer hardware fault administration carried out in a parallel computer, where the parallel computer includes a plurality of compute nodes. The compute nodes are coupled for data communications by at least two independent data communications networks, where each data communications network includes data communications links connected to the compute nodes. Typical embodiments carry out hardware fault administration by identifying a location of a defective link in the first data communications network of the parallel computer and routing communications data around the defective link through the second data communications network of the parallel computer.
Scaling Optimization of the SIESTA MHD Code
NASA Astrophysics Data System (ADS)
Seal, Sudip; Hirshman, Steven; Perumalla, Kalyan
2013-10-01
SIESTA is a parallel three-dimensional plasma equilibrium code capable of resolving magnetic islands at high spatial resolutions for toroidal plasmas. Originally designed to exploit small-scale parallelism, SIESTA has now been scaled to execute efficiently over several thousands of processors P. This scaling improvement was accomplished with minimal intrusion to the execution flow of the original version. First, the efficiency of the iterative solutions was improved by integrating the parallel tridiagonal block solver code BCYCLIC. Krylov-space generation in GMRES was then accelerated using a customized parallel matrix-vector multiplication algorithm. Novel parallel Hessian generation algorithms were integrated and memory access latencies were dramatically reduced through loop nest optimizations and data layout rearrangement. These optimizations sped up equilibria calculations by factors of 30-50. It is possible to compute solutions with granularity N/P near unity on extremely fine radial meshes (N > 1024 points). Grid separation in SIESTA, which manifests itself primarily in the resonant components of the pressure far from rational surfaces, is strongly suppressed by finer meshes. Large problem sizes of up to 300 K simultaneous non-linear coupled equations have been solved on the NERSC supercomputers. Work supported by U.S. DOE under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2014-02-11
Data communications in a parallel active messaging interface ('PAMI') or a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution of a compute node, including specification of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications instruction, the instruction characterized by instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance witht the instruction type, the transfer data from the origin endpoin to the target endpoint.
BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark.
Gulzar, Muhammad Ali; Interlandi, Matteo; Yoo, Seunghyun; Tetali, Sai Deep; Condie, Tyson; Millstein, Todd; Kim, Miryung
2016-05-01
Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today's data-centers is time consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform. This requires re-thinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time consuming for an end user. First, BIGDEBUG's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BIGDEBUG scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BIGDEBUG supports debugging at interactive speeds with minimal performance impact.
High-performance computing — an overview
NASA Astrophysics Data System (ADS)
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
Distributed GPU Computing in GIScience
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strived to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by increasing amount of datasets from different domains, such as social media, earth observation, environmental sensing (Li et al., 2013). Meanwhile CPU-based computing resources structured as cluster or supercomputer is costly. In the past several years with GPU-based technology matured in both the capability and performance, GPU-based computing has emerged as a new computing paradigm. Compare to traditional computing microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding high parallel processing capability with cost-effectiveness and efficiency(Owens et al., 2008), although it is initially designed for graphical rendering in visualization pipe. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within distributed environment. Within this framework, 1) for each single computer, computing resources of both GPU-based and CPU-based can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual super computer to support CPU-based and GPU-based computing in distributed computing environment; 3) GPUs, as a specific graphic targeted device, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference : 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE Transactions on, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
Calculating Potential Energy Curves with Quantum Monte Carlo
NASA Astrophysics Data System (ADS)
Powell, Andrew D.; Dawes, Richard
2014-06-01
Quantum Monte Carlo (QMC) is a computational technique that can be applied to the electronic Schrödinger equation for molecules. QMC methods such as Variational Monte Carlo (VMC) and Diffusion Monte Carlo (DMC) have demonstrated the capability of capturing large fractions of the correlation energy, thus suggesting their possible use for high-accuracy quantum chemistry calculations. QMC methods scale particularly well with respect to parallelization making them an attractive consideration in anticipation of next-generation computing architectures which will involve massive parallelization with millions of cores. Due to the statistical nature of the approach, in contrast to standard quantum chemistry methods, uncertainties (error-bars) are associated with each calculated energy. This study focuses on the cost, feasibility and practical application of calculating potential energy curves for small molecules with QMC methods. Trial wave functions were constructed with the multi-configurational self-consistent field (MCSCF) method from GAMESS-US.[1] The CASINO Monte Carlo quantum chemistry package [2] was used for all of the DMC calculations. An overview of our progress in this direction will be given. References: M. W. Schmidt et al. J. Comput. Chem. 14, 1347 (1993). R. J. Needs et al. J. Phys.: Condensed Matter 22, 023201 (2010).
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scalingmore » laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.« less
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; ...
2017-11-27
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scalingmore » laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.« less
Tavčar, Gregor; Katrašnik, Tomaž
2014-01-01
The parallel straight channel PEM fuel cell model presented in this paper extends the innovative hybrid 3D analytic-numerical (HAN) approach previously published by the authors with capabilities to address ternary diffusion systems and counter-flow configurations. The model's core principle is modelling species transport by obtaining a 2D analytic solution for species concentration distribution in the plane perpendicular to the cannel gas-flow and coupling consecutive 2D solutions by means of a 1D numerical pipe-flow model. Electrochemical and other nonlinear phenomena are coupled to the species transport by a routine that uses derivative approximation with prediction-iteration. The latter is also the core of the counter-flow computation algorithm. A HAN model of a laboratory test fuel cell is presented and evaluated against a professional 3D CFD simulation tool showing very good agreement between results of the presented model and those of the CFD simulation. Furthermore, high accuracy results are achieved at moderate computational times, which is owed to the semi-analytic nature and to the efficient computational coupling of electrochemical kinetics and species transport.
Experimental and Computational Sonic Boom Assessment of Lockheed-Martin N+2 Low Boom Models
NASA Technical Reports Server (NTRS)
Cliff, Susan E.; Durston, Donald A.; Elmiligui, Alaa A.; Walker, Eric L.; Carter, Melissa B.
2015-01-01
Flight at speeds greater than the speed of sound is not permitted over land, primarily because of the noise and structural damage caused by sonic boom pressure waves of supersonic aircraft. Mitigation of sonic boom is a key focus area of the High Speed Project under NASA's Fundamental Aeronautics Program. The project is focusing on technologies to enable future civilian aircraft to fly efficiently with reduced sonic boom, engine and aircraft noise, and emissions. A major objective of the project is to improve both computational and experimental capabilities for design of low-boom, high-efficiency aircraft. NASA and industry partners are developing improved wind tunnel testing techniques and new pressure instrumentation to measure the weak sonic boom pressure signatures of modern vehicle concepts. In parallel, computational methods are being developed to provide rapid design and analysis of supersonic aircraft with improved meshing techniques that provide efficient, robust, and accurate on- and off-body pressures at several body lengths from vehicles with very low sonic boom overpressures. The maturity of these critical parallel efforts is necessary before low-boom flight can be demonstrated and commercial supersonic flight can be realized.
A parallel expert system for the control of a robotic air vehicle
NASA Technical Reports Server (NTRS)
Shakley, Donald; Lamont, Gary B.
1988-01-01
Expert systems can be used to govern the intelligent control of vehicles, for example the Robotic Air Vehicle (RAV). Due to the nature of the RAV system the associated expert system needs to perform in a demanding real-time environment. The use of a parallel processing capability to support the associated expert system's computational requirement is critical in this application. Thus, algorithms for parallel real-time expert systems must be designed, analyzed, and synthesized. The design process incorporates a consideration of the rule-set/face-set size along with representation issues. These issues are looked at in reference to information movement and various inference mechanisms. Also examined is the process involved with transporting the RAV expert system functions from the TI Explorer, where they are implemented in the Automated Reasoning Tool (ART), to the iPSC Hypercube, where the system is synthesized using Concurrent Common LISP (CCLISP). The transformation process for the ART to CCLISP conversion is described. The performance characteristics of the parallel implementation of these expert systems on the iPSC Hypercube are compared to the TI Explorer implementation.
Multibus-based parallel processor for simulation
NASA Technical Reports Server (NTRS)
Ogrady, E. P.; Wang, C.-H.
1983-01-01
A Multibus-based parallel processor simulation system is described. The system is intended to serve as a vehicle for gaining hands-on experience, testing system and application software, and evaluating parallel processor performance during development of a larger system based on the horizontal/vertical-bus interprocessor communication mechanism. The prototype system consists of up to seven Intel iSBC 86/12A single-board computers which serve as processing elements, a multiple transmission controller (MTC) designed to support system operation, and an Intel Model 225 Microcomputer Development System which serves as the user interface and input/output processor. All components are interconnected by a Multibus/IEEE 796 bus. An important characteristic of the system is that it provides a mechanism for a processing element to broadcast data to other selected processing elements. This parallel transfer capability is provided through the design of the MTC and a minor modification to the iSBC 86/12A board. The operation of the MTC, the basic hardware-level operation of the system, and pertinent details about the iSBC 86/12A and the Multibus are described.
Corral framework: Trustworthy and fully functional data intensive parallel astronomical pipelines
NASA Astrophysics Data System (ADS)
Cabral, J. B.; Sánchez, B.; Beroiz, M.; Domínguez, M.; Lares, M.; Gurovich, S.; Granitto, P.
2017-07-01
Data processing pipelines represent an important slice of the astronomical software library that include chains of processes that transform raw data into valuable information via data reduction and analysis. In this work we present Corral, a Python framework for astronomical pipeline generation. Corral features a Model-View-Controller design pattern on top of an SQL Relational Database capable of handling: custom data models; processing stages; and communication alerts, and also provides automatic quality and structural metrics based on unit testing. The Model-View-Controller provides concept separation between the user logic and the data models, delivering at the same time multi-processing and distributed computing capabilities. Corral represents an improvement over commonly found data processing pipelines in astronomysince the design pattern eases the programmer from dealing with processing flow and parallelization issues, allowing them to focus on the specific algorithms needed for the successive data transformations and at the same time provides a broad measure of quality over the created pipeline. Corral and working examples of pipelines that use it are available to the community at https://github.com/toros-astro.
NASA Astrophysics Data System (ADS)
Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos
1990-03-01
The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
Parallel family trees for transfer matrices in the Potts model
NASA Astrophysics Data System (ADS)
Navarro, Cristobal A.; Canfora, Fabrizio; Hitschfeld, Nancy; Navarro, Gonzalo
2015-02-01
The computational cost of transfer matrix methods for the Potts model is related to the question in how many ways can two layers of a lattice be connected? Answering the question leads to the generation of a combinatorial set of lattice configurations. This set defines the configuration space of the problem, and the smaller it is, the faster the transfer matrix can be computed. The configuration space of generic (q , v) transfer matrix methods for strips is in the order of the Catalan numbers, which grows asymptotically as O(4m) where m is the width of the strip. Other transfer matrix methods with a smaller configuration space indeed exist but they make assumptions on the temperature, number of spin states, or restrict the structure of the lattice. In this paper we propose a parallel algorithm that uses a sub-Catalan configuration space of O(3m) to build the generic (q , v) transfer matrix in a compressed form. The improvement is achieved by grouping the original set of Catalan configurations into a forest of family trees, in such a way that the solution to the problem is now computed by solving the root node of each family. As a result, the algorithm becomes exponentially faster than the Catalan approach while still highly parallel. The resulting matrix is stored in a compressed form using O(3m ×4m) of space, making numerical evaluation and decompression to be faster than evaluating the matrix in its O(4m ×4m) uncompressed form. Experimental results for different sizes of strip lattices show that the parallel family trees (PFT) strategy indeed runs exponentially faster than the Catalan Parallel Method (CPM), especially when dealing with dense transfer matrices. In terms of parallel performance, we report strong-scaling speedups of up to 5.7 × when running on an 8-core shared memory machine and 28 × for a 32-core cluster. The best balance of speedup and efficiency for the multi-core machine was achieved when using p = 4 processors, while for the cluster scenario it was in the range p ∈ [ 8 , 10 ] . Because of the parallel capabilities of the algorithm, a large-scale execution of the parallel family trees strategy in a supercomputer could contribute to the study of wider strip lattices.
Accelerating epistasis analysis in human genetics with consumer graphics hardware.
Sinnott-Armstrong, Nicholas A; Greene, Casey S; Cancare, Fabio; Moore, Jason H
2009-07-24
Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR) is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs) have more memory bandwidth and computational capability than Central Processing Units (CPUs) and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions. We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. Furthermore this GPU system provides extremely cost effective performance while leaving the CPU available for other tasks. The GPU workstation containing three GPUs costs $2000 while obtaining similar performance on a Beowulf cluster requires 150 CPU cores which, including the added infrastructure and support cost of the cluster system, cost approximately $82,500. Graphics hardware based computing provides a cost effective means to perform genetic analysis of epistasis using MDR on large datasets without the infrastructure of a computing cluster.
TU-AB-BRC-12: Optimized Parallel MonteCarlo Dose Calculations for Secondary MU Checks
DOE Office of Scientific and Technical Information (OSTI.GOV)
French, S; Nazareth, D; Bellor, M
Purpose: Secondary MU checks are an important tool used during a physics review of a treatment plan. Commercial software packages offer varying degrees of theoretical dose calculation accuracy, depending on the modality involved. Dose calculations of VMAT plans are especially prone to error due to the large approximations involved. Monte Carlo (MC) methods are not commonly used due to their long run times. We investigated two methods to increase the computational efficiency of MC dose simulations with the BEAMnrc code. Distributed computing resources, along with optimized code compilation, will allow for accurate and efficient VMAT dose calculations. Methods: The BEAMnrcmore » package was installed on a high performance computing cluster accessible to our clinic. MATLAB and PYTHON scripts were developed to convert a clinical VMAT DICOM plan into BEAMnrc input files. The BEAMnrc installation was optimized by running the VMAT simulations through profiling tools which indicated the behavior of the constituent routines in the code, e.g. the bremsstrahlung splitting routine, and the specified random number generator. This information aided in determining the most efficient compiling parallel configuration for the specific CPU’s available on our cluster, resulting in the fastest VMAT simulation times. Our method was evaluated with calculations involving 10{sup 8} – 10{sup 9} particle histories which are sufficient to verify patient dose using VMAT. Results: Parallelization allowed the calculation of patient dose on the order of 10 – 15 hours with 100 parallel jobs. Due to the compiler optimization process, further speed increases of 23% were achieved when compared with the open-source compiler BEAMnrc packages. Conclusion: Analysis of the BEAMnrc code allowed us to optimize the compiler configuration for VMAT dose calculations. In future work, the optimized MC code, in conjunction with the parallel processing capabilities of BEAMnrc, will be applied to provide accurate and efficient secondary MU checks.« less
Feasibility of optically interconnected parallel processors using wavelength division multiplexing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deri, R.J.; De Groot, A.J.; Haigh, R.E.
1996-03-01
New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today`s electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little informationmore » is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to {approx}150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs. , 10 figs.« less
TOUGH3: A new efficient version of the TOUGH suite of multiphase flow and transport simulators
NASA Astrophysics Data System (ADS)
Jung, Yoojin; Pau, George Shu Heng; Finsterle, Stefan; Pollyea, Ryan M.
2017-11-01
The TOUGH suite of nonisothermal multiphase flow and transport simulators has been updated by various developers over many years to address a vast range of challenging subsurface problems. The increasing complexity of the simulated processes as well as the growing size of model domains that need to be handled call for an improvement in the simulator's computational robustness and efficiency. Moreover, modifications have been frequently introduced independently, resulting in multiple versions of TOUGH that (1) led to inconsistencies in feature implementation and usage, (2) made code maintenance and development inefficient, and (3) caused confusion to users and developers. TOUGH3-a new base version of TOUGH-addresses these issues. It consolidates both the serial (TOUGH2 V2.1) and parallel (TOUGH2-MP V2.0) implementations, enabling simulations to be performed on desktop computers and supercomputers using a single code. New PETSc parallel linear solvers are added to the existing serial solvers of TOUGH2 and the Aztec solver used in TOUGH2-MP. The PETSc solvers generally perform better than the Aztec solvers in parallel and the internal TOUGH3 linear solver in serial. TOUGH3 also incorporates many new features, addresses bugs, and improves the flexibility of data handling. Due to the improved capabilities and usability, TOUGH3 is more robust and efficient for solving tough and computationally demanding problems in diverse scientific and practical applications related to subsurface flow modeling.
Legacy model integration for enhancing hydrologic interdisciplinary research
NASA Astrophysics Data System (ADS)
Dozier, A.; Arabi, M.; David, O.
2013-12-01
Many challenges are introduced to interdisciplinary research in and around the hydrologic science community due to advances in computing technology and modeling capabilities in different programming languages, across different platforms and frameworks by researchers in a variety of fields with a variety of experience in computer programming. Many new hydrologic models as well as optimization, parameter estimation, and uncertainty characterization techniques are developed in scripting languages such as Matlab, R, Python, or in newer languages such as Java and the .Net languages, whereas many legacy models have been written in FORTRAN and C, which complicates inter-model communication for two-way feedbacks. However, most hydrologic researchers and industry personnel have little knowledge of the computing technologies that are available to address the model integration process. Therefore, the goal of this study is to address these new challenges by utilizing a novel approach based on a publish-subscribe-type system to enhance modeling capabilities of legacy socio-economic, hydrologic, and ecologic software. Enhancements include massive parallelization of executions and access to legacy model variables at any point during the simulation process by another program without having to compile all the models together into an inseparable 'super-model'. Thus, this study provides two-way feedback mechanisms between multiple different process models that can be written in various programming languages and can run on different machines and operating systems. Additionally, a level of abstraction is given to the model integration process that allows researchers and other technical personnel to perform more detailed and interactive modeling, visualization, optimization, calibration, and uncertainty analysis without requiring deep understanding of inter-process communication. To be compatible, a program must be written in a programming language with bindings to a common implementation of the message passing interface (MPI), which includes FORTRAN, C, Java, the .NET languages, Python, R, Matlab, and many others. The system is tested on a longstanding legacy hydrologic model, the Soil and Water Assessment Tool (SWAT), to observe and enhance speed-up capabilities for various optimization, parameter estimation, and model uncertainty characterization techniques, which is particularly important for computationally intensive hydrologic simulations. Initial results indicate that the legacy extension system significantly decreases developer time, computation time, and the cost of purchasing commercial parallel processing licenses, while enhancing interdisciplinary research by providing detailed two-way feedback mechanisms between various process models with minimal changes to legacy code.
NASA Astrophysics Data System (ADS)
Lee, Eun Seok
2000-10-01
An improved aerodynamics performance of a turbine cascade shape can be achieved by an understanding of the flow-field associated with the stator-rotor interaction. In this research, an axial gas turbine airfoil cascade shape is optimized for improved aerodynamic performance by using an unsteady Navier-Stokes solver and a parallel genetic algorithm. The objective of the research is twofold: (1) to develop a computational fluid dynamics code having faster convergence rate and unsteady flow simulation capabilities, and (2) to optimize a turbine airfoil cascade shape with unsteady passing wakes for improved aerodynamic performance. The computer code solves the Reynolds averaged Navier-Stokes equations. It is based on the explicit, finite difference, Runge-Kutta time marching scheme and the Diagonalized Alternating Direction Implicit (DADI) scheme, with the Baldwin-Lomax algebraic and k-epsilon turbulence modeling. Improvements in the code focused on the cascade shape design capability, convergence acceleration and unsteady formulation. First, the inverse shape design method was implemented in the code to provide the design capability, where a surface transpiration concept was employed as an inverse technique to modify the geometry satisfying the user specified pressure distribution on the airfoil surface. Second, an approximation storage multigrid method was implemented as an acceleration technique. Third, the preconditioning method was adopted to speed up the convergence rate in solving the low Mach number flows. Finally, the implicit dual time stepping method was incorporated in order to simulate the unsteady flow-fields. For the unsteady code validation, the Stokes's 2nd problem and the Poiseuille flow were chosen and compared with the computed results and analytic solutions. To test the code's ability to capture the natural unsteady flow phenomena, vortex shedding past a cylinder and the shock oscillation over a bicircular airfoil were simulated and compared with experiments and other research results. The rotor cascade shape optimization with unsteady passing wakes was performed to obtain an improved aerodynamic performance using the unsteady Navier-Stokes solver. Two objective functions were defined as minimization of total pressure loss and maximization of lift, while the mass flow rate was fixed. A parallel genetic algorithm was used as an optimizer and the penalty method was introduced. Each individual's objective function was computed simultaneously by using a 32 processor distributed memory computer. One optimization took about four days.
The TeraShake Computational Platform for Large-Scale Earthquake Simulations
NASA Astrophysics Data System (ADS)
Cui, Yifeng; Olsen, Kim; Chourasia, Amit; Moore, Reagan; Maechling, Philip; Jordan, Thomas
Geoscientific and computer science researchers with the Southern California Earthquake Center (SCEC) are conducting a large-scale, physics-based, computationally demanding earthquake system science research program with the goal of developing predictive models of earthquake processes. The computational demands of this program continue to increase rapidly as these researchers seek to perform physics-based numerical simulations of earthquake processes for larger meet the needs of this research program, a multiple-institution team coordinated by SCEC has integrated several scientific codes into a numerical modeling-based research tool we call the TeraShake computational platform (TSCP). A central component in the TSCP is a highly scalable earthquake wave propagation simulation program called the TeraShake anelastic wave propagation (TS-AWP) code. In this chapter, we describe how we extended an existing, stand-alone, wellvalidated, finite-difference, anelastic wave propagation modeling code into the highly scalable and widely used TS-AWP and then integrated this code into the TeraShake computational platform that provides end-to-end (initialization to analysis) research capabilities. We also describe the techniques used to enhance the TS-AWP parallel performance on TeraGrid supercomputers, as well as the TeraShake simulations phases including input preparation, run time, data archive management, and visualization. As a result of our efforts to improve its parallel efficiency, the TS-AWP has now shown highly efficient strong scaling on over 40K processors on IBM’s BlueGene/L Watson computer. In addition, the TSCP has developed into a computational system that is useful to many members of the SCEC community for performing large-scale earthquake simulations.
Computer-aided mechanogenesis of skeletal muscle organs from single cells in vitro
NASA Technical Reports Server (NTRS)
Vanderburgh, Herman H.; Swasdison, Somporn; Karlisch, Patricia
1991-01-01
Complex mechanical forces generated in the growing embryo play an important role in organogenesis. Computerized application of similar forces to differentiating skeletal muscle myoblasts in vitro generate three dimensional artificial muscle organs. These organs contain parallel networks of long unbranched myofibers organized into fascicle-like structures. Tendon development is initiated and the muscles are capable of performing directed, functional work. Kinetically engineered organs provide a new method for studying the growth and development of normal and diseased skeletal muscle.
Computer aided mechanogenesis of skeletal muscle organs from single cells in vitro
NASA Technical Reports Server (NTRS)
Vandenburgh, Herman H.; Swasdison, Somporn; Karlisch, Patricia
1990-01-01
Complex mechanical forces generated in the growing embryo play an important role in organogenesis. Computerized application of similar forces to differentiating skeletal muscle myoblasts in vitro generate three dimensional artificial muscle organs. These organs contain parallel networks of long unbranched myofibers organized into fascicle-like structures. Tendon development is initiated and the muscles are capable of performing directed, functional work. Kinetically engineered organs provide a new method for studying the growth and development of normal and diseased skeletal muscle.
NASA Astrophysics Data System (ADS)
Konishi, Tsuyoshi; Tanida, Jun; Ichioka, Yoshiki
1995-06-01
A novel technique, the visual-area coding technique (VACT), for the optical implementation of fuzzy logic with the capability of visualization of the results is presented. This technique is based on the microfont method and is considered to be an instance of digitized analog optical computing. Huge amounts of data can be processed in fuzzy logic with the VACT. In addition, real-time visualization of the processed result can be accomplished.
Unstructured mesh algorithms for aerodynamic calculations
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.
1992-01-01
The use of unstructured mesh techniques for solving complex aerodynamic flows is discussed. The principle advantages of unstructured mesh strategies, as they relate to complex geometries, adaptive meshing capabilities, and parallel processing are emphasized. The various aspects required for the efficient and accurate solution of aerodynamic flows are addressed. These include mesh generation, mesh adaptivity, solution algorithms, convergence acceleration, and turbulence modeling. Computations of viscous turbulent two-dimensional flows and inviscid three-dimensional flows about complex configurations are demonstrated. Remaining obstacles and directions for future research are also outlined.
An Object Oriented Extensible Architecture for Affordable Aerospace Propulsion Systems
NASA Technical Reports Server (NTRS)
Follen, Gregory J.
2003-01-01
Driven by a need to explore and develop propulsion systems that exceeded current computing capabilities, NASA Glenn embarked on a novel strategy leading to the development of an architecture that enables propulsion simulations never thought possible before. Full engine 3 Dimensional Computational Fluid Dynamic propulsion system simulations were deemed impossible due to the impracticality of the hardware and software computing systems required. However, with a software paradigm shift and an embracing of parallel and distributed processing, an architecture was designed to meet the needs of future propulsion system modeling. The author suggests that the architecture designed at the NASA Glenn Research Center for propulsion system modeling has potential for impacting the direction of development of affordable weapons systems currently under consideration by the Applied Vehicle Technology Panel (AVT).
Tempest: GPU-CPU computing for high-throughput database spectral matching.
Milloy, Jeffrey A; Faherty, Brendan K; Gerber, Scott A
2012-07-06
Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphical processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and its comparison to experimental spectra, is conducted on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new "Accelerated Score" for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to that of the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer.
Xyce Parallel Electronic Simulator Users Guide Version 6.2.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. Trademarks The information herein is subject to change without notice. Copyright c 2002-2014 Sandia Corporation. All rights reserved. Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Portions of the Xyce TM code are: Copyright c 2002, The Regents of the University of California. Produced at the Lawrence Livermore National Laboratory. Written by Alan Hindmarsh, Allan Taylor, Radu Serban. UCRL-CODE-2002-59 All rights reserved. Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence Design Systems, Inc. Microsoft, Windows and Windows 7 are registered trademarks of Microsoft Corporation. Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation. Amtec and TecPlot are trademarks of Amtec Engineering, Inc. Xyce 's expression library is based on that inside Spice 3F5 developed by the EECS Department at the University of California. The EKV3 MOSFET model was developed by the EKV Team of the Electronics Laboratory-TUC of the Technical University of Crete. All other trademarks are property of their respective owners. Contacts Bug Reports (Sandia only) http://joseki.sandia.gov/bugzilla http://charleston.sandia.gov/bugzilla World Wide Web http://xyce.sandia.gov http://charleston.sandia.gov/xyce (Sandia only) Email xyce@sandia.gov (outside Sandia) xyce-sandia@sandia.gov (Sandia only)« less
Xyce Parallel Electronic Simulator Users Guide Version 6.4
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. Trademarks The information herein is subject to change without notice. Copyright c 2002-2015 Sandia Corporation. All rights reserved. Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Portions of the Xyce TM code are: Copyright c 2002, The Regents of the University of California. Produced at the Lawrence Livermore National Laboratory. Written by Alan Hindmarsh, Allan Taylor, Radu Serban. UCRL-CODE-2002-59 All rights reserved. Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence Design Systems, Inc. Microsoft, Windows and Windows 7 are registered trademarks of Microsoft Corporation. Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation. Amtec and TecPlot are trademarks of Amtec Engineering, Inc. Xyce 's expression library is based on that inside Spice 3F5 developed by the EECS Department at the University of California. The EKV3 MOSFET model was developed by the EKV Team of the Electronics Laboratory-TUC of the Technical University of Crete. All other trademarks are property of their respective owners. Contacts Bug Reports (Sandia only) http://joseki.sandia.gov/bugzilla http://charleston.sandia.gov/bugzilla World Wide Web http://xyce.sandia.gov http://charleston.sandia.gov/xyce (Sandia only) Email xyce@sandia.gov (outside Sandia) xyce-sandia@sandia.gov (Sandia only)« less
Parallel image registration with a thin client interface
NASA Astrophysics Data System (ADS)
Saiprasad, Ganesh; Lo, Yi-Jung; Plishker, William; Lei, Peng; Ahmad, Tabassum; Shekhar, Raj
2010-03-01
Despite its high significance, the clinical utilization of image registration remains limited because of its lengthy execution time and a lack of easy access. The focus of this work was twofold. First, we accelerated our course-to-fine, volume subdivision-based image registration algorithm by a novel parallel implementation that maintains the accuracy of our uniprocessor implementation. Second, we developed a thin-client computing model with a user-friendly interface to perform rigid and nonrigid image registration. Our novel parallel computing model uses the message passing interface model on a 32-core cluster. The results show that, compared with the uniprocessor implementation, the parallel implementation of our image registration algorithm is approximately 5 times faster for rigid image registration and approximately 9 times faster for nonrigid registration for the images used. To test the viability of such systems for clinical use, we developed a thin client in the form of a plug-in in OsiriX, a well-known open source PACS workstation and DICOM viewer, and used it for two applications. The first application registered the baseline and follow-up MR brain images, whose subtraction was used to track progression of multiple sclerosis. The second application registered pretreatment PET and intratreatment CT of radiofrequency ablation patients to demonstrate a new capability of multimodality imaging guidance. The registration acceleration coupled with the remote implementation using a thin client should ultimately increase accuracy, speed, and access of image registration-based interpretations in a number of diagnostic and interventional applications.
High-performance parallel analysis of coupled problems for aircraft propulsion
NASA Technical Reports Server (NTRS)
Felippa, C. A.; Farhat, C.; Chen, P.-S.; Gumaste, U.; Leoinne, M.; Stern, P.
1995-01-01
This research program deals with the application of high-performance computing methods to the numerical simulation of complete jet engines. The program was initiated in 1993 by applying two-dimensional parallel aeroelastic codes to the interior gas flow problem of a by-pass jet engine. The fluid mesh generation, domain decomposition and solution capabilities were successfully tested. Attention was then focused on methodology for the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion driven by these structural displacements. The latter is treated by an ALE technique that models the fluid mesh motion as that of a fictitious mechanical network laid along the edges of near-field fluid elements. New partitioned analysis procedures to treat this coupled 3-component problem were developed in 1994. These procedures involved delayed corrections and subcycling, and have been successfully tested on several massively parallel computers. For the global steady-state axisymmetric analysis of a complete engine we have decided to use the NASA-sponsored ENG10 program, which uses a regular FV-multiblock-grid discretization in conjunction with circumferential averaging to include effects of blade forces, loss, combustor heat addition, blockage, bleeds and convective mixing. A load-balancing preprocessor for parallel versions of ENG10 has been developed. It is planned to use the steady-state global solution provided by ENG10 as input to a localized three-dimensional FSI analysis for engine regions where aeroelastic effects may be important.
NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
NASA Astrophysics Data System (ADS)
Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A.
2010-09-01
The latest release of NWChem delivers an open-source computational chemistry package with extensive capabilities for large scale simulations of chemical and biological systems. Utilizing a common computational framework, diverse theoretical descriptions can be used to provide the best solution for a given scientific problem. Scalable parallel implementations and modular software design enable efficient utilization of current computational architectures. This paper provides an overview of NWChem focusing primarily on the core theoretical modules provided by the code and their parallel performance. Program summaryProgram title: NWChem Catalogue identifier: AEGI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGI_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Open Source Educational Community License No. of lines in distributed program, including test data, etc.: 11 709 543 No. of bytes in distributed program, including test data, etc.: 680 696 106 Distribution format: tar.gz Programming language: Fortran 77, C Computer: all Linux based workstations and parallel supercomputers, Windows and Apple machines Operating system: Linux, OS X, Windows Has the code been vectorised or parallelized?: Code is parallelized Classification: 2.1, 2.2, 3, 7.3, 7.7, 16.1, 16.2, 16.3, 16.10, 16.13 Nature of problem: Large-scale atomistic simulations of chemical and biological systems require efficient and reliable methods for ground and excited solutions of many-electron Hamiltonian, analysis of the potential energy surface, and dynamics. Solution method: Ground and excited solutions of many-electron Hamiltonian are obtained utilizing density-functional theory, many-body perturbation approach, and coupled cluster expansion. These solutions or a combination thereof with classical descriptions are then used to analyze potential energy surface and perform dynamical simulations. Additional comments: Full documentation is provided in the distribution file. This includes an INSTALL file giving details of how to build the package. A set of test runs is provided in the examples directory. The distribution file for this program is over 90 Mbytes and therefore is not delivered directly when download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. Running time: Running time depends on the size of the chemical system, complexity of the method, number of cpu's and the computational task. It ranges from several seconds for serial DFT energy calculations on a few atoms to several hours for parallel coupled cluster energy calculations on tens of atoms or ab-initio molecular dynamics simulation on hundreds of atoms.
The 2nd Symposium on the Frontiers of Massively Parallel Computations
NASA Technical Reports Server (NTRS)
Mills, Ronnie (Editor)
1988-01-01
Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.
An object-oriented, coprocessor-accelerated model for ice sheet simulations
NASA Astrophysics Data System (ADS)
Seddik, H.; Greve, R.
2013-12-01
Recently, numerous models capable of modeling the thermo-dynamics of ice sheets have been developed within the ice sheet modeling community. Their capabilities have been characterized by a wide range of features with different numerical methods (finite difference or finite element), different implementations of the ice flow mechanics (shallow-ice, higher-order, full Stokes) and different treatments for the basal and coastal areas (basal hydrology, basal sliding, ice shelves). Shallow-ice models (SICOPOLIS, IcIES, PISM, etc) have been widely used for modeling whole ice sheets (Greenland and Antarctica) due to the relatively low computational cost of the shallow-ice approximation but higher order (ISSM, AIF) and full Stokes (Elmer/Ice) models have been recently used to model the Greenland ice sheet. The advance in processor speed and the decrease in cost for accessing large amount of memory and storage have undoubtedly been the driving force in the commoditization of models with higher capabilities, and the popularity of Elmer/Ice (http://elmerice.elmerfem.com) with an active user base is a notable representation of this trend. Elmer/Ice is a full Stokes model built on top of the multi-physics package Elmer (http://www.csc.fi/english/pages/elmer) which provides the full machinery for the complex finite element procedure and is fully parallel (mesh partitioning with OpenMPI communication). Elmer is mainly written in Fortran 90 and targets essentially traditional processors as the code base was not initially written to run on modern coprocessors (yet adding support for the recently introduced x86 based coprocessors is possible). Furthermore, a truly modular and object-oriented implementation is required for quick adaptation to fast evolving capabilities in hardware (Fortran 2003 provides an object-oriented programming model while not being clean and requiring a tricky refactoring of Elmer code). In this work, the object-oriented, coprocessor-accelerated finite element code Sainou is introduced. Sainou is an Elmer fork which is reimplemented in Objective C and used for experimenting with ice sheet models running on coprocessors, essentially GPU devices. GPUs are highly parallel processors that provide opportunities for fine-grained parallelization of the full Stokes problem using the standard OpenCL language (http://www.khronos.org/opencl/) to access the device. Sainou is built upon a collection of Objective C base classes that service a modular kernel (itself a base class) which provides the core methods to solve the finite element problem. An early implementation of Sainou will be presented with emphasis on the object architecture and the strategies of parallelizations. The computation of a simple heat conduction problem is used to test the implementation which also provides experimental support for running the global matrix assembly on GPU.
NASA Astrophysics Data System (ADS)
Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran
2017-03-01
In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on state of the art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand of compute time, memory, storage and I/O along with the need of their effective management. The most resource intensive modules of the algorithm are traveltime calculations and migration summation which exhibit an inherent trade off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel system had been developed. Recently, we have worked on improving parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable to efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-11-12
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB
NASA Technical Reports Server (NTRS)
Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.
2017-01-01
Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
Performance Evaluation in Network-Based Parallel Computing
NASA Technical Reports Server (NTRS)
Dezhgosha, Kamyar
1996-01-01
Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.
FPGA-Based Stochastic Echo State Networks for Time-Series Forecasting.
Alomar, Miquel L; Canals, Vincent; Perez-Mora, Nicolas; Martínez-Moll, Víctor; Rosselló, Josep L
2016-01-01
Hardware implementation of artificial neural networks (ANNs) allows exploiting the inherent parallelism of these systems. Nevertheless, they require a large amount of resources in terms of area and power dissipation. Recently, Reservoir Computing (RC) has arisen as a strategic technique to design recurrent neural networks (RNNs) with simple learning capabilities. In this work, we show a new approach to implement RC systems with digital gates. The proposed method is based on the use of probabilistic computing concepts to reduce the hardware required to implement different arithmetic operations. The result is the development of a highly functional system with low hardware resources. The presented methodology is applied to chaotic time-series forecasting.
Time-Dependent Simulation of Incompressible Flow in a Turbopump Using Overset Grid Approach
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Kwak, Dochan
2001-01-01
This paper reports the progress being made towards complete unsteady turbopump simulation capability by using overset grid systems. A computational model of a turbo-pump impeller is used as a test case for the performance evaluation of the MPI, hybrid MPI/Open-MP, and MLP versions of the INS3D code. Relative motion of the grid system for rotor-stator interaction was obtained by employing overset grid techniques. Unsteady computations for a turbo-pump, which contains 114 zones with 34.3 Million grid points, are performed on Origin 2000 systems at NASA Ames Research Center. The approach taken for these simulations, and the performance of the parallel versions of the code are presented.
FPGA-Based Stochastic Echo State Networks for Time-Series Forecasting
Alomar, Miquel L.; Canals, Vincent; Perez-Mora, Nicolas; Martínez-Moll, Víctor; Rosselló, Josep L.
2016-01-01
Hardware implementation of artificial neural networks (ANNs) allows exploiting the inherent parallelism of these systems. Nevertheless, they require a large amount of resources in terms of area and power dissipation. Recently, Reservoir Computing (RC) has arisen as a strategic technique to design recurrent neural networks (RNNs) with simple learning capabilities. In this work, we show a new approach to implement RC systems with digital gates. The proposed method is based on the use of probabilistic computing concepts to reduce the hardware required to implement different arithmetic operations. The result is the development of a highly functional system with low hardware resources. The presented methodology is applied to chaotic time-series forecasting. PMID:26880876
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fu, Pengchen; Settgast, Randolph R.; Johnson, Scott M.
2014-12-17
GEOS is a massively parallel, multi-physics simulation application utilizing high performance computing (HPC) to address subsurface reservoir stimulation activities with the goal of optimizing current operations and evaluating innovative stimulation methods. GEOS enables coupling of di erent solvers associated with the various physical processes occurring during reservoir stimulation in unique and sophisticated ways, adapted to various geologic settings, materials and stimulation methods. Developed at the Lawrence Livermore National Laboratory (LLNL) as a part of a Laboratory-Directed Research and Development (LDRD) Strategic Initiative (SI) project, GEOS represents the culmination of a multi-year ongoing code development and improvement e ort that hasmore » leveraged existing code capabilities and sta expertise to design new computational geosciences software.« less
Real-Time Cognitive Computing Architecture for Data Fusion in a Dynamic Environment
NASA Technical Reports Server (NTRS)
Duong, Tuan A.; Duong, Vu A.
2012-01-01
A novel cognitive computing architecture is conceptualized for processing multiple channels of multi-modal sensory data streams simultaneously, and fusing the information in real time to generate intelligent reaction sequences. This unique architecture is capable of assimilating parallel data streams that could be analog, digital, synchronous/asynchronous, and could be programmed to act as a knowledge synthesizer and/or an "intelligent perception" processor. In this architecture, the bio-inspired models of visual pathway and olfactory receptor processing are combined as processing components, to achieve the composite function of "searching for a source of food while avoiding the predator." The architecture is particularly suited for scene analysis from visual data and odorant.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, Brian M.; Ebeida, Mohamed Salah; Eldred, Michael S.
The Dakota (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a exible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quanti cation with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components requiredmore » for iterative systems analyses, the Dakota toolkit provides a exible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user's manual for the Dakota software and provides capability overviews and procedures for software execution, as well as a variety of example studies.« less
A Low-Power High-Speed Smart Sensor Design for Space Exploration Missions
NASA Technical Reports Server (NTRS)
Fang, Wai-Chi
1997-01-01
A low-power high-speed smart sensor system based on a large format active pixel sensor (APS) integrated with a programmable neural processor for space exploration missions is presented. The concept of building an advanced smart sensing system is demonstrated by a system-level microchip design that is composed with an APS sensor, a programmable neural processor, and an embedded microprocessor in a SOI CMOS technology. This ultra-fast smart sensor system-on-a-chip design mimics what is inherent in biological vision systems. Moreover, it is programmable and capable of performing ultra-fast machine vision processing in all levels such as image acquisition, image fusion, image analysis, scene interpretation, and control functions. The system provides about one tera-operation-per-second computing power which is a two order-of-magnitude increase over that of state-of-the-art microcomputers. Its high performance is due to massively parallel computing structures, high data throughput rates, fast learning capabilities, and advanced VLSI system-on-a-chip implementation.
Parallel Computing Using Web Servers and "Servlets".
ERIC Educational Resources Information Center
Lo, Alfred; Bloor, Chris; Choi, Y. K.
2000-01-01
Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…
NASA Astrophysics Data System (ADS)
O'Connor, A. S.; Justice, B.; Harris, A. T.
2013-12-01
Graphics Processing Units (GPUs) are high-performance multiple-core processors capable of very high computational speeds and large data throughput. Modern GPUs are inexpensive and widely available commercially. These are general-purpose parallel processors with support for a variety of programming interfaces, including industry standard languages such as C. GPU implementations of algorithms that are well suited for parallel processing can often achieve speedups of several orders of magnitude over optimized CPU codes. Significant improvements in speeds for imagery orthorectification, atmospheric correction, target detection and image transformations like Independent Components Analsyis (ICA) have been achieved using GPU-based implementations. Additional optimizations, when factored in with GPU processing capabilities, can provide 50x - 100x reduction in the time required to process large imagery. Exelis Visual Information Solutions (VIS) has implemented a CUDA based GPU processing frame work for accelerating ENVI and IDL processes that can best take advantage of parallelization. Testing Exelis VIS has performed shows that orthorectification can take as long as two hours with a WorldView1 35,0000 x 35,000 pixel image. With GPU orthorecification, the same orthorectification process takes three minutes. By speeding up image processing, imagery can successfully be used by first responders, scientists making rapid discoveries with near real time data, and provides an operational component to data centers needing to quickly process and disseminate data.
A class of parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Parallel computing of a climate model on the dawn 1000 by domain decomposition method
NASA Astrophysics Data System (ADS)
Bi, Xunqiang
1997-12-01
In this paper the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massive parallel computer made by National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. The potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputation are also discussed.
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, Kwan-Liu
Most of today’s visualization libraries and applications are based off of what is known today as the visualization pipeline. In the visualization pipeline model, algorithms are encapsulated as “filtering” components with inputs and outputs. These components can be combined by connecting the outputs of one filter to the inputs of another filter. The visualization pipeline model is popular because it provides a convenient abstraction that allows users to combine algorithms in powerful ways. Unfortunately, the visualization pipeline cannot run effectively on exascale computers. Experts agree that the exascale machine will comprise processors that contain many cores. Furthermore, physical limitations willmore » prevent data movement in and out of the chip (that is, between main memory and the processing cores) from keeping pace with improvements in overall compute performance. To use these processors to their fullest capability, it is essential to carefully consider memory access. This is where the visualization pipeline fails. Each filtering component in the visualization library is expected to take a data set in its entirety, perform some computation across all of the elements, and output the complete results. The process of iterating over all elements must be repeated in each filter, which is one of the worst possible ways to traverse memory when trying to maximize the number of executions per memory access. This project investigates a new type of visualization framework that exhibits a pervasive parallelism necessary to run on exascale machines. Our framework achieves this by defining algorithms in terms of functors, which are localized, stateless operations. Functors can be composited in much the same way as filters in the visualization pipeline. But, functors’ design allows them to be concurrently running on massive amounts of lightweight threads. Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale computer. This project concludes with a functional prototype containing pervasively parallel algorithms that perform demonstratively well on many-core processors. These algorithms are fundamental for performing data analysis and visualization at extreme scale.« less
A Gateway for Phylogenetic Analysis Powered by Grid Computing Featuring GARLI 2.0
Bazinet, Adam L.; Zwickl, Derrick J.; Cummings, Michael P.
2014-01-01
We introduce molecularevolution.org, a publicly available gateway for high-throughput, maximum-likelihood phylogenetic analysis powered by grid computing. The gateway features a garli 2.0 web service that enables a user to quickly and easily submit thousands of maximum likelihood tree searches or bootstrap searches that are executed in parallel on distributed computing resources. The garli web service allows one to easily specify partitioned substitution models using a graphical interface, and it performs sophisticated post-processing of phylogenetic results. Although the garli web service has been used by the research community for over three years, here we formally announce the availability of the service, describe its capabilities, highlight new features and recent improvements, and provide details about how the grid system efficiently delivers high-quality phylogenetic results. [garli, gateway, grid computing, maximum likelihood, molecular evolution portal, phylogenetics, web service.] PMID:24789072
Extending Strong Scaling of Quantum Monte Carlo to the Exascale
NASA Astrophysics Data System (ADS)
Shulenburger, Luke; Baczewski, Andrew; Luo, Ye; Romero, Nichols; Kent, Paul
Quantum Monte Carlo is one of the most accurate and most computationally expensive methods for solving the electronic structure problem. In spite of its significant computational expense, its massively parallel nature is ideally suited to petascale computers which have enabled a wide range of applications to relatively large molecular and extended systems. Exascale capabilities have the potential to enable the application of QMC to significantly larger systems, capturing much of the complexity of real materials such as defects and impurities. However, both memory and computational demands will require significant changes to current algorithms to realize this possibility. This talk will detail both the causes of the problem and potential solutions. Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corp, a wholly owned subsidiary of Lockheed Martin Corp, for the US Department of Energys National Nuclear Security Administration under contract DE-AC04-94AL85000.
Accelerating Full Configuration Interaction Calculations for Nuclear Structure
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Chao; Sternberg, Philip; Maris, Pieter
2008-04-14
One of the emerging computational approaches in nuclear physics is the full configuration interaction (FCI) method for solving the many-body nuclear Hamiltonian in a sufficiently large single-particle basis space to obtain exact answers - either directly or by extrapolation. The lowest eigenvalues and correspondingeigenvectors for very large, sparse and unstructured nuclear Hamiltonian matrices are obtained and used to evaluate additional experimental quantities. These matrices pose a significant challenge to the design and implementation of efficient and scalable algorithms for obtaining solutions on massively parallel computer systems. In this paper, we describe the computational strategies employed in a state-of-the-art FCI codemore » MFDn (Many Fermion Dynamics - nuclear) as well as techniques we recently developed to enhance the computational efficiency of MFDn. We will demonstrate the current capability of MFDn and report the latest performance improvement we have achieved. We will also outline our future research directions.« less
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Yang, Tzuhsiung; Berry, John F
2018-06-04
The computation of nuclear second derivatives of energy, or the nuclear Hessian, is an essential routine in quantum chemical investigations of ground and transition states, thermodynamic calculations, and molecular vibrations. Analytic nuclear Hessian computations require the resolution of costly coupled-perturbed self-consistent field (CP-SCF) equations, while numerical differentiation of analytic first derivatives has an unfavorable 6 N ( N = number of atoms) prefactor. Herein, we present a new method in which grid computing is used to accelerate and/or enable the evaluation of the nuclear Hessian via numerical differentiation: NUMFREQ@Grid. Nuclear Hessians were successfully evaluated by NUMFREQ@Grid at the DFT level as well as using RIJCOSX-ZORA-MP2 or RIJCOSX-ZORA-B2PLYP for a set of linear polyacenes with systematically increasing size. For the larger members of this group, NUMFREQ@Grid was found to outperform the wall clock time of analytic Hessian evaluation; at the MP2 or B2LYP levels, these Hessians cannot even be evaluated analytically. We also evaluated a 156-atom catalytically relevant open-shell transition metal complex and found that NUMFREQ@Grid is faster (7.7 times shorter wall clock time) and less demanding (4.4 times less memory requirement) than an analytic Hessian. Capitalizing on the capabilities of parallel grid computing, NUMFREQ@Grid can outperform analytic methods in terms of wall time, memory requirements, and treatable system size. The NUMFREQ@Grid method presented herein demonstrates how grid computing can be used to facilitate embarrassingly parallel computational procedures and is a pioneer for future implementations.
Parallel solution of sparse one-dimensional dynamic programming problems
NASA Technical Reports Server (NTRS)
Nicol, David M.
1989-01-01
Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.
Jackin, Boaz Jessie; Watanabe, Shinpei; Ootsu, Kanemitsu; Ohkawa, Takeshi; Yokota, Takashi; Hayasaki, Yoshio; Yatagai, Toyohiko; Baba, Takanobu
2018-04-20
A parallel computation method for large-size Fresnel computer-generated hologram (CGH) is reported. The method was introduced by us in an earlier report as a technique for calculating Fourier CGH from 2D object data. In this paper we extend the method to compute Fresnel CGH from 3D object data. The scale of the computation problem is also expanded to 2 gigapixels, making it closer to real application requirements. The significant feature of the reported method is its ability to avoid communication overhead and thereby fully utilize the computing power of parallel devices. The method exhibits three layers of parallelism that favor small to large scale parallel computing machines. Simulation and optical experiments were conducted to demonstrate the workability and to evaluate the efficiency of the proposed technique. A two-times improvement in computation speed has been achieved compared to the conventional method, on a 16-node cluster (one GPU per node) utilizing only one layer of parallelism. A 20-times improvement in computation speed has been estimated utilizing two layers of parallelism on a very large-scale parallel machine with 16 nodes, where each node has 16 GPUs.
Java Performance for Scientific Applications on LLNL Computer Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kapfer, C; Wissink, A
2002-05-10
Languages in use for high performance computing at the laboratory--Fortran (f77 and f90), C, and C++--have many years of development behind them and are generally considered the fastest available. However, Fortran and C do not readily extend to object-oriented programming models, limiting their capability for very complex simulation software. C++ facilitates object-oriented programming but is a very complex and error-prone language. Java offers a number of capabilities that these other languages do not. For instance it implements cleaner (i.e., easier to use and less prone to errors) object-oriented models than C++. It also offers networking and security as part ofmore » the language standard, and cross-platform executables that make it architecture neutral, to name a few. These features have made Java very popular for industrial computing applications. The aim of this paper is to explain the trade-offs in using Java for large-scale scientific applications at LLNL. Despite its advantages, the computational science community has been reluctant to write large-scale computationally intensive applications in Java due to concerns over its poor performance. However, considerable progress has been made over the last several years. The Java Grande Forum [1] has been promoting the use of Java for large-scale computing. Members have introduced efficient array libraries, developed fast just-in-time (JIT) compilers, and built links to existing packages used in high performance parallel computing.« less
Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs
NASA Astrophysics Data System (ADS)
Wang, D.; Zhang, J.; Wei, Y.
2013-12-01
As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs are getting higher, in-situ and/or post- processing such large amount of geospatial data increasingly becomes a bottleneck in scientific inquires of Earth systems and their human impacts. Existing geospatial techniques that are based on outdated computing models (e.g., serial algorithms and disk-resident systems), as have been implemented in many commercial and open source packages, are incapable of processing large-scale geospatial data and achieve desired level of performance. In this study, we have developed a set of parallel data structures and algorithms that are capable of utilizing massively data parallel computing power available on commodity Graphics Processing Units (GPUs) for a popular geospatial technique called Zonal Statistics. Given two input datasets with one representing measurements (e.g., temperature or precipitation) and the other one represent polygonal zones (e.g., ecological or administrative zones), Zonal Statistics computes major statistics (or complete distribution histograms) of the measurements in all regions. Our technique has four steps and each step can be mapped to GPU hardware by identifying its inherent data parallelisms. First, a raster is divided into blocks and per-block histograms are derived. Second, the Minimum Bounding Boxes (MBRs) of polygons are computed and are spatially matched with raster blocks; matched polygon-block pairs are tested and blocks that are either inside or intersect with polygons are identified. Third, per-block histograms are aggregated to polygons for blocks that are completely within polygons. Finally, for blocks that intersect with polygon boundaries, all the raster cells within the blocks are examined using point-in-polygon-test and cells that are within polygons are used to update corresponding histograms. As the task becomes I/O bound after applying spatial indexing and GPU hardware acceleration, we have developed a GPU-based data compression technique by reusing our previous work on Bitplane Quadtree (or BPQ-Tree) based indexing of binary bitmaps. Results have shown that our GPU-based parallel Zonal Statistic technique on 3000+ US counties over 20+ billion NASA SRTM 30 meter resolution Digital Elevation (DEM) raster cells has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds a low-end workstation equipped with a Nvidia GTX Titan GPU using cold and hot cache, respectively; and, 60-70 seconds using a single OLCF TITAN computing node and 10-15 seconds using 8 nodes. Our experiment results clearly show the potentials of using high-end computing facilities for large-scale geospatial processing.
Generating performance portable geoscientific simulation code with Firedrake (Invited)
NASA Astrophysics Data System (ADS)
Ham, D. A.; Bercea, G.; Cotter, C. J.; Kelly, P. H.; Loriant, N.; Luporini, F.; McRae, A. T.; Mitchell, L.; Rathgeber, F.
2013-12-01
This presentation will demonstrate how a change in simulation programming paradigm can be exploited to deliver sophisticated simulation capability which is far easier to programme than are conventional models, is capable of exploiting different emerging parallel hardware, and is tailored to the specific needs of geoscientific simulation. Geoscientific simulation represents a grand challenge computational task: many of the largest computers in the world are tasked with this field, and the requirements of resolution and complexity of scientists in this field are far from being sated. However, single thread performance has stalled, even sometimes decreased, over the last decade, and has been replaced by ever more parallel systems: both as conventional multicore CPUs and in the emerging world of accelerators. At the same time, the needs of scientists to couple ever-more complex dynamics and parametrisations into their models makes the model development task vastly more complex. The conventional approach of writing code in low level languages such as Fortran or C/C++ and then hand-coding parallelism for different platforms by adding library calls and directives forces the intermingling of the numerical code with its implementation. This results in an almost impossible set of skill requirements for developers, who must simultaneously be domain science experts, numericists, software engineers and parallelisation specialists. Even more critically, it requires code to be essentially rewritten for each emerging hardware platform. Since new platforms are emerging constantly, and since code owners do not usually control the procurement of the supercomputers on which they must run, this represents an unsustainable development load. The Firedrake system, conversely, offers the developer the opportunity to write PDE discretisations in the high-level mathematical language UFL from the FEniCS project (http://fenicsproject.org). Non-PDE model components, such as parametrisations, can be written as short C kernels operating locally on the underlying mesh, with no explicit parallelism. The executable code is then generated in C, CUDA or OpenCL and executed in parallel on the target architecture. The system also offers features of special relevance to the geosciences. In particular, the large scale separation between the vertical and horizontal directions in many geoscientific processes can be exploited to offer the flexibility of unstructured meshes in the horizontal direction, without the performance penalty usually associated with those methods.
MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program
NASA Astrophysics Data System (ADS)
Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.
2018-02-01
We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution, however, the efficiency in terms of the computing resource usage decreases with increasing the number of processors used in the parallel computing.
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-10-29
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU
NASA Astrophysics Data System (ADS)
Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha
2018-03-01
Since Graphic Processing Unit (GPU) has a strong ability of floating-point computation and memory bandwidth for data parallelism, it has been widely used in the areas of common computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of compiling program, brings the great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. As we can see, GPUs are relatively rich in compute capacity but poor in memory capacity and the CPUs do the opposite. We need to make full use of the GPUs and CPUs, so a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrate that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow, it decreases for turbulent flow, but it still can reach more than 20. What’s more, the speedup increases as the grid size becomes larger.
Software Aids Visualization of Computed Unsteady Flow
NASA Technical Reports Server (NTRS)
Kao, David; Kenwright, David
2003-01-01
Unsteady Flow Analysis Toolkit (UFAT) is a computer program that synthesizes motions of time-dependent flows represented by very large sets of data generated in computational fluid dynamics simulations. Prior to the development of UFAT, it was necessary to rely on static, single-snapshot depictions of time-dependent flows generated by flow-visualization software designed for steady flows. Whereas it typically takes weeks to analyze the results of a largescale unsteady-flow simulation by use of steady-flow visualization software, the analysis time is reduced to hours when UFAT is used. UFAT can be used to generate graphical objects of flow visualization results using multi-block curvilinear grids in the format of a previously developed NASA data-visualization program, PLOT3D. These graphical objects can be rendered using FAST, another popular flow visualization software developed at NASA. Flow-visualization techniques that can be exploited by use of UFAT include time-dependent tracking of particles, detection of vortex cores, extractions of stream ribbons and surfaces, and tetrahedral decomposition for optimal particle tracking. Unique computational features of UFAT include capabilities for automatic (batch) processing, restart, memory mapping, and parallel processing. These capabilities significantly reduce analysis time and storage requirements, relative to those of prior flow-visualization software. UFAT can be executed on a variety of supercomputers.
NASA Technical Reports Server (NTRS)
Scheper, C.; Baker, R.; Frank, G.; Yalamanchili, S.; Gray, G.
1992-01-01
Systems for Space Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures using candidate SDI weapons-to-target assignment algorithms as workloads were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed and capabilities that will be required for both individual tools and an integrated toolset were identified.
Fine grained event processing on HPCs with the ATLAS Yoda system
NASA Astrophysics Data System (ADS)
Calafiura, Paolo; De, Kaushik; Guan, Wen; Maeno, Tadashi; Nilsson, Paul; Oleynik, Danila; Panitkin, Sergey; Tsulaia, Vakhtang; Van Gemmeren, Peter; Wenaus, Torre
2015-12-01
High performance computing facilities present unique challenges and opportunities for HEP event processing. The massive scale of many HPC systems means that fractionally small utilization can yield large returns in processing throughput. Parallel applications which can dynamically and efficiently fill any scheduling opportunities the resource presents benefit both the facility (maximal utilization) and the (compute-limited) science. The ATLAS Yoda system provides this capability to HEP-like event processing applications by implementing event-level processing in an MPI-based master-client model that integrates seamlessly with the more broadly scoped ATLAS Event Service. Fine grained, event level work assignments are intelligently dispatched to parallel workers to sustain full utilization on all cores, with outputs streamed off to destination object stores in near real time with similarly fine granularity, such that processing can proceed until termination with full utilization. The system offers the efficiency and scheduling flexibility of preemption without requiring the application actually support or employ check-pointing. We will present the new Yoda system, its motivations, architecture, implementation, and applications in ATLAS data processing at several US HPC centers.
Systems-on-chip approach for real-time simulation of wheel-rail contact laws
NASA Astrophysics Data System (ADS)
Mei, T. X.; Zhou, Y. J.
2013-04-01
This paper presents the development of a systems-on-chip approach to speed up the simulation of wheel-rail contact laws, which can be used to reduce the requirement for high-performance computers and enable simulation in real time for the use of hardware-in-loop for experimental studies of the latest vehicle dynamic and control technologies. The wheel-rail contact laws are implemented using a field programmable gate array (FPGA) device with a design that substantially outperforms modern general-purpose PC platforms or fixed architecture digital signal processor devices in terms of processing time, configuration flexibility and cost. In order to utilise the FPGA's parallel-processing capability, the operations in the contact laws algorithms are arranged in a parallel manner and multi-contact patches are tackled simultaneously in the design. The interface between the FPGA device and the host PC is achieved by using a high-throughput and low-latency Ethernet link. The development is based on FASTSIM algorithms, although the design can be adapted and expanded for even more computationally demanding tasks.
Molecular Dynamics of Hot Dense Plasmas: New Horizons
NASA Astrophysics Data System (ADS)
Graziani, Frank
2011-10-01
We describe the status of a new time-dependent simulation capability for hot dense plasmas. The backbone of this multi-institutional computational and experimental effort--the Cimarron Project--is the massively parallel molecular dynamics (MD) code ``ddcMD''. The project's focus is material conditions such as exist in inertial confinement fusion experiments, and in many stellar interiors: high temperatures, high densities, significant electromagnetic fields, mixtures of high- and low- Zelements, and non-Maxwellian particle distributions. Of particular importance is our ability to incorporate into this classical MD code key atomic, radiative, and nuclear processes, so that their interacting effects under non-ideal plasma conditions can be investigated. This talk summarizes progress in computational methodology, discusses strengths and weaknesses of quantum statistical potentials as effective interactions for MD, explains the model used for quantum events possibly occurring in a collision and highlights some significant results obtained to date. We describe the status of a new time-dependent simulation capability for hot dense plasmas. The backbone of this multi-institutional computational and experimental effort--the Cimarron Project--is the massively parallel molecular dynamics (MD) code ``ddcMD''. The project's focus is material conditions such as exist in inertial confinement fusion experiments, and in many stellar interiors: high temperatures, high densities, significant electromagnetic fields, mixtures of high- and low- Zelements, and non-Maxwellian particle distributions. Of particular importance is our ability to incorporate into this classical MD code key atomic, radiative, and nuclear processes, so that their interacting effects under non-ideal plasma conditions can be investigated. This talk summarizes progress in computational methodology, discusses strengths and weaknesses of quantum statistical potentials as effective interactions for MD, explains the model used for quantum events possibly occurring in a collision and highlights some significant results obtained to date. This work is performed under the auspices of the U. S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gooding, Thomas M.
Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications linkmore » over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.« less
NASA Technical Reports Server (NTRS)
Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)
1990-01-01
Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
NASA Astrophysics Data System (ADS)
Iacobucci, Joseph V.
The research objective for this manuscript is to develop a Rapid Architecture Alternative Modeling (RAAM) methodology to enable traceable Pre-Milestone A decision making during the conceptual phase of design of a system of systems. Rather than following current trends that place an emphasis on adding more analysis which tends to increase the complexity of the decision making problem, RAAM improves on current methods by reducing both runtime and model creation complexity. RAAM draws upon principles from computer science, system architecting, and domain specific languages to enable the automatic generation and evaluation of architecture alternatives. For example, both mission dependent and mission independent metrics are considered. Mission dependent metrics are determined by the performance of systems accomplishing a task, such as Probability of Success. In contrast, mission independent metrics, such as acquisition cost, are solely determined and influenced by the other systems in the portfolio. RAAM also leverages advances in parallel computing to significantly reduce runtime by defining executable models that are readily amendable to parallelization. This allows the use of cloud computing infrastructures such as Amazon's Elastic Compute Cloud and the PASTEC cluster operated by the Georgia Institute of Technology Research Institute (GTRI). Also, the amount of data that can be generated when fully exploring the design space can quickly exceed the typical capacity of computational resources at the analyst's disposal. To counter this, specific algorithms and techniques are employed. Streaming algorithms and recursive architecture alternative evaluation algorithms are used that reduce computer memory requirements. Lastly, a domain specific language is created to provide a reduction in the computational time of executing the system of systems models. A domain specific language is a small, usually declarative language that offers expressive power focused on a particular problem domain by establishing an effective means to communicate the semantics from the RAAM framework. These techniques make it possible to include diverse multi-metric models within the RAAM framework in addition to system and operational level trades. A canonical example was used to explore the uses of the methodology. The canonical example contains all of the features of a full system of systems architecture analysis study but uses fewer tasks and systems. Using RAAM with the canonical example it was possible to consider both system and operational level trades in the same analysis. Once the methodology had been tested with the canonical example, a Suppression of Enemy Air Defenses (SEAD) capability model was developed. Due to the sensitive nature of analyses on that subject, notional data was developed. The notional data has similar trends and properties to realistic Suppression of Enemy Air Defenses data. RAAM was shown to be traceable and provided a mechanism for a unified treatment of a variety of metrics. The SEAD capability model demonstrated lower computer runtimes and reduced model creation complexity as compared to methods currently in use. To determine the usefulness of the implementation of the methodology on current computing hardware, RAAM was tested with system of system architecture studies of different sizes. This was necessary since system of systems may be called upon to accomplish thousands of tasks. It has been clearly demonstrated that RAAM is able to enumerate and evaluate the types of large, complex design spaces usually encountered in capability based design, oftentimes providing the ability to efficiently search the entire decision space. The core algorithms for generation and evaluation of alternatives scale linearly with expected problem sizes. The SEAD capability model outputs prompted the discovery a new issue, the data storage and manipulation requirements for an analysis. Two strategies were developed to counter large data sizes, the use of portfolio views and top 'n' analysis. This proved the usefulness of the RAAM framework and methodology during Pre-Milestone A capability based analysis. (Abstract shortened by UMI.).
Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing
Kang, Mikyung; Kang, Dong-In; Crago, Stephen P.; Park, Gyung-Leen; Lee, Junghoon
2011-01-01
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter optimizing its computing configuration based on the analyzed data. PMID:22163811
Design and development of a run-time monitor for multi-core architectures in cloud computing.
Kang, Mikyung; Kang, Dong-In; Crago, Stephen P; Park, Gyung-Leen; Lee, Junghoon
2011-01-01
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter optimizing its computing configuration based on the analyzed data.
NASA Astrophysics Data System (ADS)
Ohene-Kwofie, Daniel; Otoo, Ekow
2015-10-01
The ATLAS detector, operated at the Large Hadron Collider (LHC) records proton-proton collisions at CERN every 50ns resulting in a sustained data flow up to PB/s. The upgraded Tile Calorimeter of the ATLAS experiment will sustain about 5PB/s of digital throughput. These massive data rates require extremely fast data capture and processing. Although there has been a steady increase in the processing speed of CPU/GPGPU assembled for high performance computing, the rate of data input and output, even under parallel I/O, has not kept up with the general increase in computing speeds. The problem then is whether one can implement an I/O subsystem infrastructure capable of meeting the computational speeds of the advanced computing systems at the petascale and exascale level. We propose a system architecture that leverages the Partitioned Global Address Space (PGAS) model of computing to maintain an in-memory data-store for the Processing Unit (PU) of the upgraded electronics of the Tile Calorimeter which is proposed to be used as a high throughput general purpose co-processor to the sROD of the upgraded Tile Calorimeter. The physical memory of the PUs are aggregated into a large global logical address space using RDMA- capable interconnects such as PCI- Express to enhance data processing throughput.
Parallel mutual information estimation for inferring gene regulatory networks on GPUs
2011-01-01
Background Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced B-spline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity. Results We present a new approach to accelerate the B-spline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDA-MI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multi-threaded implementation on a quad-core CPU for large microarray datasets. We have used the results obtained by CUDA-MI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDA-MI produces GRNs of higher quality in less time. Conclusions CUDA-MI is publicly available open-source software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multi-threaded implementation by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:21672264
BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark
Gulzar, Muhammad Ali; Interlandi, Matteo; Yoo, Seunghyun; Tetali, Sai Deep; Condie, Tyson; Millstein, Todd; Kim, Miryung
2016-01-01
Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today’s data-centers is time consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform. This requires re-thinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time consuming for an end user. First, BIGDEBUG’s simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BIGDEBUG scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BIGDEBUG supports debugging at interactive speeds with minimal performance impact. PMID:27390389
A GPU-accelerated implicit meshless method for compressible flows
NASA Astrophysics Data System (ADS)
Zhang, Jia-Le; Ma, Zhi-Hua; Chen, Hong-Quan; Cao, Cheng
2018-05-01
This paper develops a recently proposed GPU based two-dimensional explicit meshless method (Ma et al., 2014) by devising and implementing an efficient parallel LU-SGS implicit algorithm to further improve the computational efficiency. The capability of the original 2D meshless code is extended to deal with 3D complex compressible flow problems. To resolve the inherent data dependency of the standard LU-SGS method, which causes thread-racing conditions destabilizing numerical computation, a generic rainbow coloring method is presented and applied to organize the computational points into different groups by painting neighboring points with different colors. The original LU-SGS method is modified and parallelized accordingly to perform calculations in a color-by-color manner. The CUDA Fortran programming model is employed to develop the key kernel functions to apply boundary conditions, calculate time steps, evaluate residuals as well as advance and update the solution in the temporal space. A series of two- and three-dimensional test cases including compressible flows over single- and multi-element airfoils and a M6 wing are carried out to verify the developed code. The obtained solutions agree well with experimental data and other computational results reported in the literature. Detailed analysis on the performance of the developed code reveals that the developed CPU based implicit meshless method is at least four to eight times faster than its explicit counterpart. The computational efficiency of the implicit method could be further improved by ten to fifteen times on the GPU.
NASA Technical Reports Server (NTRS)
Hall, Edward J.; Heidegger, Nathan J.; Delaney, Robert A.
1999-01-01
The overall objective of this study was to evaluate the effects of turbulence models in a 3-D numerical analysis on the wake prediction capability. The current version of the computer code resulting from this study is referred to as ADPAC v7 (Advanced Ducted Propfan Analysis Codes -Version 7). This report is intended to serve as a computer program user's manual for the ADPAC code used and modified under Task 15 of NASA Contract NAS3-27394. The ADPAC program is based on a flexible multiple-block and discretization scheme permitting coupled 2-D/3-D mesh block solutions with application to a wide variety of geometries. Aerodynamic calculations are based on a four-stage Runge-Kutta time-marching finite volume solution technique with added numerical dissipation. Steady flow predictions are accelerated by a multigrid procedure. Turbulence models now available in the ADPAC code are: a simple mixing-length model, the algebraic Baldwin-Lomax model with user defined coefficients, the one-equation Spalart-Allmaras model, and a two-equation k-R model. The consolidated ADPAC code is capable of executing in either a serial or parallel computing mode from a single source code.
Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho
2014-01-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting. PMID:27081299
Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho
2014-11-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting.
Wu, Xiao-Lin; Sun, Chuanyu; Beissinger, Timothy M; Rosa, Guilherme Jm; Weigel, Kent A; Gatti, Natalia de Leon; Gianola, Daniel
2012-09-25
Most Bayesian models for the analysis of complex traits are not analytically tractable and inferences are based on computationally intensive techniques. This is true of Bayesian models for genome-enabled selection, which uses whole-genome molecular data to predict the genetic merit of candidate animals for breeding purposes. In this regard, parallel computing can overcome the bottlenecks that can arise from series computing. Hence, a major goal of the present study is to bridge the gap to high-performance Bayesian computation in the context of animal breeding and genetics. Parallel Monte Carlo Markov chain algorithms and strategies are described in the context of animal breeding and genetics. Parallel Monte Carlo algorithms are introduced as a starting point including their applications to computing single-parameter and certain multiple-parameter models. Then, two basic approaches for parallel Markov chain Monte Carlo are described: one aims at parallelization within a single chain; the other is based on running multiple chains, yet some variants are discussed as well. Features and strategies of the parallel Markov chain Monte Carlo are illustrated using real data, including a large beef cattle dataset with 50K SNP genotypes. Parallel Markov chain Monte Carlo algorithms are useful for computing complex Bayesian models, which does not only lead to a dramatic speedup in computing but can also be used to optimize model parameters in complex Bayesian models. Hence, we anticipate that use of parallel Markov chain Monte Carlo will have a profound impact on revolutionizing the computational tools for genomic selection programs.
2012-01-01
Background Most Bayesian models for the analysis of complex traits are not analytically tractable and inferences are based on computationally intensive techniques. This is true of Bayesian models for genome-enabled selection, which uses whole-genome molecular data to predict the genetic merit of candidate animals for breeding purposes. In this regard, parallel computing can overcome the bottlenecks that can arise from series computing. Hence, a major goal of the present study is to bridge the gap to high-performance Bayesian computation in the context of animal breeding and genetics. Results Parallel Monte Carlo Markov chain algorithms and strategies are described in the context of animal breeding and genetics. Parallel Monte Carlo algorithms are introduced as a starting point including their applications to computing single-parameter and certain multiple-parameter models. Then, two basic approaches for parallel Markov chain Monte Carlo are described: one aims at parallelization within a single chain; the other is based on running multiple chains, yet some variants are discussed as well. Features and strategies of the parallel Markov chain Monte Carlo are illustrated using real data, including a large beef cattle dataset with 50K SNP genotypes. Conclusions Parallel Markov chain Monte Carlo algorithms are useful for computing complex Bayesian models, which does not only lead to a dramatic speedup in computing but can also be used to optimize model parameters in complex Bayesian models. Hence, we anticipate that use of parallel Markov chain Monte Carlo will have a profound impact on revolutionizing the computational tools for genomic selection programs. PMID:23009363
Parallel approach in RDF query processing
NASA Astrophysics Data System (ADS)
Vajgl, Marek; Parenica, Jan
2017-07-01
Parallel approach is nowadays a very cheap solution to increase computational power due to possibility of usage of multithreaded computational units. This hardware became typical part of nowadays personal computers or notebooks and is widely spread. This contribution deals with experiments how evaluation of computational complex algorithm of the inference over RDF data can be parallelized over graphical cards to decrease computational time.
NASA Astrophysics Data System (ADS)
Duffy, D.; Maxwell, T. P.; Doutriaux, C.; Williams, D. N.; Chaudhary, A.; Ames, S.
2015-12-01
As the size of remote sensing observations and model output data grows, the volume of the data has become overwhelming, even to many scientific experts. As societies are forced to better understand, mitigate, and adapt to climate changes, the combination of Earth observation data and global climate model projects is crucial to not only scientists but to policy makers, downstream applications, and even the public. Scientific progress on understanding climate is critically dependent on the availability of a reliable infrastructure that promotes data access, management, and provenance. The Earth System Grid Federation (ESGF) has created such an environment for the Intergovernmental Panel on Climate Change (IPCC). ESGF provides a federated global cyber infrastructure for data access and management of model outputs generated for the IPCC Assessment Reports (AR). The current generation of the ESGF federated grid allows consumers of the data to find and download data with limited capabilities for server-side processing. Since the amount of data for future AR is expected to grow dramatically, ESGF is working on integrating server-side analytics throughout the federation. The ESGF Compute Working Team (CWT) has created a Web Processing Service (WPS) Application Programming Interface (API) to enable access scalable computational resources. The API is the exposure point to high performance computing resources across the federation. Specifically, the API allows users to execute simple operations, such as maximum, minimum, average, and anomalies, on ESGF data without having to download the data. These operations are executed at the ESGF data node site with access to large amounts of parallel computing capabilities. This presentation will highlight the WPS API, its capabilities, provide implementation details, and discuss future developments.
A computer program for uncertainty analysis integrating regression and Bayesian methods
Lu, Dan; Ye, Ming; Hill, Mary C.; Poeter, Eileen P.; Curtis, Gary
2014-01-01
This work develops a new functionality in UCODE_2014 to evaluate Bayesian credible intervals using the Markov Chain Monte Carlo (MCMC) method. The MCMC capability in UCODE_2014 is based on the FORTRAN version of the differential evolution adaptive Metropolis (DREAM) algorithm of Vrugt et al. (2009), which estimates the posterior probability density function of model parameters in high-dimensional and multimodal sampling problems. The UCODE MCMC capability provides eleven prior probability distributions and three ways to initialize the sampling process. It evaluates parametric and predictive uncertainties and it has parallel computing capability based on multiple chains to accelerate the sampling process. This paper tests and demonstrates the MCMC capability using a 10-dimensional multimodal mathematical function, a 100-dimensional Gaussian function, and a groundwater reactive transport model. The use of the MCMC capability is made straightforward and flexible by adopting the JUPITER API protocol. With the new MCMC capability, UCODE_2014 can be used to calculate three types of uncertainty intervals, which all can account for prior information: (1) linear confidence intervals which require linearity and Gaussian error assumptions and typically 10s–100s of highly parallelizable model runs after optimization, (2) nonlinear confidence intervals which require a smooth objective function surface and Gaussian observation error assumptions and typically 100s–1,000s of partially parallelizable model runs after optimization, and (3) MCMC Bayesian credible intervals which require few assumptions and commonly 10,000s–100,000s or more partially parallelizable model runs. Ready access allows users to select methods best suited to their work, and to compare methods in many circumstances.
NASA Astrophysics Data System (ADS)
Heinzeller, Dominikus; Duda, Michael G.; Kunstmann, Harald
2017-04-01
With strong financial and political support from national and international initiatives, exascale computing is projected for the end of this decade. Energy requirements and physical limitations imply the use of accelerators and the scaling out to orders of magnitudes larger numbers of cores then today to achieve this milestone. In order to fully exploit the capabilities of these Exascale computing systems, existing applications need to undergo significant development. The Model for Prediction Across Scales (MPAS) is a novel set of Earth system simulation components and consists of an atmospheric core, an ocean core, a land-ice core and a sea-ice core. Its distinct features are the use of unstructured Voronoi meshes and C-grid discretisation to address shortcomings of global models on regular grids and the use of limited area models nested in a forcing data set, with respect to parallel scalability, numerical accuracy and physical consistency. Here, we present work towards the application of the atmospheric core (MPAS-A) on current and future high performance computing systems for problems at extreme scale. In particular, we address the issue of massively parallel I/O by extending the model to support the highly scalable SIONlib library. Using global uniform meshes with a convection-permitting resolution of 2-3km, we demonstrate the ability of MPAS-A to scale out to half a million cores while maintaining a high parallel efficiency. We also demonstrate the potential benefit of a hybrid parallelisation of the code (MPI/OpenMP) on the latest generation of Intel's Many Integrated Core Architecture, the Intel Xeon Phi Knights Landing.
NASA Astrophysics Data System (ADS)
Li, Gen; Tang, Chun-An; Liang, Zheng-Zhao
2017-01-01
Multi-scale high-resolution modeling of rock failure process is a powerful means in modern rock mechanics studies to reveal the complex failure mechanism and to evaluate engineering risks. However, multi-scale continuous modeling of rock, from deformation, damage to failure, has raised high requirements on the design, implementation scheme and computation capacity of the numerical software system. This study is aimed at developing the parallel finite element procedure, a parallel rock failure process analysis (RFPA) simulator that is capable of modeling the whole trans-scale failure process of rock. Based on the statistical meso-damage mechanical method, the RFPA simulator is able to construct heterogeneous rock models with multiple mechanical properties, deal with and represent the trans-scale propagation of cracks, in which the stress and strain fields are solved for the damage evolution analysis of representative volume element by the parallel finite element method (FEM) solver. This paper describes the theoretical basis of the approach and provides the details of the parallel implementation on a Windows - Linux interactive platform. A numerical model is built to test the parallel performance of FEM solver. Numerical simulations are then carried out on a laboratory-scale uniaxial compression test, and field-scale net fracture spacing and engineering-scale rock slope examples, respectively. The simulation results indicate that relatively high speedup and computation efficiency can be achieved by the parallel FEM solver with a reasonable boot process. In laboratory-scale simulation, the well-known physical phenomena, such as the macroscopic fracture pattern and stress-strain responses, can be reproduced. In field-scale simulation, the formation process of net fracture spacing from initiation, propagation to saturation can be revealed completely. In engineering-scale simulation, the whole progressive failure process of the rock slope can be well modeled. It is shown that the parallel FE simulator developed in this study is an efficient tool for modeling the whole trans-scale failure process of rock from meso- to engineering-scale.
NASA Technical Reports Server (NTRS)
Montag, Bruce C.; Bishop, Alfred M.; Redfield, Joe B.
1989-01-01
The findings of a preliminary investigation by Southwest Research Institute (SwRI) in simulation host computer concepts is presented. It is designed to aid NASA in evaluating simulation technologies for use in spaceflight training. The focus of the investigation is on the next generation of space simulation systems that will be utilized in training personnel for Space Station Freedom operations. SwRI concludes that NASA should pursue a distributed simulation host computer system architecture for the Space Station Training Facility (SSTF) rather than a centralized mainframe based arrangement. A distributed system offers many advantages and is seen by SwRI as the only architecture that will allow NASA to achieve established functional goals and operational objectives over the life of the Space Station Freedom program. Several distributed, parallel computing systems are available today that offer real-time capabilities for time critical, man-in-the-loop simulation. These systems are flexible in terms of connectivity and configurability, and are easily scaled to meet increasing demands for more computing power.
A comparative study of serial and parallel aeroelastic computations of wings
NASA Technical Reports Server (NTRS)
Byun, Chansup; Guruswamy, Guru P.
1994-01-01
A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.
Research in parallel computing
NASA Technical Reports Server (NTRS)
Ortega, James M.; Henderson, Charles
1994-01-01
This report summarizes work on parallel computations for NASA Grant NAG-1-1529 for the period 1 Jan. - 30 June 1994. Short summaries on highly parallel preconditioners, target-specific parallel reductions, and simulation of delta-cache protocols are provided.
Performance Characterization of Global Address Space Applications: A Case Study with NWChem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hammond, Jeffrey R.; Krishnamoorthy, Sriram; Shende, Sameer
The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package which depends on the Global Arrays/ ARMCI suite for partitioned global address space functionality to deliver high-end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms. Themore » research involved both the integration of performance instrumentation and measurement in the NWChem software, as well as the analysis of one-sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large-scale clusters using different generation Infiniband interconnects and x86 processors. The performance analysis and results show how subtle changes in the runtime parameters related to the communication subsystem could have significant impact on performance behavior. The tool has successfully identified several algorithmic bottlenecks which are already being tackled by computational chemists to improve NWChem performance.« less
Aquilante, Francesco; Autschbach, Jochen; Carlson, Rebecca K; Chibotaru, Liviu F; Delcey, Mickaël G; De Vico, Luca; Fdez Galván, Ignacio; Ferré, Nicolas; Frutos, Luis Manuel; Gagliardi, Laura; Garavelli, Marco; Giussani, Angelo; Hoyer, Chad E; Li Manni, Giovanni; Lischka, Hans; Ma, Dongxia; Malmqvist, Per Åke; Müller, Thomas; Nenov, Artur; Olivucci, Massimo; Pedersen, Thomas Bondo; Peng, Daoling; Plasser, Felix; Pritchard, Ben; Reiher, Markus; Rivalta, Ivan; Schapiro, Igor; Segarra-Martí, Javier; Stenrup, Michael; Truhlar, Donald G; Ungur, Liviu; Valentini, Alessio; Vancoillie, Steven; Veryazov, Valera; Vysotskiy, Victor P; Weingart, Oliver; Zapata, Felipe; Lindh, Roland
2016-02-15
In this report, we summarize and describe the recent unique updates and additions to the Molcas quantum chemistry program suite as contained in release version 8. These updates include natural and spin orbitals for studies of magnetic properties, local and linear scaling methods for the Douglas-Kroll-Hess transformation, the generalized active space concept in MCSCF methods, a combination of multiconfigurational wave functions with density functional theory in the MC-PDFT method, additional methods for computation of magnetic properties, methods for diabatization, analytical gradients of state average complete active space SCF in association with density fitting, methods for constrained fragment optimization, large-scale parallel multireference configuration interaction including analytic gradients via the interface to the Columbus package, and approximations of the CASPT2 method to be used for computations of large systems. In addition, the report includes the description of a computational machinery for nonlinear optical spectroscopy through an interface to the QM/MM package Cobramm. Further, a module to run molecular dynamics simulations is added, two surface hopping algorithms are included to enable nonadiabatic calculations, and the DQ method for diabatization is added. Finally, we report on the subject of improvements with respects to alternative file options and parallelization. © 2015 Wiley Periodicals, Inc.
ALEGRA -- A massively parallel h-adaptive code for solid dynamics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Summers, R.M.; Wong, M.K.; Boucheron, E.A.
1997-12-31
ALEGRA is a multi-material, arbitrary-Lagrangian-Eulerian (ALE) code for solid dynamics designed to run on massively parallel (MP) computers. It combines the features of modern Eulerian shock codes, such as CTH, with modern Lagrangian structural analysis codes using an unstructured grid. ALEGRA is being developed for use on the teraflop supercomputers to conduct advanced three-dimensional (3D) simulations of shock phenomena important to a variety of systems. ALEGRA was designed with the Single Program Multiple Data (SPMD) paradigm, in which the mesh is decomposed into sub-meshes so that each processor gets a single sub-mesh with approximately the same number of elements. Usingmore » this approach the authors have been able to produce a single code that can scale from one processor to thousands of processors. A current major effort is to develop efficient, high precision simulation capabilities for ALEGRA, without the computational cost of using a global highly resolved mesh, through flexible, robust h-adaptivity of finite elements. H-adaptivity is the dynamic refinement of the mesh by subdividing elements, thus changing the characteristic element size and reducing numerical error. The authors are working on several major technical challenges that must be met to make effective use of HAMMER on MP computers.« less
Design of a massively parallel computer using bit serial processing elements
NASA Technical Reports Server (NTRS)
Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing
1995-01-01
A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.
NASA Technical Reports Server (NTRS)
Horvath, Joan C.; Alkalaj, Leon J.; Schneider, Karl M.; Amador, Arthur V.; Spitale, Joseph N.
1993-01-01
Robotic spacecraft are controlled by sets of commands called 'sequences.' These sequences must be checked against mission constraints. Making our existing constraint checking program faster would enable new capabilities in our uplink process. Therefore, we are rewriting this program to run on a parallel computer. To do so, we had to determine how to run constraint-checking algorithms in parallel and create a new method of specifying spacecraft models and constraints. This new specification gives us a means of representing flight systems and their predicted response to commands which could be used in a variety of applications throughout the command process, particularly during anomaly or high-activity operations. This commonality could reduce operations cost and risk for future complex missions. Lessons learned in applying some parts of this system to the TOPEX/Poseidon mission will be described.
Evolution of a minimal parallel programming model
Lusk, Ewing; Butler, Ralph; Pieper, Steven C.
2017-04-30
Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generalitymore » and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.« less
Planning in subsumption architectures
NASA Technical Reports Server (NTRS)
Chalfant, Eugene C.
1994-01-01
A subsumption planner using a parallel distributed computational paradigm based on the subsumption architecture for control of real-world capable robots is described. Virtual sensor state space is used as a planning tool to visualize the robot's anticipated effect on its environment. Decision sequences are generated based on the environmental situation expected at the time the robot must commit to a decision. Between decision points, the robot performs in a preprogrammed manner. A rudimentary, domain-specific partial world model contains enough information to extrapolate the end results of the rote behavior between decision points. A collective network of predictors operates in parallel with the reactive network forming a recurrrent network which generates plans as a hierarchy. Details of a plan segment are generated only when its execution is imminent. The use of the subsumption planner is demonstrated by a simple maze navigation problem.
Accelerated numerical processing of electronically recorded holograms with reduced speckle noise.
Trujillo, Carlos; Garcia-Sucerquia, Jorge
2013-09-01
The numerical reconstruction of digitally recorded holograms suffers from speckle noise. An accelerated method that uses general-purpose computing in graphics processing units to reduce that noise is shown. The proposed methodology utilizes parallelized algorithms to record, reconstruct, and superimpose multiple uncorrelated holograms of a static scene. For the best tradeoff between reduction of the speckle noise and processing time, the method records, reconstructs, and superimposes six holograms of 1024 × 1024 pixels in 68 ms; for this case, the methodology reduces the speckle noise by 58% compared with that exhibited by a single hologram. The fully parallelized method running on a commodity graphics processing unit is one order of magnitude faster than the same technique implemented on a regular CPU using its multithreading capabilities. Experimental results are shown to validate the proposal.