Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Jin, Haoqiang; VanderWijngaart, Rob F.
2003-01-01
We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in benchmarks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
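To make the two-level strategy concrete, here is a minimal hybrid MPI+OpenMP sketch in the spirit of the abstract, not code from NPB Multi-Zone itself: each MPI rank owns one zone and sweeps it with OpenMP threads (fine grain), and neighboring zones exchange boundary rows after each step (coarse grain). The ring topology, zone size, and stencil are illustrative assumptions.

    /* Hedged sketch: coarse-grain parallelism across zones via MPI,
     * fine-grain loop parallelism within a zone via OpenMP. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define N 64                        /* points per zone edge (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        double *zone = calloc((size_t)N * N, sizeof *zone);
        double *next = calloc((size_t)N * N, sizeof *next);
        double *out = calloc(N, sizeof *out), *in = calloc(N, sizeof *in);
        int left = (rank + size - 1) % size, right = (rank + 1) % size;

        for (int step = 0; step < 10; step++) {
            /* fine grain: OpenMP threads update this zone's interior */
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    next[i*N + j] = 0.25 * (zone[(i-1)*N + j] + zone[(i+1)*N + j]
                                          + zone[i*N + j-1] + zone[i*N + j+1]);
            double *tmp = zone; zone = next; next = tmp;

            /* coarse grain: zones exchange boundary values after each step */
            for (int j = 0; j < N; j++) out[j] = zone[(N-2)*N + j];
            MPI_Sendrecv(out, N, MPI_DOUBLE, right, 0,
                         in,  N, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int j = 0; j < N; j++) zone[j] = in[j];
        }
        free(zone); free(next); free(out); free(in);
        MPI_Finalize();
        return 0;
    }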
Unstructured Adaptive (UA) NAS Parallel Benchmark. Version 1.0
NASA Technical Reports Server (NTRS)
Feng, Huiyu; VanderWijngaart, Rob; Biswas, Rupak; Mavriplis, Catherine
2004-01-01
We present a complete specification of a new benchmark for measuring the performance of modern computer systems when solving scientific problems featuring irregular, dynamic memory accesses. It complements the existing NAS Parallel Benchmark suite. The benchmark involves the solution of a stylized heat transfer problem in a cubic domain, discretized on an adaptively refined, unstructured mesh.
Spherical harmonic results for the 3D Kobayashi Benchmark suite
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, P N; Chang, B; Hanebutte, U R
1999-03-02
Spherical harmonic solutions are presented for the Kobayashi benchmark suite. The results were obtained with Ardra, a scalable, parallel neutron transport code developed at Lawrence Livermore National Laboratory (LLNL). The calculations were performed on the IBM ASCI Blue-Pacific computer at LLNL.
Benchmarking hypercube hardware and software
NASA Technical Reports Server (NTRS)
Grunwald, Dirk C.; Reed, Daniel A.
1986-01-01
It was long a truism in computer systems design that balanced systems achieve the best performance. Message passing parallel processors are no different. To quantify the balance of a hypercube design, an experimental methodology was developed and the associated suite of benchmarks was applied to several existing hypercubes. The benchmark suite includes tests of both processor speed in the absence of internode communication and message transmission speed as a function of communication patterns.
Data Race Benchmark Collection
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liao, Chunhua; Lin, Pei-Hung; Asplund, Joshua
2017-03-21
This project is a benchmark suite of OpenMP parallel codes that have been checked for data races. The programs are marked to show which do and do not have races. This allows them to be leveraged while testing and developing race detection tools.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bailey, David H.
The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym 'NAS' originally stood for the Numerical Aerodynamic Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains 'NAS.' The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to the parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation. In part to draw attention to some of the performance reporting abuses prevalent at the time, the present author wrote a humorous essay 'Twelve Ways to Fool the Masses,' which described in a light-hearted way a number of the questionable ways in which both vendor marketing people and scientists were inflating and distorting their performance results. All of this underscored the need for an objective and scientifically defensible measure to compare performance on these systems.
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; ...
2015-12-21
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Some specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000® problems. These benchmark and scaling studies show promising results.
Comparison of Origin 2000 and Origin 3000 Using NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Turney, Raymond D.
2001-01-01
This report describes results of benchmark tests on the Origin 3000 system currently being installed at the NASA Ames National Advanced Supercomputing facility. This machine will ultimately contain 1024 R14K processors. The first part of the system, installed in November 2000 and named mendel, is an Origin 3000 with 128 R12K processors. For comparison purposes, the tests were also run on lomax, an Origin 2000 with R12K processors. The BT, LU, and SP application benchmarks in the NAS Parallel Benchmark Suite and the kernel benchmark FT were chosen to determine system performance and measure the impact of changes on the machine as it evolves. Having been written to measure performance on Computational Fluid Dynamics applications, these benchmarks are assumed appropriate to represent the NAS workload. Since the NAS runs both message-passing (MPI) codes and shared-memory, compiler-directive-type codes, both MPI and OpenMP versions of the benchmarks were used. The MPI versions used were the latest official release of the NAS Parallel Benchmarks, version 2.3. The OpenMP versions used were PBN3b2, a beta version that is in the process of being released. NPB 2.3 and PBN3b2 are technically different benchmarks, and NPB results are not directly comparable to PBN results.
An efficient parallel algorithm for matrix-vector multiplication
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hendrickson, B.; Leland, R.; Plimpton, S.
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
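For contrast with the paper's O(n/√p + log(p)) result, a hedged baseline sketch of the simpler row-block decomposition follows: each rank gathers the full vector and multiplies its block of rows, paying O(n) communication per multiply. The dense matrix, dimensions, and fill values are assumptions for illustration; the paper's hypercube algorithm and its sparse-matrix handling are not reproduced here.

    /* Baseline row-block y = A*x with a full-vector gather. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int n = 1024;               /* global order (assumed; p divides n) */
        int rows = n / p;
        double *A = malloc((size_t)rows * n * sizeof *A);
        double *x = malloc((size_t)n * sizeof *x);
        double *y = malloc((size_t)rows * sizeof *y);
        double *xloc = malloc((size_t)rows * sizeof *xloc);
        for (int i = 0; i < rows * n; i++) A[i] = 1.0 / n;
        for (int i = 0; i < rows; i++) xloc[i] = 1.0;

        /* every rank assembles the full vector, then multiplies its rows */
        MPI_Allgather(xloc, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE,
                      MPI_COMM_WORLD);
        for (int i = 0; i < rows; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++) s += A[(size_t)i * n + j] * x[j];
            y[i] = s;
        }
        free(A); free(x); free(y); free(xloc);
        MPI_Finalize();
        return 0;
    }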
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; an Mey, Dieter; Hatay, Ferhat F.
2003-01-01
With the advent of parallel hardware and software technologies, users are faced with the challenge of choosing a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is best depends on the nature of the given problem, the hardware architecture, and the available software. In this study we compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.
Spherical Harmonic Solutions to the 3D Kobayashi Benchmark Suite
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, P.N.; Chang, B.; Hanebutte, U.R.
1999-12-29
Spherical harmonic solutions of order 5, 9 and 21 on spatial grids containing up to 3.3 million cells are presented for the Kobayashi benchmark suite. This suite of three problems with simple geometry of a pure absorber with a large void region was proposed by Professor Kobayashi at an OECD/NEA meeting in 1996. Each of the three problems contains a source, a void and a shield region. Problem 1 can best be described as a box-in-a-box problem, where a source region is surrounded by a square void region which itself is embedded in a square shield region. Problems 2 and 3 represent a shield with a void duct: problem 2 has a straight duct and problem 3 a dog-leg-shaped duct. A pure absorber and a 50% scattering case are considered for each of the three problems. The solutions have been obtained with Ardra, a scalable, parallel neutron transport code developed at Lawrence Livermore National Laboratory (LLNL). The Ardra code takes advantage of a two-level parallelization strategy, which combines message passing between processing nodes and thread-based parallelism amongst processors on each node. All calculations were performed on the IBM ASCI Blue-Pacific computer at LLNL.
NAS Parallel Benchmark Results 11-96. 1.0
NASA Technical Reports Server (NTRS)
Bailey, David H.; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
The NAS Parallel Benchmarks have been developed at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a "pencil and paper" fashion. In other words, the complete details of the problem to be solved are given in a technical document, and except for a few restrictions, benchmarkers are free to select the language constructs and implementation techniques best suited for a particular system. These results represent the best that have been reported to us by the vendors for the specific systems listed. In this report, we present new NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, SGI Origin200, and SGI Origin2000. We also report High Performance Fortran (HPF) based NPB results for IBM SP2 Wide Nodes, HP/Convex Exemplar SPP2000, and SGI/CRAY T3D. These results have been submitted by Applied Parallel Research (APR) and Portland Group Inc. (PGI). We also present sustained performance per dollar for Class B LU, SP and BT benchmarks.
PFLOTRAN Verification: Development of a Testing Suite to Ensure Software Quality
NASA Astrophysics Data System (ADS)
Hammond, G. E.; Frederick, J. M.
2016-12-01
In scientific computing, code verification ensures the reliability and numerical accuracy of a model simulation by comparing the simulation results to experimental data or known analytical solutions. The model is typically defined by a set of partial differential equations with initial and boundary conditions, and verification determines whether the mathematical model is solved correctly by the software. Code verification is especially important if the software is used to model high-consequence systems which cannot be physically tested in a fully representative environment [Oberkampf and Trucano (2007)]. Justified confidence in a particular computational tool requires clarity in the exercised physics and transparency in its verification process with proper documentation. We present a quality assurance (QA) testing suite developed by Sandia National Laboratories that performs code verification for PFLOTRAN, an open source, massively parallel subsurface simulator. PFLOTRAN solves systems of generally nonlinear partial differential equations describing multiphase, multicomponent and multiscale reactive flow and transport processes in porous media. PFLOTRAN's QA test suite compares the numerical solutions of benchmark problems in heat and mass transport against known, closed-form, analytical solutions, including documentation of the exercised physical process models implemented in each PFLOTRAN benchmark simulation. The QA test suite development strives to follow the recommendations given by Oberkampf and Trucano (2007), which describe four essential elements of high-quality verification benchmark construction: (1) conceptual description, (2) mathematical description, (3) accuracy assessment, and (4) additional documentation and user information. Several QA tests within the suite will be presented, including details of the benchmark problems and their closed-form analytical solutions, implementation of benchmark problems in PFLOTRAN simulations, and the criteria used to assess PFLOTRAN's performance in the code verification procedure. References Oberkampf, W. L., and T. G. Trucano (2007), Verification and Validation Benchmarks, SAND2007-0853, 67 pgs., Sandia National Laboratories, Albuquerque, NM.
Present Status and Extensions of the Monte Carlo Performance Benchmark
NASA Astrophysics Data System (ADS)
Hoogenboom, J. Eduard; Petrovic, Bojan; Martin, William R.
2014-06-01
The NEA Monte Carlo Performance benchmark started in 2011 with the aim of monitoring, over the years, the ability to perform a full-size Monte Carlo reactor core calculation with a detailed power production for each fuel pin with axial distribution. This paper gives an overview of the contributed results thus far. It shows that reaching a statistical accuracy of 1% for most of the small fuel zones requires about 100 billion neutron histories. The efficiency of parallel execution of Monte Carlo codes on a large number of processor cores shows clear limitations for computer clusters with common compute nodes. On true supercomputers, however, the speedup of parallel calculations continues to increase up to large numbers of processor cores. More experience is needed from calculations on true supercomputers using large numbers of processors in order to predict whether the requested calculations can be done in a short time. As the specifications of the reactor geometry for this benchmark test are well suited for further investigations of full-core Monte Carlo calculations, and a need is felt for testing issues other than computational performance, proposals are presented for extending the benchmark to a suite of benchmark problems: for evaluating fission source convergence for a system with a high dominance ratio, for coupling with thermal-hydraulics calculations to evaluate the use of different temperatures and coolant densities, and for studying the correctness and effectiveness of burnup calculations. Moreover, other contemporary proposals for a full-core calculation with realistic geometry and material composition will be discussed.
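The quoted figure of about 100 billion histories for 1% statistical accuracy is consistent with the usual Monte Carlo error scaling; as a hedged back-of-envelope (the benchmark's own accounting may differ):

    \sigma_{\mathrm{rel}} \propto \frac{1}{\sqrt{N}}
    \qquad\Longrightarrow\qquad
    N_{\mathrm{target}} = N_0 \left( \frac{\sigma_0}{\sigma_{\mathrm{target}}} \right)^{2}

Under this scaling, halving the error from 1% to 0.5% starting at N_0 = 10^11 histories would require roughly 4 x 10^11 histories.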
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob; Frumkin, Michael; Biegel, Bryan A. (Technical Monitor)
2002-01-01
We provide a paper-and-pencil specification of a benchmark suite for computational grids. It is based on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks (NPB) and is called the NAS Grid Benchmarks (NGB). NGB problems are presented as data flow graphs encapsulating an instance of a slightly modified NPB task in each graph node, which communicates with other nodes by sending/receiving initialization data. Like NPB, NGB specifies several different classes (problem sizes). In this report we describe classes S, W, and A, and provide verification values for each. The implementor has the freedom to choose any language, grid environment, security model, fault tolerance/error correction mechanism, etc., as long as the resulting implementation passes the verification test and reports the turnaround time of the benchmark.
A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator
Engelmann, Christian; Naughton, III, Thomas J.
2016-03-22
Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures and the performance impact of different HPC architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The overhead introduced by a simulation tool is an important performance and productivity aspect. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management overhead. The results clearly show a significant performance improvement. The simulation overhead for running the NAS Parallel Benchmark suite was reduced from 102% to 0% for the embarrassingly parallel (EP) benchmark and from 1,020% to 238% for the conjugate gradient (CG) benchmark. xSim offers a highly accurate simulation mode for better tracking of injected MPI process failures. Furthermore, with highly accurate simulation, the overhead was reduced from 3,332% to 204% for EP and from 37,511% to 13,808% for CG.
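The overhead figures above are most naturally read against native execution time; assuming the conventional definition (the abstract does not state it explicitly):

    \mathrm{overhead} = \frac{T_{\mathrm{simulated}} - T_{\mathrm{native}}}{T_{\mathrm{native}}} \times 100\%

On that reading, cutting CG overhead from 1,020% to 238% means a simulated CG run went from roughly 11.2x to roughly 3.4x the native run time.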
NASA Technical Reports Server (NTRS)
Davis, G. J.
1994-01-01
One area of research of the Information Sciences Division at NASA Ames Research Center is devoted to the analysis and enhancement of processors and advanced computer architectures, specifically in support of automation and robotic systems. To compare systems' abilities to efficiently process Lisp and Ada, scientists at Ames Research Center have developed a suite of non-parallel benchmarks called ELAPSE. The benchmark suite was designed to test a single computer's efficiency as well as alternate machine comparisons on Lisp and/or Ada languages. ELAPSE tests the efficiency with which a machine can execute the various routines in each environment. The sample routines are based on numeric and symbolic manipulations and include two-dimensional fast Fourier transformations, Cholesky decomposition and substitution, Gaussian elimination, high-level data processing, and symbol-list references. Also included is a routine based on a Bayesian classification program sorting data into optimized groups. The ELAPSE benchmarks are available for any computer with a validated Ada compiler and/or Common Lisp system. Of the 18 routines that comprise ELAPSE, 14 were developed or translated at Ames and are provided within this package; the others are readily available in the literature. The benchmark that requires the most memory is CHOLESKY.ADA. Under VAX/VMS, CHOLESKY.ADA requires 760K of main memory. ELAPSE is available on either two 5.25 inch 360K MS-DOS format diskettes (standard distribution) or a 9-track 1600 BPI ASCII CARD IMAGE format magnetic tape. The contents of the diskettes are compressed using the PKWARE archiving tools. The utility to unarchive the files, PKUNZIP.EXE, is included. The ELAPSE benchmarks were written in 1990. VAX and VMS are trademarks of Digital Equipment Corporation. MS-DOS is a registered trademark of Microsoft Corporation.
Heterogeneous Distributed Computing for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Sunderam, Vaidy S.
1998-01-01
The research supported under this award focuses on heterogeneous distributed computing for high-performance applications, with particular emphasis on computational aerosciences. The overall goal of this project was to investigate issues in, and develop solutions to, the efficient execution of computational aeroscience codes in heterogeneous concurrent computing environments. In particular, we worked in the context of the PVM[1] system and, subsequent to detailed conversion efforts and performance benchmarking, devised novel techniques to increase the efficacy of heterogeneous networked environments for computational aerosciences. Our work has been based upon the NAS Parallel Benchmark suite, but has also recently expanded in scope to include the NAS I/O benchmarks as specified in the NHT-1 document. In this report we summarize our research accomplishments under the auspices of the grant.
Genetic Parallel Programming: design and implementation.
Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong
2006-01-01
This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.
MoMaS reactive transport benchmark using PFLOTRAN
NASA Astrophysics Data System (ADS)
Park, H.
2017-12-01
The MoMaS benchmark was developed to enhance numerical simulation capability for reactive transport modeling in porous media. The benchmark was published in late September of 2009; it is not taken from a real chemical system, but it poses realistic and numerically challenging tests. PFLOTRAN is a state-of-the-art massively parallel subsurface flow and reactive transport code that is being used in multiple nuclear waste repository projects at Sandia National Laboratories, including the Waste Isolation Pilot Plant and Used Fuel Disposition. The MoMaS benchmark has three independent tests with easy, medium, and hard chemical complexity. This paper demonstrates how PFLOTRAN is applied to this benchmark exercise and shows results of the easy benchmark test case, which includes mixing of aqueous components and surface complexation. Surface complexations consist of monodentate and bidentate reactions, which introduce difficulty in defining the selectivity coefficient if the reaction applies to a bulk reference volume. The selectivity coefficient becomes porosity dependent for bidentate reactions in heterogeneous porous media. The benchmark was solved by PFLOTRAN with minimal modification to address this issue, and unit conversions were made properly to suit PFLOTRAN.
Performance effects of irregular communications patterns on massively parallel multiprocessors
NASA Technical Reports Server (NTRS)
Saltz, Joel; Petiton, Serge; Berryman, Harry; Rifkin, Adam
1991-01-01
A detailed study of the performance effects of irregular communications patterns on the CM-2 was conducted. The communications capabilities of the CM-2 were characterized under a variety of controlled conditions. In the process of carrying out the performance evaluation, extensive use was made of a parameterized synthetic mesh. In addition, timings with unstructured meshes generated for aerodynamic codes and a set of sparse matrices with banded patterns of non-zeroes were performed. This benchmarking suite stresses the communications capabilities of the CM-2 in a range of different ways. Benchmark results demonstrate that it is possible to make effective use of much of the massive concurrency available in the communications network.
Benchmarking and performance analysis of the CM-2. [SIMD computer
NASA Technical Reports Server (NTRS)
Myers, David W.; Adams, George B., II
1988-01-01
A suite of benchmarking routines testing communication, basic arithmetic operations, and selected kernel algorithms written in LISP and PARIS was developed for the CM-2. Experiment runs are automated via a software framework that sequences individual tests, allowing for unattended overnight operation. Multiple measurements are made and treated statistically to generate well-characterized results from the noisy values given by cm:time. The results obtained provide a comparison with similar, but less extensive, testing done on a CM-1. Tests were chosen to aid the algorithmist in constructing fast, efficient, and correct code on the CM-2, as well as to gain insight into what performance criteria are needed when evaluating parallel processing machines.
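The measurement discipline described, repeating noisy timings and treating them statistically, can be sketched as below; the kernel, repeat count, and use of C's clock() are placeholder assumptions standing in for the original LISP/PARIS harness and cm:time.

    /* Repeat a timed kernel; report mean and sample standard deviation. */
    #include <stdio.h>
    #include <math.h>
    #include <time.h>

    static double kernel(void) {         /* stand-in benchmark routine */
        double s = 0.0;
        for (int i = 1; i < 1000000; i++) s += 1.0 / i;
        return s;
    }

    int main(void) {
        enum { RUNS = 30 };
        double sum = 0.0, sumsq = 0.0;
        volatile double sink;
        for (int r = 0; r < RUNS; r++) {
            clock_t c0 = clock();
            sink = kernel();
            double t = (double)(clock() - c0) / CLOCKS_PER_SEC;
            sum += t;
            sumsq += t * t;
        }
        (void)sink;
        double mean = sum / RUNS;
        double sd = sqrt((sumsq - RUNS * mean * mean) / (RUNS - 1));
        printf("mean %.6f s, stddev %.6f s over %d runs\n", mean, sd, RUNS);
        return 0;
    }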
NASA Technical Reports Server (NTRS)
Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)
1993-01-01
A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
NASA Technical Reports Server (NTRS)
Bailey, D. H.; Barszcz, E.; Barton, J. T.; Carter, R. L.; Lasinski, T. A.; Browning, D. S.; Dagum, L.; Fatoohi, R. A.; Frederickson, P. O.; Schreiber, R. S.
1991-01-01
A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers in the framework of the NASA Ames Numerical Aerodynamic Simulation (NAS) Program. These consist of five 'parallel kernel' benchmarks and three 'simulated application' benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
Performance Metrics for Monitoring Parallel Program Executions
NASA Technical Reports Server (NTRS)
Sarukkai, Sekkar R.; Gotwais, Jacob K.; Yan, Jerry; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
Existing tools for debugging performance of parallel programs either provide graphical representations of program execution or profiles of program executions. However, for performance debugging tools to be useful, such information has to be augmented with information that highlights the cause of poor program performance. Identifying the cause of poor performance requires not only determining the significance of various performance problems for the execution time of the program, but also considering the effect of interprocessor communication on individual source-level data structures. In this paper, we present a suite of normalized indices which provide a convenient mechanism for focusing on a region of code with poor performance and which highlight the cause of the problem in terms of processors, procedures and data structure interactions. All the indices are generated from trace files augmented with data structure information. Further, we show with the help of examples from the NAS benchmark suite that the indices help in detecting potential causes of poor performance, based on augmented execution traces obtained by monitoring the program.
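The idea of a normalized index can be made concrete with a toy calculation: divide each procedure's communication time by the whole-program elapsed time so that hotspots are directly comparable. The procedure names and timings below are invented placeholders, not values from the paper.

    /* Toy normalized communication index per procedure. */
    #include <stdio.h>

    int main(void) {
        const char *proc[] = { "exch_qbc", "solve_x", "solve_y" };
        double comp[] = { 1.2, 8.4, 7.9 };  /* compute seconds (invented) */
        double comm[] = { 3.6, 0.7, 0.9 };  /* communication seconds */
        double total = 0.0;
        for (int i = 0; i < 3; i++) total += comp[i] + comm[i];
        for (int i = 0; i < 3; i++)
            printf("%-10s comm index = %.3f\n", proc[i],
                   comm[i] / total);        /* normalized by total time */
        return 0;
    }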
Characterizing Task-Based OpenMP Programs
Muddukrishna, Ananya; Jonsson, Peter A.; Brorsson, Mats
2015-01-01
Programmers struggle to understand performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. Programmers can tune performance faster and understand performance tradeoffs more effectively than existing tools by using our method to characterize task-based performance.
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Ye; Ma, Xiaosong; Liu, Qing Gary
2015-01-01
Parallel application benchmarks are indispensable for evaluating/optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, reconfigure, and often plainly inaccessible due to security or ownership concerns. This work contributes APPRIME, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication-I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPRIME benchmarks. They retain the original applications' performance characteristics, in particular the relative performance across platforms.
Implementation of ADI schemes on MIMD parallel computers
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1993-01-01
In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces, a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPUs) by having many off-the-shelf CPUs work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI algorithm for the three-dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPUs in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
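Each directional sweep of an ADI step reduces to many independent tridiagonal solves along grid lines, which is what makes the choice of domain decomposition consequential. A minimal serial Thomas-algorithm kernel, on a made-up system, shows the per-line work being distributed:

    /* Thomas algorithm: solve a tridiagonal system in O(n).
     * a = sub-diagonal, b = main diagonal, c = super-diagonal, d = rhs;
     * the solution overwrites d. The 5-point system is invented. */
    #include <stdio.h>

    static void thomas(int n, double *a, double *b, double *c, double *d) {
        for (int i = 1; i < n; i++) {          /* forward elimination */
            double m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        d[n - 1] /= b[n - 1];                  /* back substitution */
        for (int i = n - 2; i >= 0; i--)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }

    int main(void) {
        double a[] = {  0, -1, -1, -1, -1 };
        double b[] = {  3,  3,  3,  3,  3 };
        double c[] = { -1, -1, -1, -1,  0 };
        double d[] = {  1,  1,  1,  1,  1 };
        thomas(5, a, b, c, d);
        for (int i = 0; i < 5; i++) printf("u[%d] = %f\n", i, d[i]);
        return 0;
    }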
Analysis of a benchmark suite to evaluate mixed numeric and symbolic processing
NASA Technical Reports Server (NTRS)
Ragharan, Bharathi; Galant, David
1992-01-01
The suite of programs that formed the benchmark for a proposed advanced computer is described and analyzed. The features of the processor and its operating system that are tested by the benchmark are discussed. The computer codes and the supporting data for the analysis are given as appendices.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sukumar, Sreenivas R.; Hong, Seokyong; Lee, Sangkeun
2016-06-01
GraphBench is a benchmark suite for graph pattern mining and graph analysis systems. The benchmark suite is a significant addition, enabling apples-to-apples comparison of graph analysis software (databases, in-memory tools, triple stores, etc.).
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob
2003-01-01
The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the performance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischer, G.A.
2011-07-01
Document available in abstract form only, full text of document follows: The dosimetry from the H. B. Robinson Unit 2 Pressure Vessel Benchmark is analyzed with a suite of Westinghouse-developed codes and data libraries. The radiation transport from the reactor core to the surveillance capsule and ex-vessel locations is performed by RAPTOR-M3G, a parallel deterministic radiation transport code that calculates high-resolution neutron flux information in three dimensions. The cross-section library used in this analysis is the ALPAN library, an Evaluated Nuclear Data File (ENDF)/B-VII.0-based library designed for reactor dosimetry and fluence analysis applications. Dosimetry is evaluated with the industry-standard SNLRML reactor dosimetry cross-section data library. (authors)
PMLB: a large benchmark suite for machine learning evaluation and comparison.
Olson, Randal S; La Cava, William; Orzechowski, Patryk; Urbanowicz, Ryan J; Moore, Jason H
2017-01-01
The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered. This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
INL Results for Phases I and III of the OECD/NEA MHTGR-350 Benchmark
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gerhard Strydom; Javier Ortensi; Sonat Sen
2013-09-01
The Idaho National Laboratory (INL) Very High Temperature Reactor (VHTR) Technology Development Office (TDO) Methods Core Simulation group led the construction of the Organization for Economic Cooperation and Development (OECD) Modular High Temperature Reactor (MHTGR) 350 MW benchmark for comparing and evaluating prismatic VHTR analysis codes. The benchmark is sponsored by the OECD's Nuclear Energy Agency (NEA), and the project will yield a set of reference steady-state, transient, and lattice depletion problems that can be used by the Department of Energy (DOE), the Nuclear Regulatory Commission (NRC), and vendors to assess their code suites. The Methods group is responsible for defining the benchmark specifications, leading the data collection and comparison activities, and chairing the annual technical workshops. This report summarizes the latest INL results for Phase I (steady state) and Phase III (lattice depletion) of the benchmark. The INSTANT, Pronghorn and RattleSnake codes were used for the standalone core neutronics modeling of Exercise 1, and the results obtained from these codes are compared in Section 4. Exercise 2 of Phase I requires the standalone steady-state thermal fluids modeling of the MHTGR-350 design, and the results for the systems code RELAP5-3D are discussed in Section 5. The coupled neutronics and thermal fluids steady-state solution for Exercise 3 is reported in Section 6, utilizing the newly developed Parallel and Highly Innovative Simulation for INL Code System (PHISICS)/RELAP5-3D code suite. Finally, the lattice depletion models and results obtained for Phase III are compared in Section 7. The MHTGR-350 benchmark proved to be a challenging simulation set of problems to model accurately, and even with the simplifications introduced in the benchmark specification this activity is an important step in the code-to-code verification of modern prismatic VHTR codes. A final OECD/NEA comparison report will compare the Phase I and III results of all other international participants in 2014, while the remaining Phase II transient case results will be reported in 2015.
Object-Oriented Implementation of the NAS Parallel Benchmarks using Charm++
NASA Technical Reports Server (NTRS)
Krishnan, Sanjeev; Bhandarkar, Milind; Kale, Laxmikant V.
1996-01-01
This report describes experiences with implementing the NAS Computational Fluid Dynamics benchmarks using a parallel object-oriented language, Charm++. Our main objective in implementing the NAS CFD kernel benchmarks was to develop a code that could be used to easily experiment with different domain decomposition strategies and dynamic load balancing. We also wished to leverage the object-orientation provided by the Charm++ parallel object-oriented language, to develop reusable abstractions that would simplify the process of developing parallel applications. We first describe the Charm++ parallel programming model and the parallel object array abstraction, then go into detail about each of the Scalar Pentadiagonal (SP) and Lower/Upper Triangular (LU) benchmarks, along with performance results. Finally we conclude with an evaluation of the methodology used.
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
Evaluation of the ACEC Benchmark Suite for Real-Time Applications
1990-07-23
The ACEC 1.0 benchmark suite was analyzed with respect to its measuring of Ada real-time features such as tasking, memory management, input/output, scheduling... and delay statement, Chapter 13 features, pragmas, interrupt handling, subprogram overhead, numeric computations, etc. For most of the features that... meant for programming real-time systems. The ACEC benchmarks have been analyzed extensively with respect to their measuring of Ada real-time features
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
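In the same spirit as the directives discussed above, a loop nest annotated for shared-memory parallelism looks like the following; modern OpenMP syntax stands in for the native SGI multiprocessing directives, and the stencil and sizes are invented.

    /* Loop-level parallelism: threads split the outer loop's iterations. */
    #include <omp.h>
    #include <stdio.h>

    #define N 512

    int main(void) {
        static double u[N][N], v[N][N];
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                v[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                + u[i][j-1] + u[i][j+1]);
        printf("max threads: %d\n", omp_get_max_threads());
        return 0;
    }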
Characterization of robotics parallel algorithms and mapping onto a reconfigurable SIMD machine
NASA Technical Reports Server (NTRS)
Lee, C. S. G.; Lin, C. T.
1989-01-01
The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme for mapping these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics, including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirements, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well suited to implementation on a single-instruction-stream multiple-data-stream (SIMD) computer with a reconfigurable interconnection network. The model of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.
Algorithm and Architecture Independent Benchmarking with SEAK
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tallent, Nathan R.; Manzano Franco, Joseph B.; Gawande, Nitin A.
2016-05-23
Many applications of high performance embedded computing are limited by performance or power bottlenecks. We have designed the Suite for Embedded Applications & Kernels (SEAK), a new benchmark suite, (a) to capture these bottlenecks in a way that encourages creative solutions; and (b) to facilitate rigorous, objective, end-user evaluation for their solutions. To avoid biasing solutions toward existing algorithms, SEAK benchmarks use a mission-centric (abstracted from a particular algorithm) and goal-oriented (functional) specification. To encourage solutions that are any combination of software or hardware, we use an end-user black-box evaluation that can capture tradeoffs between performance, power, accuracy, size, and weight. The tradeoffs are especially informative for procurement decisions. We call our benchmarks future proof because each mission-centric interface and evaluation remains useful despite shifting algorithmic preferences. It is challenging to create both concise and precise goal-oriented specifications for mission-centric problems. This paper describes the SEAK benchmark suite and presents an evaluation of sample solutions that highlights power and performance tradeoffs.
The Model Averaging for Dichotomous Response Benchmark Dose (MADr-BMD) Tool
Provides quantal response models, which are also used in the U.S. EPA benchmark dose software suite, and generates a model-averaged dose-response model to produce benchmark dose and benchmark dose lower bound estimates.
Statistical Analysis of NAS Parallel Benchmarks and LINPACK Results
NASA Technical Reports Server (NTRS)
Meuer, Hans-Werner; Simon, Horst D.; Strohmeier, Erich; Lasinski, T. A. (Technical Monitor)
1994-01-01
In the last three years extensive performance data have been reported for parallel machines both based on the NAS Parallel Benchmarks, and on LINPACK. In this study we have used the reported benchmark results and performed a number of statistical experiments using factor, cluster, and regression analyses. In addition to the performance results of LINPACK and the eight NAS parallel benchmarks, we have also included peak performance of the machine, and the LINPACK n and n_1/2 values. Some of the results and observations can be summarized as follows: 1) All benchmarks are strongly correlated with peak performance. 2) LINPACK and EP each have a unique signature. 3) The remaining NPB can be grouped into three groups as follows: (CG and IS), (LU and SP), and (MG, FT, and BT). Hence three (or four with EP) benchmarks are sufficient to characterize the overall NPB performance. Our poster presentation will follow a standard poster format, and will present the data of our statistical analysis in detail.
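The groupings reported rest on correlations between benchmarks' per-machine results. A hedged sketch of the underlying quantity, Pearson correlation, follows; the sample values are invented placeholders, not published NPB data.

    /* Pearson correlation r between two benchmarks across five machines. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a[] = { 1.2, 3.5, 7.9, 15.8, 31.0 };  /* benchmark A rates */
        double b[] = { 1.0, 3.1, 8.2, 14.9, 29.5 };  /* benchmark B rates */
        int n = 5;
        double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
        for (int i = 0; i < n; i++) {
            sa += a[i]; sb += b[i];
            saa += a[i] * a[i]; sbb += b[i] * b[i]; sab += a[i] * b[i];
        }
        double r = (sab - sa * sb / n)
                 / sqrt((saa - sa * sa / n) * (sbb - sb * sb / n));
        printf("r = %.4f\n", r);
        return 0;
    }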
EVA Health and Human Performance Benchmarking Study
NASA Technical Reports Server (NTRS)
Abercromby, A. F.; Norcross, J.; Jarvis, S. L.
2016-01-01
Multiple HRP Risks and Gaps require detailed characterization of human health and performance during exploration extravehicular activity (EVA) tasks; however, a rigorous and comprehensive methodology for characterizing and comparing the health and human performance implications of current and future EVA spacesuit designs does not exist. This study will identify and implement functional tasks and metrics, both objective and subjective, that are relevant to health and human performance, such as metabolic expenditure, suit fit, discomfort, suited postural stability, cognitive performance, and potentially biochemical responses for humans working inside different EVA suits doing functional tasks under the appropriate simulated reduced gravity environments. This study will provide health and human performance benchmark data for humans working in current EVA suits (EMU, Mark III, and Z2) as well as shirtsleeves using a standard set of tasks and metrics with quantified reliability. Results and methodologies developed during this test will provide benchmark data against which future EVA suits, and different suit configurations (e.g., varied pressure, mass, CG) may be reliably compared in subsequent tests. Results will also inform fitness-for-duty standards as well as design requirements and operations concepts for future EVA suits and other exploration systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karl Anderson, Steve Plimpton
2015-01-27
The FireHose Streaming Benchmarks are a suite of stream-processing benchmarks defined to enable comparison of streaming software and hardware, both quantitatively vis-a-vis the rate at which they can process data, and qualitatively by judging the effort involved to implement and run the benchmarks. Each benchmark has two parts. The first is a generator which produces and outputs datums at a high rate in a specific format. The second is an analytic which reads the stream of datums and is required to perform a well-defined calculation on the collection of datums, typically to find anomalous datums that have been created in the stream by the generator. The FireHose suite provides code for the generators, sample code for the analytics (which users are free to re-implement in their own custom frameworks), and a precise definition of each benchmark calculation.
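The generator/analytic split can be pictured with a toy in-process version: the generator emits keyed datums with one key made anomalously frequent, and the analytic flags any key whose running count crosses a threshold. The key distribution, threshold, and anomaly rule are invented here and are not the FireHose specification.

    /* Toy generator + analytic in one process. */
    #include <stdio.h>
    #include <stdlib.h>

    #define KEYS 1000
    #define DATUMS 100000
    #define THRESHOLD 150

    int main(void) {
        static int count[KEYS];
        srand(12345);
        for (int i = 0; i < DATUMS; i++) {
            /* generator: mostly uniform keys, key 42 made "hot" */
            int key = (rand() % 100 < 5) ? 42 : rand() % KEYS;
            /* analytic: flag a key as its count crosses the threshold */
            if (++count[key] == THRESHOLD)
                printf("key %d reached %d occurrences at datum %d\n",
                       key, THRESHOLD, i);
        }
        return 0;
    }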
Implementation and verification of global optimization benchmark problems
NASA Astrophysics Data System (ADS)
Posypkin, Mikhail; Usov, Alexander
2017-12-01
The paper considers the implementation and verification of a test suite containing 150 benchmarks for global deterministic box-constrained optimization. A C++ library for describing standard mathematical expressions was developed for this purpose. The library automates the process of generating the value of a function and its gradient at a given point, and the interval estimates of a function and its gradient on a given box, using a single description. Based on this functionality, we have developed a collection of tests for automatic verification of the proposed benchmarks. The verification has shown that literary sources contain mistakes in the benchmark descriptions. The library and the test suite are available for download and can be used freely.
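One single-description route to both a value and a derivative, in the spirit of the library described above (which also produces interval estimates, omitted here), is forward-mode dual numbers. This is an assumption-laden analogue in C, not the authors' C++ library.

    /* Dual numbers: carry (value, derivative) through an expression. */
    #include <stdio.h>

    typedef struct { double v, d; } dual;

    static dual d_const(double c)     { return (dual){ c, 0.0 }; }
    static dual d_var(double x)       { return (dual){ x, 1.0 }; }
    static dual d_add(dual a, dual b) { return (dual){ a.v + b.v, a.d + b.d }; }
    static dual d_mul(dual a, dual b) { return (dual){ a.v * b.v, a.d * b.v + a.v * b.d }; }

    int main(void) {
        /* f(x) = x*x + 3x at x = 2: expect f = 10, f' = 7 */
        dual x = d_var(2.0);
        dual f = d_add(d_mul(x, x), d_mul(d_const(3.0), x));
        printf("f = %g, f' = %g\n", f.v, f.d);
        return 0;
    }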
DYNA3D/ParaDyn Regression Test Suite Inventory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lin, Jerry I.
2016-09-01
The following table constitutes an initial assessment of feature coverage across the regression test suite used for DYNA3D and ParaDyn. It documents the regression test suite at the time of preliminary release 16.1 in September 2016. The columns of the table represent groupings of functionalities, e.g., material models. Each problem in the test suite is represented by a row in the table. All features exercised by the problem are denoted by a check mark (√) in the corresponding column. The definition of "feature" has not been subdivided to its smallest unit of user input, e.g., algorithmic parameters specific to a particular type of contact surface. This represents a judgment to provide code developers and users a reasonable impression of feature coverage without expanding the width of the table by several multiples. All regression testing is run in parallel, typically with eight processors, except problems involving features only available in serial mode. Many are strictly regression tests acting as a check that the codes continue to produce adequately repeatable results as development unfolds; compilers change and platforms are replaced. A subset of the tests represents true verification problems that have been checked against analytical or other benchmark solutions. Users are welcomed to submit documented problems for inclusion in the test suite, especially if they are heavily exercising, and dependent upon, features that are currently underrepresented.
The Suite for Embedded Applications and Kernels
DOE Office of Scientific and Technical Information (OSTI.GOV)
2016-05-10
Many applications of high performance embedded computing are limited by performance or power bottlenecks. We have designed SEAK, a new benchmark suite, (a) to capture these bottlenecks in a way that encourages creative solutions and (b) to facilitate rigorous, objective, end-user evaluation of those solutions. To avoid biasing solutions toward existing algorithms, SEAK benchmarks use a mission-centric (abstracted from a particular algorithm) and goal-oriented (functional) specification. To encourage solutions that are any combination of software or hardware, we use an end-user black-box evaluation that can capture tradeoffs between performance, power, accuracy, size, and weight. The tradeoffs are especially informative for procurement decisions. We call our benchmarks future proof because each mission-centric interface and evaluation remains useful despite shifting algorithmic preferences. It is challenging to create both concise and precise goal-oriented specifications for mission-centric problems. This paper describes the SEAK benchmark suite and presents an evaluation of sample solutions that highlights power and performance tradeoffs.
New NAS Parallel Benchmarks Results
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)
1997-01-01
NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.
NASA Technical Reports Server (NTRS)
Norcross, Jason; Jarvis, Sarah; Bekdash, Omar; Cupples, Scott; Abercromby, Andrew
2017-01-01
The primary objective of this study is to develop a protocol to reliably characterize human health and performance metrics for individuals working inside various EVA suits under realistic spaceflight conditions. Expected results and methodologies developed during this study will provide the baseline benchmarking data and protocols with which future EVA suits and suit configurations (e.g., varied pressure, mass, center of gravity [CG]) and different test subject populations (e.g., deconditioned crewmembers) may be reliably assessed and compared. Results may also be used, in conjunction with subsequent testing, to inform fitness-for-duty standards, as well as design requirements and operations concepts for future EVA suits and other exploration systems.
NASA Technical Reports Server (NTRS)
Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele
2004-01-01
In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues which affect the performance of multi-level parallel applications.
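For readers unfamiliar with multi-level OpenMP, the sketch below shows the general shape of such nesting using only standard OpenMP calls; the NanosCompiler's thread-group and load-balancing clauses discussed in the paper are extensions beyond this.

```cpp
// A minimal sketch of two-level (nested) OpenMP parallelism. The outer
// level could map to zones of a multi-zone benchmark, the inner level to
// loop-level work within a zone. Standard OpenMP only; no Nanos extensions.
#include <cstdio>
#include <omp.h>

int main() {
    omp_set_max_active_levels(2);              // allow two nested parallel levels
    #pragma omp parallel num_threads(4)        // outer level: one thread per "zone"
    {
        int zone = omp_get_thread_num();
        #pragma omp parallel num_threads(2)    // inner level: loop-level workers
        {
            std::printf("zone %d, inner thread %d\n",
                        zone, omp_get_thread_num());
        }
    }
}
```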
plasmaFoam: An OpenFOAM framework for computational plasma physics and chemistry
NASA Astrophysics Data System (ADS)
Venkattraman, Ayyaswamy; Verma, Abhishek Kumar
2016-09-01
As emphasized in the 2012 Roadmap for low temperature plasmas (LTP), scientific computing has emerged as an essential tool for the investigation and prediction of the fundamental physical and chemical processes associated with these systems. While several in-house and commercial codes exist, with each having its own advantages and disadvantages, a common framework that can be developed by researchers from all over the world will likely accelerate the impact of computational studies on advances in low-temperature plasma physics and chemistry. In this regard, we present a finite volume computational toolbox to perform high-fidelity simulations of LTP systems. This framework, primarily based on the OpenFOAM solver suite, allows us to enhance our understanding of multiscale plasma phenomena by performing massively parallel, three-dimensional simulations on unstructured meshes using well-established high performance computing tools that are widely used in the computational fluid dynamics community. In this talk, we will present preliminary results obtained using the OpenFOAM-based solver suite with benchmark three-dimensional simulations of microplasma devices including both dielectric and plasma regions. We will also discuss the future outlook for the solver suite.
The OpenMP Implementation of NAS Parallel Benchmarks and its Performance
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry
1999-01-01
As the new ccNUMA architecture became popular in recent years, parallel programming with compiler directives on these machines has evolved to accommodate new needs. In this study, we examine the effectiveness of OpenMP directives for parallelizing the NAS Parallel Benchmarks. Implementation details will be discussed and performance will be compared with the MPI implementation. We have demonstrated that OpenMP can achieve very good results for parallelization on a shared memory system, but effective use of memory and cache is very important.
The MCNP6 Analytic Criticality Benchmark Suite
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, Forrest B.
2016-06-16
Analytical benchmarks provide an invaluable tool for verifying computer codes used to simulate neutron transport. Several collections of analytical benchmark problems [1-4] are used routinely in the verification of production Monte Carlo codes such as MCNP® [5,6]. Verification of a computer code is a necessary prerequisite to the more complex validation process. The verification process confirms that a code performs its intended functions correctly. The validation process involves determining the absolute accuracy of code results vs. nature. In typical validations, results are computed for a set of benchmark experiments using a particular methodology (code, cross-section data with uncertainties, and modeling) and compared to the measured results from the set of benchmark experiments. The validation process determines bias, bias uncertainty, and possibly additional margins. Verification is generally performed by the code developers, while validation is generally performed by code users for a particular application space. The VERIFICATION_KEFF suite of criticality problems [1,2] was originally a set of 75 criticality problems found in the literature for which exact analytical solutions are available. Even though the spatial and energy detail is necessarily limited in analytical benchmarks, typically to a few regions or energy groups, the exact solutions obtained can be used to verify that the basic algorithms, mathematics, and methods used in complex production codes perform correctly. The present work has focused on revisiting this benchmark suite. A thorough review of the problems resulted in discarding some of them as not suitable for MCNP benchmarking. For the remaining problems, many of them were reformulated to permit execution in either multigroup mode or in the normal continuous-energy mode for MCNP. Execution of the benchmarks in continuous-energy mode provides a significant advance to MCNP verification methods.
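As a flavor of what "analytical benchmark" means here, consider the simplest criticality problem of all, a one-group infinite homogeneous medium, whose multiplication factor has the exact closed form

```latex
k_\infty = \frac{\nu \,\Sigma_f}{\Sigma_a}
```

with \(\nu\) the mean number of neutrons per fission, \(\Sigma_f\) the macroscopic fission cross section, and \(\Sigma_a\) the macroscopic absorption cross section. The suite's problems are finite-geometry and multi-region generalizations of this idea, each with a known exact eigenvalue; this particular example is ours, not necessarily one of the 75.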
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With the increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of the NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement-based performance of these parallelized benchmarks from four perspectives: efficacy of the parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized versions of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives, but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
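The paper's model is not reproduced here, but the flavor of such models can be captured in one line: with serial time \(T_1\), parallelizable fraction \(f\), \(p\) processors, and a directive-and-locality overhead term \(O(p)\) (our notation, an Amdahl-style sketch rather than the authors' formula), the expected speedup is

```latex
S(p) \;=\; \frac{T_1}{(1-f)\,T_1 \;+\; \dfrac{f\,T_1}{p} \;+\; O(p)}
```

so realizing the ideal gain hinges on keeping \(O(p)\), which includes the architecture-specific data locality costs the authors identify, small relative to the per-processor work.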
MPI, HPF or OpenMP: A Study with the NAS Benchmarks
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Hribar, Michelle; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1999-01-01
Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but the task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of the HPF (High Performance Fortran, based on the data parallel model) and OpenMP (based on the shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to languages like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation, and the pros and cons of the different approaches, will be discussed, along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, the potential of applying some of the techniques to realistic aerospace applications will be presented.
Implementation of the NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2002-01-01
Several features make Java an attractive choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for CFD applications.
Performance and Scalability of the NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan A. (Technical Monitor)
2002-01-01
Several features make Java an attractive choice for scientific applications. In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for scientific applications.
HS06 Benchmark for an ARM Server
NASA Astrophysics Data System (ADS)
Kluth, Stefan
2014-06-01
We benchmarked an ARM Cortex-A9 based server system with a four-core CPU running at 1.1 GHz. The system used Ubuntu 12.04 as operating system and the HEPSPEC 2006 (HS06) benchmarking suite was compiled natively with gcc-4.4 on the system. The benchmark was run for various settings of the relevant gcc compiler options. We did not find significant influence from the compiler options on the benchmark result. The final HS06 benchmark result is 10.4.
NASA Technical Reports Server (NTRS)
Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry
1998-01-01
Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using some parallelization tools and compilers. In this paper, we compare the performance of the hand-written NAS Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools, an interactive computer-aided parallelization tool that generates message passing code; 2) the Portland Group's HPF compiler; and 3) compiler directives with the native FORTRAN77 compiler on the SGI Origin2000.
A performance study of the time-varying cache behavior: a study on APEX, Mantevo, NAS, and PARSEC
Siddique, Nafiul A.; Grubel, Patricia A.; Badawy, Abdel-Hameed A.; ...
2017-09-20
Cache has long been used to minimize the latency of main memory accesses by storing frequently used data near the processor. Processor performance depends on the underlying cache performance. Therefore, significant research has been done to identify the most crucial metrics of cache performance. Although the majority of research focuses on measuring cache hit rates and data movement as the primary cache performance metrics, cache utilization is significantly important. We investigate the application's locality using cache utilization metrics. In addition, we present cache utilization and traditional cache performance metrics as the program progresses, providing detailed insights into the dynamic application behavior of parallel applications from four benchmark suites running on multiple cores. We explore cache utilization for APEX, Mantevo, NAS, and PARSEC, mostly scientific benchmark suites. Our results indicate that 40% of the data bytes in a cache line are accessed at least once before line eviction. Also, on average a byte is accessed two times before the cache line is evicted for these applications. Moreover, we present runtime cache utilization, as well as conventional performance metrics, that illustrate a holistic understanding of cache behavior. To facilitate this research, we built a memory simulator incorporated into the Structural Simulation Toolkit (Rodrigues et al. in SIGMETRICS Perform Eval Rev 38(4):37–42, 2011). Finally, our results suggest that variable cache line size can result in better performance and can also conserve power.
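In the notation we adopt here (not necessarily the authors'), the line-utilization figure quoted above is the fraction of a line's bytes touched during its residency, averaged over all evictions:

```latex
\text{utilization} \;=\; \frac{1}{N_{\text{evict}}} \sum_{j=1}^{N_{\text{evict}}} \frac{|U_j|}{L}
```

where \(U_j\) is the set of distinct bytes accessed in line \(j\) between fill and eviction and \(L\) is the line size in bytes; the reported result corresponds to a value of about 0.4.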
Implementation of NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Schultz, Matthew; Jin, Hao-Qiang; Yan, Jerry
2000-01-01
A number of features make Java an attractive but debatable choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would move Java closer to Fortran in the competition for CFD applications.
Machine characterization and benchmark performance prediction
NASA Technical Reports Server (NTRS)
Saavedra-Barrera, Rafael H.
1988-01-01
From runs of standard benchmarks or benchmark suites, it is not possible to characterize the machine, nor to predict the run time of other benchmarks which have not been run. A new approach to benchmarking and machine characterization is reported. The creation and use of a machine analyzer is described, which measures the performance of a given machine on FORTRAN source language constructs. The machine analyzer yields a set of parameters which characterize the machine and spotlight its strong and weak points. Also described is a program analyzer, which analyzes FORTRAN programs and determines the frequency of execution of each of the same set of source language operations. It is then shown that by combining a machine characterization and a program characterization, we are able to predict with good accuracy the run time of a given benchmark on a given machine. Characterizations are provided for the Cray X-MP/48, Cyber 205, IBM 3090/200, Amdahl 5840, Convex C-1, VAX 8600, VAX 11/785, VAX 11/780, SUN 3/50, and IBM RT-PC/125, and for the following benchmark programs or suites: Los Alamos (BMK8A1), Baskett, Linpack, Livermore Loops, Mandelbrot Set, NAS Kernels, Shell Sort, Smith, Whetstone, and Sieve of Eratosthenes.
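The prediction step combines the two characterizations as, in essence, an inner product: if the program analyzer counts \(n_i\) executions of source-language construct \(i\) and the machine analyzer measures a per-execution time \(t_i\) for that construct on a given machine, the predicted run time is

```latex
T_{\text{predicted}} \;\approx\; \sum_i n_i \, t_i
```

which is why the same program characterization can be reused across machines, and the same machine characterization across programs.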
Benchmarking Ada tasking on tightly coupled multiprocessor architectures
NASA Technical Reports Server (NTRS)
Collard, Philippe; Goforth, Andre; Marquardt, Matthew
1989-01-01
The development of benchmarks and performance measures for parallel Ada tasking is reported with emphasis on the macroscopic behavior of the benchmark across a set of load parameters. The application chosen for the study was the NASREM model for telerobot control, relevant to many NASA missions. The results of the study demonstrate the potential of parallel Ada in accomplishing the task of developing a control system for a system such as the Flight Telerobotic Servicer using the NASREM framework.
NASA Astrophysics Data System (ADS)
Solano-Altamirano, J. M.; Hernández-Pérez, Julio M.
2015-11-01
DensToolKit is a suite of cross-platform, optionally parallelized, programs for analyzing the molecular electron density (ρ) and several fields derived from it. Scalar and vector fields, such as the gradient of the electron density (∇ρ), electron localization function (ELF) and its gradient, localized orbital locator (LOL), region of slow electrons (RoSE), reduced density gradient, localized electrons detector (LED), information entropy, molecular electrostatic potential, kinetic energy densities K and G, among others, can be evaluated on zero-, one-, two-, and three-dimensional grids. The suite includes a program for searching critical points and bond paths of the electron density, under the framework of the Quantum Theory of Atoms in Molecules. DensToolKit also evaluates the momentum space electron density on spatial grids, and the reduced density matrix of order one along lines joining two arbitrary atoms of a molecule. The source code is distributed under the GNU-GPLv3 license, and we release the code with the intent of establishing an open-source collaborative project. The style of DensToolKit's code follows some of the guidelines of an object-oriented program. This allows us to provide the user with a simple means of implementing new scalar or vector fields, provided they are derived from any of the fields already implemented in the code. In this paper, we present some of the most salient features of the programs contained in the suite, some examples of how to run them, and the mathematical definitions of the implemented fields along with hints of how we optimized their evaluation. We benchmarked our suite against both a freely available program and a commercial package. Speed-ups of ~2×, and up to 12×, were obtained using a non-parallel compilation of DensToolKit for the evaluation of fields. DensToolKit takes similar times for finding critical points, compared to the commercial package. Finally, we present some perspectives for the future development and growth of the suite.
Revel8or: Model Driven Capacity Planning Tool Suite
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhu, Liming; Liu, Yan; Bui, Ngoc B.
2007-05-31
Designing complex multi-tier applications that must meet strict performance requirements is a challenging software engineering problem. Ideally, the application architect could derive accurate performance predictions early in the project life-cycle, leveraging initial application design-level models and a description of the target software and hardware platforms. To this end, we have developed a capacity planning tool suite for component-based applications, called Revel8tor. The tool adheres to the model driven development paradigm and supports benchmarking and performance prediction for J2EE, .Net and Web services platforms. The suite is composed of three different tools: MDAPerf, MDABench and DSLBench. MDAPerf allows annotation of design diagrams and derives performance analysis models. MDABench allows a customized benchmark application to be modeled in the UML 2.0 Testing Profile and automatically generates a deployable application, with measurement automatically conducted. DSLBench allows the same benchmark modeling and generation to be conducted using a simple performance engineering Domain Specific Language (DSL) in Microsoft Visual Studio. DSLBench integrates with Visual Studio and reuses its load testing infrastructure. Together, the tool suite can assist capacity planning across platforms in an automated fashion.
Kockmann, Tobias; Trachsel, Christian; Panse, Christian; Wahlander, Asa; Selevsek, Nathalie; Grossmann, Jonas; Wolski, Witold E; Schlapbach, Ralph
2016-08-01
Quantitative mass spectrometry is a rapidly evolving methodology applied in a large number of omics-type research projects. During the past years, new designs of mass spectrometers have been developed and launched as commercial systems, while in parallel new data acquisition schemes and data analysis paradigms have been introduced. Core facilities provide access to such technologies, but also actively support researchers in finding and applying the best-suited analytical approach. In order to build a solid foundation for this decision-making process, core facilities need to constantly compare and benchmark the various approaches. In this article we compare the quantitative accuracy and precision of the current state-of-the-art targeted proteomics approaches, single reaction monitoring (SRM), parallel reaction monitoring (PRM) and data independent acquisition (DIA), across multiple liquid chromatography mass spectrometry (LC-MS) platforms, using a readily available commercial standard sample. All workflows are able to reproducibly generate accurate quantitative data. However, SRM and PRM workflows show higher accuracy and precision compared to DIA approaches, especially when analyzing low concentrated analytes. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Application configuration selection for energy-efficient execution on multicore systems
Wang, Shinan; Luo, Bing; Shi, Weisong; ...
2015-09-21
Balanced performance and energy consumption are incorporated in the design of modern computer systems. Several run-time factors, such as concurrency levels, thread mapping strategies, and dynamic voltage and frequency scaling (DVFS), should be considered in order to achieve optimal energy efficiency for a workload. Selecting appropriate run-time factors, however, is one of the most challenging tasks, because the run-time factors are architecture-specific and workload-specific. While most existing work concentrates on either static analysis of the workload or run-time prediction results, we present a hybrid two-step method that utilizes concurrency levels and DVFS settings to achieve the energy-efficient configuration for a workload. The experimental results, based on a Xeon E5620 server with the NPB and PARSEC benchmark suites, show that the model is able to predict the energy-efficient configuration accurately. On average, an additional 10% EDP (Energy Delay Product) saving is obtained by using run-time DVFS for the entire system. An off-line optimal solution is used to compare with the proposed scheme. Finally, the experimental results show that the average extra EDP saved by the optimal solution is within 5% on selective parallel benchmarks.
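For reference, the energy-delay product used as the figure of merit penalizes both energy \(E\) and execution time \(t\) (equivalently, with average power \(P\)):

```latex
\mathrm{EDP} \;=\; E \cdot t \;=\; P \cdot t^2
```

so, for example, a configuration that halves power while doubling run time leaves energy unchanged but doubles EDP.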
RISC Processors and High Performance Computing
NASA Technical Reports Server (NTRS)
Bailey, David H.; Saini, Subhash; Craw, James M. (Technical Monitor)
1995-01-01
This tutorial will discuss the top five RISC microprocessors and the parallel systems in which they are used. It will provide a unique cross-machine comparison not available elsewhere. The effective performance of these processors will be compared by citing standard benchmarks in the context of real applications. The latest NAS Parallel Benchmarks, both absolute performance and performance per dollar, will be listed. The next generation of the NPB will be described. The tutorial will conclude with a discussion of future directions in the field. Technology Transfer Considerations: All of these computer systems are commercially available internationally. Information about these processors is available in the public domain, mostly from the vendors themselves. The NAS Parallel Benchmarks and their results have been previously approved numerous times for public release, beginning back in 1991.
Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks
NASA Technical Reports Server (NTRS)
Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias;
2006-01-01
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.
Di Tommaso, Paolo; Orobitg, Miquel; Guirado, Fernando; Cores, Fernado; Espinosa, Toni; Notredame, Cedric
2010-08-01
We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-house deployment. T-Coffee is a freeware open source package available from http://www.tcoffee.org/homepage.html
Benchmarking gate-based quantum computers
NASA Astrophysics Data System (ADS)
Michielsen, Kristel; Nocon, Madita; Willsch, Dennis; Jin, Fengping; Lippert, Thomas; De Raedt, Hans
2017-11-01
With the advent of public access to small gate-based quantum processors, it becomes necessary to develop a benchmarking methodology such that independent researchers can validate the operation of these processors. We explore the usefulness of a number of simple quantum circuits as benchmarks for gate-based quantum computing devices and show that circuits performing identity operations are very simple, scalable and sensitive to gate errors and are therefore very well suited for this task. We illustrate the procedure by presenting benchmark results for the IBM Quantum Experience, a cloud-based platform for gate-based quantum computing.
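The logic of the identity-circuit benchmarks can be summarized as follows (our notation, a sketch of the idea rather than the paper's exact circuits): compose a gate block G with its inverse n times, so the ideal circuit is the identity, and use the survival probability of the initial state under the noisy implementation as the error signal:

```latex
(G\,G^{-1})^{n} = I, \qquad
F_n = \left|\langle \psi_0 |\, \widetilde{U}_{2n} \,| \psi_0 \rangle\right|^2
```

where \(\widetilde{U}_{2n}\) denotes the noisy realization of the 2n-block circuit. For an ideal device \(F_n = 1\) for all n; the rate at which \(F_n\) decays with n is sensitive to the accumulated gate errors the benchmarks target.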
PARALLEL HOP: A SCALABLE HALO FINDER FOR MASSIVE COSMOLOGICAL DATA SETS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Skory, Stephen; Turk, Matthew J.; Norman, Michael L.
2010-11-15
Modern N-body cosmological simulations contain billions (10^9) of dark matter particles. These simulations require hundreds to thousands of gigabytes of memory and employ hundreds to tens of thousands of processing cores on many compute nodes. In order to study the distribution of dark matter in a cosmological simulation, the dark matter halos must be identified using a halo finder, which establishes the halo membership of every particle in the simulation. The resources required for halo finding are similar to the requirements for the simulation itself. In particular, simulations have become too extensive to use commonly employed halo finders, such that the computational requirements to identify halos must now be spread across multiple nodes and cores. Here, we present a scalable parallel halo finding method called Parallel HOP for large-scale cosmological simulation data. Based on the halo finder HOP, it utilizes the message passing interface and domain decomposition to distribute the halo finding workload across multiple compute nodes, enabling analysis of much larger data sets than is possible with the strictly serial or previous parallel implementations of HOP. We provide a reference implementation of this method as a part of the toolkit yt, an analysis toolkit for adaptive mesh refinement data that includes complementary analysis modules. Additionally, we discuss a suite of benchmarks that demonstrate that this method scales well up to several hundred tasks and data sets in excess of 2000^3 particles. The Parallel HOP method and our implementation can be readily applied to any kind of N-body simulation data and are therefore widely applicable.
A note on bound constraints handling for the IEEE CEC'05 benchmark function suite.
Liao, Tianjun; Molina, Daniel; de Oca, Marco A Montes; Stützle, Thomas
2014-01-01
The benchmark functions and some of the algorithms proposed for the special session on real parameter optimization of the 2005 IEEE Congress on Evolutionary Computation (CEC'05) have played and still play an important role in the assessment of the state of the art in continuous optimization. In this article, we show that if bound constraints are not enforced for the final reported solutions, state-of-the-art algorithms produce infeasible best candidate solutions for the majority of functions of the IEEE CEC'05 benchmark function suite. This occurs even though the optima of the CEC'05 functions are within the specified bounds. This phenomenon has important implications on algorithm comparisons, and therefore on algorithm designs. This article's goal is to draw the attention of the community to the fact that some authors might have drawn wrong conclusions from experiments using the CEC'05 problems.
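Enforcing the bounds on reported solutions is inexpensive; a minimal sketch of one repair step (generic names, not tied to any particular CEC'05 implementation) is shown below.

```cpp
// Clamp a candidate solution into its [lower, upper] box constraints so
// that the reported best-so-far point is always feasible. Clamping is one
// of several repair strategies; the article discusses enforcement generally.
#include <algorithm>
#include <vector>

void enforce_bounds(std::vector<double>& x,
                    const std::vector<double>& lower,
                    const std::vector<double>& upper) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = std::clamp(x[i], lower[i], upper[i]);
}
```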
Processor Emulator with Benchmark Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lloyd, G. Scott; Pearce, Roger; Gokhale, Maya
2015-11-13
A processor emulator and a suite of benchmark applications have been developed to assist in characterizing the performance of data-centric workloads on current and future computer architectures. Some of the applications have been collected from other open source projects. For more details on the emulator and an example of its usage, see reference [1].
Benefits of e-Learning Benchmarks: Australian Case Studies
ERIC Educational Resources Information Center
Choy, Sarojni
2007-01-01
In 2004 the Australian Flexible Learning Framework developed a suite of quantitative and qualitative indicators on the uptake, use and impact of e-learning in the Vocational Education and Training (VET) sector. These indicators were used to design items for a survey to gather quantitative data for benchmarking. A series of four surveys gathered…
Benchmark Lisp And Ada Programs
NASA Technical Reports Server (NTRS)
Davis, Gloria; Galant, David; Lim, Raymond; Stutz, John; Gibson, J.; Raghavan, B.; Cheesema, P.; Taylor, W.
1992-01-01
Suite of nonparallel benchmark programs, ELAPSE, designed for three tests: comparing efficiency of computer processing via Lisp vs. Ada; comparing efficiencies of several computers processing via Lisp; or comparing several computers processing via Ada. Tests the efficiency with which a computer executes routines in each language. Available for any computer equipped with a validated Ada compiler and/or Common Lisp system.
Evaluation of Graph Pattern Matching Workloads in Graph Analysis Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Seokyong; Lee, Sangkeun; Lim, Seung-Hwan
2016-01-01
Graph analysis has emerged as a powerful method for data scientists to represent, integrate, query, and explore heterogeneous data sources. As a result, graph data management and mining became a popular area of research, and led to the development of a plethora of systems in recent years. Unfortunately, the number of emerging graph analysis systems and the wide range of applications, coupled with a lack of apples-to-apples comparisons, make it difficult to understand the trade-offs between different systems and the graph operations for which they are designed. A fair comparison of these systems is a challenging task for the following reasons: multiple data models, non-standardized serialization formats, various query interfaces to users, and the diverse environments they operate in. To address these key challenges, in this paper we present a new benchmark suite by extending the Lehigh University Benchmark (LUBM) to cover the most common capabilities of various graph analysis systems. We provide the design process of the benchmark, which generalizes the workflow for data scientists to conduct the desired graph analysis on different graph analysis systems. Equipped with this extended benchmark suite, we present a performance comparison for nine subgraph pattern retrieval operations over six graph analysis systems, namely NetworkX, Neo4j, Jena, Titan, GraphX, and uRiKA. Through the proposed benchmark suite, this study reveals both quantitative and qualitative findings in (1) implications of loading data into each system; (2) challenges in describing graph patterns for each query interface; and (3) different sensitivity of each system to query selectivity. We envision that this study will pave the way for: (i) data scientists to select suitable graph analysis systems, and (ii) data management system designers to advance graph analysis systems.
Using benchmarks for radiation testing of microprocessors and FPGAs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Quinn, Heather; Robinson, William H.; Rech, Paolo
Performance benchmarks have been used over the years to compare different systems. These benchmarks can be useful for researchers trying to determine how changes to the technology, architecture, or compiler affect the system's performance. No such standard exists for systems deployed into high radiation environments, making it difficult to assess whether changes in the fabrication process, circuitry, architecture, or software affect reliability or radiation sensitivity. In this paper, we propose a benchmark suite for high-reliability systems that is designed for field-programmable gate arrays and microprocessors. As a result, we describe the development process and report neutron test data for the hardware and software benchmarks.
High performance cellular level agent-based simulation with FLAME for the GPU.
Richmond, Paul; Walker, Dawn; Coakley, Simon; Romano, Daniela
2010-05-01
Driven by the availability of experimental data and ability to simulate a biological scale which is of immediate interest, the cellular scale is fast emerging as an ideal candidate for middle-out modelling. As with 'bottom-up' simulation approaches, cellular level simulations demand a high degree of computational power, which in large-scale simulations can only be achieved through parallel computing. The flexible large-scale agent modelling environment (FLAME) is a template driven framework for agent-based modelling (ABM) on parallel architectures ideally suited to the simulation of cellular systems. It is available for both high performance computing clusters (www.flame.ac.uk) and GPU hardware (www.flamegpu.com) and uses a formal specification technique that acts as a universal modelling format. This not only creates an abstraction from the underlying hardware architectures, but avoids the steep learning curve associated with programming them. In benchmarking tests and simulations of advanced cellular systems, FLAME GPU has reported massive improvement in performance over more traditional ABM frameworks. This allows the time spent in the development and testing stages of modelling to be drastically reduced and creates the possibility of real-time visualisation for simple visual face-validation.
2017-04-13
Applications ported to OmpSs included a basic image-processing algorithm, a mini application representative of an ocean modelling code, a parallel benchmark, and a communication-avoiding version of the QR algorithm. Several improvements to the OmpSs model were accomplished, including reduced data movement and a port of the dynamic load balancing library to OmpSs. Finally, several updates to the tools infrastructure were accomplished.
NASA Astrophysics Data System (ADS)
Moon, Hongsik
What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research, and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization, and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared on performance using benchmark software, and the metric was FLoating-point Operations Per Second (FLOPS), which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to the type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS, and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the changing computer hardware platforms in order to provide fast, accurate and efficient solutions to large, complex electromagnetic problems. The research in this dissertation proves that the performance of parallel code is intimately related to the configuration of the computer hardware and can be maximized for different hardware platforms. To benchmark and optimize the performance of parallel CEM software, a variety of large, complex projects are created and executed on a variety of computer platforms. The computer platforms used in this research are detailed in this dissertation. The projects run as benchmarks are also described in detail and results are presented. The parameters that affect parallel CEM software on High Performance Computing Clusters (HPCC) are investigated. This research demonstrates methods to maximize the performance of parallel CEM software code.
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob; Biegel, Bryan A. (Technical Monitor)
2002-01-01
We describe a new problem size, called Class D, for the NAS Parallel Benchmarks (NPB), whose MPI source code implementation is being released as NPB 2.4. A brief rationale is given for how the new class is derived. We also describe the modifications made to the MPI (Message Passing Interface) implementation to allow the new class to be run on systems with 32-bit integers, and with moderate amounts of memory. Finally, we give the verification values for the new problem size.
Yu, Dongjun; Wu, Xiaowei; Shen, Hongbin; Yang, Jian; Tang, Zhenmin; Qi, Yong; Yang, Jingyu
2012-12-01
Membrane proteins are encoded by ~30% of the genome and play important roles in living organisms. Previous studies have revealed that membrane proteins' structures and functions show obvious cell organelle-specific properties. Hence, it is highly desirable to predict a membrane protein's subcellular location from the primary sequence, considering the extreme difficulty of membrane protein wet-lab studies. Although many models have been developed for predicting protein subcellular locations, only a few are specific to membrane proteins. Existing prediction approaches were constructed based on statistical machine learning algorithms with serial combination of multi-view features, i.e., different feature vectors are simply concatenated to form a super feature vector. However, such simple combination of features simultaneously increases the information redundancy, which can in turn deteriorate the final prediction accuracy. That is why prediction success rates in the serial super space were often found to be even lower than those in a single-view space. The purpose of this paper is to investigate a proper method for fusing multiple multi-view protein sequential features for subcellular location prediction. Instead of the serial strategy, we propose a novel parallel framework for fusing multiple membrane protein multi-view attributes that represents protein samples in complex spaces. We also propose generalized principal component analysis (GPCA) for feature reduction in the complex geometry. All the experimental results, obtained with different machine learning algorithms on benchmark membrane protein subcellular localization datasets, demonstrate that the newly proposed parallel strategy outperforms the traditional serial approach. We also demonstrate the efficacy of the parallel strategy on a soluble protein subcellular localization dataset, indicating that the parallel technique is flexible enough to suit other computational biology problems. The software and datasets are available at: http://www.csbio.sjtu.edu.cn/bioinf/mpsp.
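In the feature-fusion literature, the serial/parallel contrast is typically written as follows (the paper's complex-space representation follows the parallel form; padding the shorter vector with zeros equalizes dimensions):

```latex
z_{\text{serial}} = \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^{m+n},
\qquad
z_{\text{parallel}} = x + \mathrm{i}\,y \in \mathbb{C}^{\max(m,n)}
```

so parallel fusion keeps the dimensionality at \(\max(m,n)\) rather than \(m+n\), which is one route to reducing the redundancy that hurts serial super vectors.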
Testing New Programming Paradigms with NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.
2000-01-01
Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developed that aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years new efforts have been made to define new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of the memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." To test these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, notably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as the lack of support for the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure on the following page. These results were obtained on an SGI Origin2000 (195MHz) with the MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and the PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.
Automatic Thread-Level Parallelization in the Chombo AMR Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Christen, Matthias; Keen, Noel; Ligocki, Terry
2011-05-26
The increasing on-chip parallelism has some substantial implications for HPC applications. Currently, hybrid programming models (typically MPI+OpenMP) are employed for mapping software to the hardware in order to leverage the hardware's architectural features. In this paper, we present an approach that automatically introduces thread-level parallelism into Chombo, a parallel adaptive mesh refinement framework for finite difference type PDE solvers. In Chombo, core algorithms are specified in ChomboFortran, a macro language extension to F77 that is part of the Chombo framework. This domain-specific language forms an already used target language for an automatic migration of the large number of existing algorithms into a hybrid MPI+OpenMP implementation. It also provides access to the auto-tuning methodology that enables tuning certain aspects of an algorithm to hardware characteristics. Performance measurements are presented for a few of the most relevant kernels with respect to a specific application benchmark using this technique, as well as benchmark results for the entire application. The kernel benchmarks show that, using auto-tuning, up to a factor of 11 in performance was gained with 4 threads with respect to the serial reference implementation.
RF Wave Simulation Using the MFEM Open Source FEM Package
NASA Astrophysics Data System (ADS)
Stillerman, J.; Shiraiwa, S.; Bonoli, P. T.; Wright, J. C.; Green, D. L.; Kolev, T.
2016-10-01
A new plasma wave simulation environment based on the finite element method is presented. MFEM, a scalable open-source FEM library, is used as the basis for this capability. MFEM allows for assembling an FEM matrix of arbitrarily high order in a parallel computing environment. A 3D frequency domain RF physics layer was implemented using a python wrapper for MFEM and a cold collisional plasma model was ported. This physics layer allows for defining the plasma RF wave simulation model without user knowledge of the FEM weak-form formulation. A graphical user interface is built on πScope, a python-based scientific workbench, such that a user can build a model definition file interactively. Benchmark cases have been ported to this new environment, with results being consistent with those obtained using COMSOL multiphysics, GENRAY, and TORIC/TORLH spectral solvers. This work is a first step in bringing to bear the sophisticated computational tool suite that MFEM provides (e.g., adaptive mesh refinement, solver suite, element types) to the linear plasma-wave interaction problem, and within more complicated integrated workflows, such as coupling with core spectral solver, or incorporating additional physics such as an RF sheath potential model or kinetic effects. USDoE Awards DE-FC02-99ER54512, DE-FC02-01ER54648.
pcircle - A Suite of Scalable Parallel File System Tools
DOE Office of Scientific and Technical Information (OSTI.GOV)
WANG, FEIYI
2015-10-01
Most software related to file systems is written for conventional local file systems; it is serialized and cannot take advantage of a large-scale parallel file system. The "pcircle" software builds on top of ubiquitous MPI in cluster computing environments and the "work-stealing" pattern to provide a scalable, high-performance suite of file system tools. In particular, it implements parallel data copying and parallel data checksumming, with advanced features such as asynchronous progress reporting, checkpoint and restart, as well as integrity checking.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ganapol, B.D.; Kornreich, D.E.
Because of the requirement of accountability and quality control in the scientific world, a demand for high-quality analytical benchmark calculations has arisen in the neutron transport community. The intent of these benchmarks is to provide a numerical standard to which production neutron transport codes may be compared in order to verify proper operation. The overall investigation as modified in the second year renewal application includes the following three primary tasks. Task 1 on two-dimensional neutron transport is divided into (a) the single medium searchlight problem (SLP) and (b) the two-adjacent half-space SLP. Task 2 on three-dimensional neutron transport covers (a) a point source in arbitrary geometry, (b) the single medium SLP, and (c) the two-adjacent half-space SLP. Task 3 on code verification includes deterministic and probabilistic codes. The primary aim of the proposed investigation was to provide a suite of comprehensive two- and three-dimensional analytical benchmarks for neutron transport theory applications. This objective has been achieved. The suite of benchmarks in infinite media and the three-dimensional SLP are a relatively comprehensive set of one-group benchmarks for isotropically scattering media. Because of time and resource limitations, the extensions of the benchmarks to include multi-group and anisotropic scattering are not included here. Presently, however, enormous advances in the solution for the planar Green's function in an anisotropically scattering medium have been made and will eventually be implemented in the two- and three-dimensional solutions considered under this grant. Of particular note in this work are the numerical results for the three-dimensional SLP, which have never before been presented. The results presented were made possible only because of the tremendous advances in computing power that have occurred during the past decade.
A comparison of five benchmarks
NASA Technical Reports Server (NTRS)
Huss, Janice E.; Pennline, James A.
1987-01-01
Five benchmark programs were obtained and run on the NASA Lewis CRAY X-MP/24. A comparison was made between the programs' codes and between the methods for calculating performance figures. Several multitasking jobs were run to gain experience in how parallel performance is measured.
Toward Scalable Benchmarks for Mass Storage Systems
NASA Technical Reports Server (NTRS)
Miller, Ethan L.
1996-01-01
This paper presents guidelines for the design of a mass storage system benchmark suite, along with preliminary suggestions for programs to be included. The benchmarks will measure both peak and sustained performance of the system as well as predicting both short- and long-term behavior. These benchmarks should be both portable and scalable so they may be used on storage systems from tens of gigabytes to petabytes or more. By developing a standard set of benchmarks that reflect real user workload, we hope to encourage system designers and users to publish performance figures that can be compared with those of other systems. This will allow users to choose the system that best meets their needs and give designers a tool with which they can measure the performance effects of improvements to their systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gylenhaal, J.; Bronevetsky, G.
2007-05-25
CLOMP is the C version of the Livermore OpenMP benchmark deeloped to measure OpenMP overheads and other performance impacts due to threading (like NUMA memory layouts, memory contention, cache effects, etc.) in order to influence future system design. Current best-in-class implementations of OpenMP have overheads at least ten times larger than is required by many of our applications for effective use of OpenMP. This benchmark shows the significant negative performance impact of these relatively large overheads and of other thread effects. The CLOMP benchmark highly configurable to allow a variety of problem sizes and threading effects to be studied andmore » it carefully checks its results to catch many common threading errors. This benchmark is expected to be included as part of the Sequoia Benchmark suite for the Sequoia procurement.« less
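CLOMP's sources are not reproduced here, but the style of measurement can be sketched: time a small loop serially and under an OpenMP directive, and attribute the difference to fork/join and scheduling overhead. The loop sizes and the overhead formula below are illustrative, not CLOMP's:

    // Illustrative reconstruction of an OpenMP-overhead measurement.
    #include <omp.h>
    #include <cstdio>

    int main() {
        const int N = 10000, reps = 1000;
        static double a[N];

        double t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r)
            for (int i = 0; i < N; ++i) a[i] += 1.0;
        double serial = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r) {
            #pragma omp parallel for
            for (int i = 0; i < N; ++i) a[i] += 1.0;
        }
        double parallel = omp_get_wtime() - t0;

        int nt = omp_get_max_threads();
        // If the work scaled perfectly, parallel ~= serial/nt + reps * overhead.
        std::printf("serial %.3fs, parallel %.3fs on %d threads\n", serial, parallel, nt);
        std::printf("apparent overhead per region: %.2f us\n",
                    (parallel - serial / nt) / reps * 1e6);
        std::printf("(checksum %f)\n", a[0]);   // keep the loops from being optimized away
    }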
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
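The NanosCompiler thread-group extensions are nonstandard; a minimal sketch of the same two-level idea using plain nested OpenMP (coarse parallelism across zones, fine parallelism within each zone) could look like this:

    // Two-level nested OpenMP: an outer team over "zones" and an inner
    // worksharing loop inside each zone. Team sizes are illustrative.
    #include <omp.h>
    #include <cstdio>

    int main() {
        omp_set_max_active_levels(2);            // enable one level of nesting
        #pragma omp parallel num_threads(2)      // coarse level: zones
        {
            int zone = omp_get_thread_num();
            #pragma omp parallel for num_threads(4)  // fine level: loops in a zone
            for (int i = 0; i < 8; ++i)
                std::printf("zone %d, inner thread %d, i=%d\n",
                            zone, omp_get_thread_num(), i);
        }
    }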
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
A suite of benchmark and challenge problems for enhanced geothermal systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, Mark; Fu, Pengcheng; McClure, Mark
A diverse suite of numerical simulators is currently being applied to predict or understand the performance of enhanced geothermal systems (EGS). To build confidence and identify critical development needs for these analytical tools, the United States Department of Energy, Geothermal Technologies Office sponsored a Code Comparison Study (GTO-CCS), with participants from universities, industry, and national laboratories. A principal objective for the study was to create a community forum for improvement and verification of numerical simulators for EGS modeling. Each participating team brought unique numerical simulation capabilities to bear on the problems. Two classes of problems were developed during the study: benchmark problems and challenge problems. The benchmark problems were structured to test the ability of the collection of numerical simulators to solve various combinations of coupled thermal, hydrologic, geomechanical, and geochemical processes. This class of problems was strictly defined in terms of properties, driving forces, initial conditions, and boundary conditions. The challenge problems were based on the enhanced geothermal systems research conducted at Fenton Hill, near Los Alamos, New Mexico, between 1974 and 1995. The problems involved two phases of research: stimulation, development, and circulation in two separate reservoirs. The challenge problems had specific questions to be answered via numerical simulation in three topical areas: 1) reservoir creation/stimulation, 2) reactive and passive transport, and 3) thermal recovery. Whereas the benchmark class of problems was designed to test capabilities for modeling coupled processes under strictly specified conditions, the stated objective for the challenge class of problems was to demonstrate what new understanding of the Fenton Hill experiments could be realized via the application of modern numerical simulation tools by recognized expert practitioners. We present the suite of benchmark and challenge problems developed for the GTO-CCS, providing problem descriptions and sample solutions.
PHISICS/RELAP5-3D RESULTS FOR EXERCISES II-1 AND II-2 OF THE OECD/NEA MHTGR-350 BENCHMARK
DOE Office of Scientific and Technical Information (OSTI.GOV)
Strydom, Gerhard
2016-03-01
The Idaho National Laboratory (INL) Advanced Reactor Technologies (ART) High-Temperature Gas-Cooled Reactor (HTGR) Methods group currently leads the Modular High-Temperature Gas-Cooled Reactor (MHTGR) 350 benchmark. The benchmark consists of a set of lattice-depletion, steady-state, and transient problems that can be used by HTGR simulation groups to assess the performance of their code suites. The paper summarizes the results obtained for the first two transient exercises defined for Phase II of the benchmark. The Parallel and Highly Innovative Simulation for INL Code System (PHISICS), coupled with the INL system code RELAP5-3D, was used to generate the results for the Depressurized Conduction Cooldown (DCC) (exercise II-1a) and Pressurized Conduction Cooldown (PCC) (exercise II-2) transients. These exercises require the time-dependent simulation of coupled neutronics and thermal-hydraulics phenomena, and utilize the steady-state solution previously obtained for exercise I-3 of Phase I. This paper also includes a comparison of the benchmark results obtained with a traditional system code “ring” model against a more detailed “block” model that includes kinetics feedback on an individual block level and thermal feedbacks on a triangular sub-mesh. The higher spatial fidelity that can be obtained by the block model is illustrated with comparisons of the maximum fuel temperatures, especially in the case of natural convection conditions that dominate the DCC and PCC events. Differences up to 125 K (or 10%) were observed between the ring and block model predictions of the DCC transient, mostly due to the block model’s capability of tracking individual block decay powers and more detailed helium flow distributions. In general, the block model only required DCC and PCC calculation times twice as long as the ring models, and it therefore seems that the additional development and calculation time required for the block model could be worth the gain in spatial resolution.
Predicting Cost/Performance Trade-Offs for Whitney: A Commodity Computing Cluster
NASA Technical Reports Server (NTRS)
Becker, Jeffrey C.; Nitzberg, Bill; VanderWijngaart, Rob F.; Kutler, Paul (Technical Monitor)
1997-01-01
Recent advances in low-end processor and network technology have made it possible to build a "supercomputer" out of commodity components. We develop simple models of the NAS Parallel Benchmarks version 2 (NPB 2) to explore the cost/performance trade-offs involved in building a balanced parallel computer supporting a scientific workload. We develop closed form expressions detailing the number and size of messages sent by each benchmark. Coupling these with measured single processor performance, network latency, and network bandwidth, our models predict benchmark performance to within 30%. A comparison based on total system cost reveals that current commodity technology (200 MHz Pentium Pros with 100baseT Ethernet) is well balanced for the NPBs up to a total system cost of around $1,000,000.
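The paper's closed-form expressions are benchmark-specific, but their general shape, compute time plus a latency term and a bandwidth term, can be written down directly; the parameter names below are generic stand-ins, not the paper's notation:

    // Generic form of a latency/bandwidth performance model:
    //   T ~= T_compute + (#messages * alpha) + (bytes / beta)
    double predict_seconds(double comp_seconds,      // per-processor compute time
                           double n_messages,        // messages sent per processor
                           double bytes_sent,        // payload bytes per processor
                           double alpha_seconds,     // network latency per message
                           double beta_bytes_per_s)  // network bandwidth
    {
        return comp_seconds + n_messages * alpha_seconds + bytes_sent / beta_bytes_per_s;
    }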
Analysis of 100Mb/s Ethernet for the Whitney Commodity Computing Testbed
NASA Technical Reports Server (NTRS)
Fineberg, Samuel A.; Pedretti, Kevin T.; Kutler, Paul (Technical Monitor)
1997-01-01
We evaluate the performance of a Fast Ethernet network configured with a single large switch, a single hub, and a 4x4 2D torus topology in a testbed cluster of "commodity" Pentium Pro PCs. We also evaluated a mixed network composed of Ethernet hubs and switches. An MPI collective communication benchmark and the NAS Parallel Benchmarks version 2.2 (NPB2) show that the torus network performs best for all sizes that we were able to test (up to 16 nodes). For larger networks the Ethernet switch outperforms the hub, though its performance is far less than peak. The hub/switch combination tests indicate that the NAS Parallel Benchmarks are relatively insensitive to hub densities of less than 7 nodes per hub.
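A generic sketch of an MPI collective-communication timing loop of the kind such a benchmark uses; the message size, repetition count, and the choice of MPI_Bcast are illustrative:

    // Time a repeated broadcast and report the average per-call latency.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        const int n = 1 << 16, reps = 100;
        std::vector<double> buf(n, rank);

        MPI_Barrier(MPI_COMM_WORLD);             // align ranks before timing
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; ++r)
            MPI_Bcast(buf.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double dt = (MPI_Wtime() - t0) / reps;
        if (rank == 0)
            std::printf("broadcast of %zu bytes: %.1f us\n",
                        n * sizeof(double), dt * 1e6);
        MPI_Finalize();
    }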
NASA Astrophysics Data System (ADS)
Favata, Antonino; Micheletti, Andrea; Ryu, Seunghwa; Pugno, Nicola M.
2016-10-01
An analytical benchmark and a simple consistent Mathematica program are proposed for graphene and carbon nanotubes, which may serve to test any molecular dynamics code implemented with REBO potentials. By exploiting the benchmark, we checked results produced by LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) when adopting the second-generation Brenner potential, made evident that this code in its current implementation produces results that are offset from those of the benchmark by a significant amount, and provide evidence of the reason.
Evaluation of CHO Benchmarks on the Arria 10 FPGA using Intel FPGA SDK for OpenCL
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal
The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing systems. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPUs, graphics processing units (GPUs), digital signal processors (DSPs), and field-programmable gate arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow in favor of a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes FPGA-based development more accessible to software users as the need for hybrid computing using CPUs and FPGAs increases. It can also significantly reduce the hardware development time, as users can evaluate different ideas in a high-level language without deep FPGA domain knowledge. Benchmarking an OpenCL-based framework is an effective way to analyze the performance of a system by studying the execution of the benchmark applications. CHO is a suite of benchmark applications that provides support for OpenCL [1]. The authors presented CHO as an OpenCL port of the CHStone benchmark. Using the Altera OpenCL (AOCL) compiler to synthesize the benchmark applications, they listed the resource usage and performance of each kernel that can be successfully synthesized by the compiler. In this report, we evaluate the resource usage and performance of the CHO benchmark applications using the Intel FPGA SDK for OpenCL and a Nallatech 385A FPGA board that features an Arria 10 FPGA device. The focus of the report is to gain a better understanding of the resource usage and performance of the kernel implementations on Arria 10 FPGA devices compared to Stratix V FPGA devices. In addition, we also gain knowledge about the limitations of the current compiler when it fails to synthesize a benchmark application.
High Performance Computing at NASA
NASA Technical Reports Server (NTRS)
Bailey, David H.; Cooper, D. M. (Technical Monitor)
1994-01-01
The speaker will give an overview of high performance computing in the U.S. in general and within NASA in particular, including a description of the recently signed NASA-IBM cooperative agreement. The latest performance figures of various parallel systems on the NAS Parallel Benchmarks will be presented. The speaker was one of the authors of the NAS (Numerical Aeronautical Simulation) Parallel Benchmarks, which are now widely cited in the industry as a measure of sustained performance on realistic high-end scientific applications. It will be shown that significant progress has been made by the highly parallel supercomputer industry during the past year or so, with several new systems, based on high-performance RISC processors, that now deliver superior performance per dollar compared to conventional supercomputers. Various pitfalls in reporting performance will be discussed. The speaker will then conclude by assessing the general state of the high performance computing field.
Parallel Ada benchmarks for the SVMS
NASA Technical Reports Server (NTRS)
Collard, Philippe E.
1990-01-01
The use of the parallel processing paradigm to design and develop faster and more reliable computers appears to clearly mark the future of information processing. NASA started the development of such an architecture: the Spaceborne VHSIC Multi-processor System (SVMS). Ada will be one of the languages used to program the SVMS. One of the unique characteristics of Ada is that it supports parallel processing at the language level through the tasking constructs. It is important for the SVMS project team to assess how efficiently the SVMS architecture will be implemented, as well as how efficiently the Ada environment will be ported to the SVMS. AUTOCLASS II, a Bayesian classifier written in Common Lisp, was selected as one of the benchmarks for SVMS configurations. The purpose of the R and D effort was to provide the SVMS project team with a version of AUTOCLASS II, written in Ada, that would make use of Ada tasking constructs as much as possible so as to constitute a suitable benchmark. Additionally, a set of programs was developed to measure Ada tasking efficiency on parallel architectures and to determine the critical parameters influencing tasking efficiency. All this was designed to provide the SVMS project team with a set of suitable tools in the development of the SVMS architecture.
Selecting a Benchmark Suite to Profile High-Performance Computing (HPC) Machines
2014-11-01
architectures. Machines now contain central processing units (CPUs), graphics processing units (GPUs), and Many Integrated Core (MIC) architectures all... evaluate the feasibility and applicability of a new architecture just released to the market. Researchers are often unsure how available resources will... architectures. Having a suite of programs running on different architectures, such as GPUs, MICs, and CPUs, adds complexity and technical challenges
How to Advance TPC Benchmarks with Dependability Aspects
NASA Astrophysics Data System (ADS)
Almeida, Raquel; Poess, Meikel; Nambiar, Raghunath; Patil, Indira; Vieira, Marco
Transactional systems are the core of the information systems of most organizations. Although there is general acknowledgement that failures in these systems often entail significant impact both on the proceeds and reputation of companies, the benchmarks developed and managed by the Transaction Processing Performance Council (TPC) still maintain their focus on reporting bare performance. Each TPC benchmark has to pass a list of dependability-related tests (to verify ACID properties), but not all benchmarks require measuring their performance. While TPC-E measures the recovery time after some system failures, TPC-H and TPC-C only require functional correctness of such recovery. Consequently, systems used in TPC benchmarks are tuned mostly for performance. In this paper we argue that systems should nowadays be tuned for a more comprehensive suite of dependability tests, and that a dependability metric should be part of TPC benchmark publications. The paper discusses WHY and HOW this can be achieved. Two approaches are introduced and discussed: augmenting each TPC benchmark in a customized way, by extending each specification individually; and pursuing a more unified approach, defining a generic specification that could be adjoined to any TPC benchmark.
On the impact of approximate computation in an analog DeSTIN architecture.
Young, Steven; Lu, Junjie; Holleman, Jeremy; Arel, Itamar
2014-05-01
Deep machine learning (DML) holds the potential to revolutionize machine learning by automating rich feature extraction, which has become the primary bottleneck of human engineering in pattern recognition systems. However, the heavy computational burden renders DML systems implemented on conventional digital processors impractical for large-scale problems. The highly parallel computations required to implement large-scale deep learning systems are well suited to custom hardware. Analog computation has demonstrated power efficiency advantages of multiple orders of magnitude relative to digital systems while performing nonideal computations. In this paper, we investigate typical error sources introduced by analog computational elements and their impact on system-level performance in DeSTIN--a compositional deep learning architecture. These inaccuracies are evaluated on a pattern classification benchmark, clearly demonstrating the robustness of the underlying algorithm to the errors introduced by analog computational elements. A clear understanding of the impacts of nonideal computations is necessary to fully exploit the efficiency of analog circuits.
Computational attributes of the integral form of the equation of transfer
NASA Technical Reports Server (NTRS)
Frankel, J. I.
1991-01-01
Difficulties can arise in radiative and neutron transport calculations when a highly anisotropic scattering phase function is present. In the presence of anisotropy, currently used numerical solutions are based on the integro-differential form of the linearized Boltzmann transport equation. This paper departs from classical thought and presents an alternative numerical approach based on application of the integral form of the transport equation. Use of the integral formalism facilitates the following steps: a reduction in the dimensionality of the system prior to discretization, the use of symbolic manipulation to augment the computational procedure, and the direct determination of key physical quantities which are derivable through the various Legendre moments of the intensity. The approach is developed in the context of radiative heat transfer in a plane-parallel geometry, and results are presented and compared with existing benchmark solutions. Encouraging results are presented to illustrate the potential of the integral formalism for computation. The integral formalism appears to possess several computational attributes which are well suited to radiative and neutron transport calculations.
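For orientation, the integral (Peierls) form in the simplest one-group, isotropically scattering, plane-parallel setting reads as follows. This is the standard textbook form, not necessarily the paper's exact statement; the paper's interest is the anisotropic case, where Legendre moments of the intensity enter:

    % Integral transport equation for the scalar flux phi on a slab of
    % optical thickness tau_0, with scattering ratio c and source S.
    \phi(\tau) \;=\; S(\tau) \;+\; \frac{c}{2}\int_{0}^{\tau_0}
        E_1\!\left(\lvert \tau-\tau'\rvert\right)\,\phi(\tau')\,d\tau',
    \qquad
    E_1(x) \;\equiv\; \int_{1}^{\infty} \frac{e^{-xt}}{t}\,dt .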
Fast segmentation of stained nuclei in terabyte-scale, time resolved 3D microscopy image stacks.
Stegmaier, Johannes; Otte, Jens C; Kobitski, Andrei; Bartschat, Andreas; Garcia, Ariel; Nienhaus, G Ulrich; Strähle, Uwe; Mikut, Ralf
2014-01-01
Automated analysis of multi-dimensional microscopy images has become an integral part of modern research in life science. Most available algorithms that provide sufficient segmentation quality, however, are infeasible for a large amount of data due to their high complexity. In this contribution we present a fast parallelized segmentation method that is especially suited for the extraction of stained nuclei from microscopy images, e.g., of developing zebrafish embryos. The idea is to transform the input image based on gradient and normal directions in the proximity of detected seed points such that it can be handled by straightforward global thresholding like Otsu's method. We evaluate the quality of the obtained segmentation results on a set of real and simulated benchmark images in 2D and 3D and show the algorithm's superior performance compared to other state-of-the-art algorithms. We achieve an up to ten-fold decrease in processing times, allowing us to process large data sets while still providing reasonable segmentation results.
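The global-thresholding step that the transformed images are handed to is Otsu's method; a standard 8-bit histogram implementation (not the authors' code) is short:

    // Otsu's global threshold: choose t maximizing between-class variance.
    #include <array>
    #include <cstdint>
    #include <vector>

    int otsu_threshold(const std::vector<uint8_t>& img) {
        std::array<double, 256> hist{};
        for (uint8_t v : img) hist[v] += 1.0;
        double total = (double)img.size(), sum = 0.0;
        for (int i = 0; i < 256; ++i) sum += i * hist[i];

        double sumB = 0.0, wB = 0.0, best = -1.0;
        int threshold = 0;
        for (int t = 0; t < 256; ++t) {
            wB += hist[t];                       // background weight
            if (wB == 0) continue;
            double wF = total - wB;              // foreground weight
            if (wF == 0) break;
            sumB += t * hist[t];
            double mB = sumB / wB, mF = (sum - sumB) / wF;
            double between = wB * wF * (mB - mF) * (mB - mF);
            if (between > best) { best = between; threshold = t; }
        }
        return threshold;
    }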
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Ediger, David; Jiang, Karl
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
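The kernel being parallelized is Brandes' algorithm; a sequential C++ skeleton is shown below for reference. The paper's contribution is a lock-free parallelization of this kernel (sources distributed across threads, atomic accumulation into the centrality array), which this sketch does not attempt:

    // Sequential Brandes betweenness centrality for an unweighted graph.
    #include <queue>
    #include <stack>
    #include <vector>

    std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
        int n = (int)adj.size();
        std::vector<double> bc(n, 0.0);
        for (int s = 0; s < n; ++s) {
            std::vector<long long> sigma(n, 0);      // shortest-path counts
            std::vector<int> dist(n, -1);
            std::vector<std::vector<int>> pred(n);
            std::vector<double> delta(n, 0.0);
            std::stack<int> order;
            std::queue<int> q;
            sigma[s] = 1; dist[s] = 0; q.push(s);
            while (!q.empty()) {                     // BFS from source s
                int v = q.front(); q.pop(); order.push(v);
                for (int w : adj[v]) {
                    if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                    if (dist[w] == dist[v] + 1) { sigma[w] += sigma[v]; pred[w].push_back(v); }
                }
            }
            while (!order.empty()) {                 // dependency accumulation
                int w = order.top(); order.pop();
                for (int v : pred[w])
                    delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
                if (w != s) bc[w] += delta[w];
            }
        }
        return bc;
    }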
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Ediger, David; Jiang, Karl
2009-05-29
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
Enabling the High Level Synthesis of Data Analytics Accelerators
DOE Office of Scientific and Technical Information (OSTI.GOV)
Minutoli, Marco; Castellana, Vito G.; Tumeo, Antonino
Conventional High Level Synthesis (HLS) tools mainly target compute-intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, and present fine-grained, unpredictable data accesses and irregular, dynamic task parallelism. We discuss an architectural template based around a distributed controller to efficiently exploit thread-level parallelism. We present a memory interface that supports parallel memory subsystems and enables implementing atomic memory operations. We introduce a dynamic task scheduling approach to efficiently execute heavily unbalanced workloads. The templates are validated by synthesizing queries from the Lehigh University Benchmark (LUBM), a well-known SPARQL benchmark.
Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang
2017-10-01
The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability, and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on an Intel Xeon multi-core CPU (using OpenMP and OpenCL) and an NVIDIA Tesla many-core GPU (using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested on the Griewank benchmark function. Comparison results indicate that the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results; however, it also has the most complex source code. The parallel SCE-UA has bright prospects for application to real-world problems.
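The Griewank test function has a simple closed form, f(x) = 1 + sum_i x_i^2/4000 - prod_i cos(x_i/sqrt(i)); a C++ version with an OpenMP loop over a population illustrates the embarrassingly parallel evaluation that the parallel variants exploit (the population-loop framing is an assumption, not the paper's code):

    // Griewank function plus a parallel population-evaluation loop.
    #include <omp.h>
    #include <cmath>
    #include <vector>

    double griewank(const std::vector<double>& x) {
        double s = 0.0, p = 1.0;
        for (size_t i = 0; i < x.size(); ++i) {
            s += x[i] * x[i] / 4000.0;
            p *= std::cos(x[i] / std::sqrt((double)(i + 1)));
        }
        return 1.0 + s - p;                  // global minimum 0 at x = 0
    }

    void evaluate_population(const std::vector<std::vector<double>>& pop,
                             std::vector<double>& fitness) {
        #pragma omp parallel for             // each individual is independent
        for (long i = 0; i < (long)pop.size(); ++i)
            fitness[i] = griewank(pop[i]);
    }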
Development and Application of a Parallel LCAO Cluster Method
NASA Astrophysics Data System (ADS)
Patton, David C.
1997-08-01
CPU-intensive steps in the SCF electronic structure calculations of clusters and molecules with a first-principles LCAO method have been fully parallelized via a message passing paradigm. Identification of the parts of the code that are composed of many independent compute-intensive steps is discussed in detail, as they are the most readily parallelized. Most of the parallelization involves spatially decomposing numerical operations on a mesh. One exception is the solution of Poisson's equation, which relies on distribution of the charge density and multipole methods. The method we use to parallelize this part of the calculation is quite novel and is covered in detail. We present a general method for dynamically load-balancing a parallel calculation and discuss how we use this method in our code. The results of benchmark calculations of the IR and Raman spectra of PAH molecules such as anthracene (C_14H_10) and tetracene (C_18H_12) are presented. These benchmark calculations were performed on an IBM SP2 and a SUN Ultra HPC server with both MPI and PVM. Scalability and speedup for these calculations are analyzed to determine the efficiency of the code. In addition, performance and usage issues for MPI and PVM are presented.
BioPreDyn-bench: a suite of benchmark problems for dynamic modelling in systems biology.
Villaverde, Alejandro F; Henriques, David; Smallbone, Kieran; Bongard, Sophia; Schmid, Joachim; Cicin-Sain, Damjan; Crombach, Anton; Saez-Rodriguez, Julio; Mauch, Klaus; Balsa-Canto, Eva; Mendes, Pedro; Jaeger, Johannes; Banga, Julio R
2015-02-20
Dynamic modelling is one of the cornerstones of systems biology. Many research efforts are currently being invested in the development and exploitation of large-scale kinetic models. The associated problems of parameter estimation (model calibration) and optimal experimental design are particularly challenging. The community has already developed many methods and software packages which aim to facilitate these tasks. However, there is a lack of suitable benchmark problems which allow a fair and systematic evaluation and comparison of these contributions. Here we present BioPreDyn-bench, a set of challenging parameter estimation problems which aspire to serve as reference test cases in this area. This set comprises six problems including medium and large-scale kinetic models of the bacterium E. coli, baker's yeast S. cerevisiae, the vinegar fly D. melanogaster, Chinese Hamster Ovary cells, and a generic signal transduction network. The level of description includes metabolism, transcription, signal transduction, and development. For each problem we provide (i) a basic description and formulation, (ii) implementations ready-to-run in several formats, (iii) computational results obtained with specific solvers, (iv) a basic analysis and interpretation. This suite of benchmark problems can be readily used to evaluate and compare parameter estimation methods. Further, it can also be used to build test problems for sensitivity and identifiability analysis, model reduction and optimal experimental design methods. The suite, including codes and documentation, can be freely downloaded from the BioPreDyn-bench website, https://sites.google.com/site/biopredynbenchmarks/ .
Benchmarking high performance computing architectures with CMS’ skeleton framework
NASA Astrophysics Data System (ADS)
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
2017-10-01
In 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many-core architectures; machines such as Cori Phase 1&2, Theta, Mira. Because of this we have revived the 2012 benchmark to test its performance and conclusions on these new architectures. This talk will discuss the results of this exercise.
2015-06-01
...are positioned on the outer ASW screen to protect an HVU from submarine attacks. This baseline scenario provides a standardized benchmark on current... In the
Rethinking key–value store for parallel I/O optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kougkas, Anthony; Eslami, Hassan; Sun, Xian-He
2015-01-26
Key-value stores are being widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architecture differences and performance characteristics of parallel file systems and key-value stores. We propose using key-value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as the cases with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of these two systems using collected data from our experiments, and we provide a predictive method to identify which system offers better I/O performance given a specific workload. The results show that we can optimize the I/O performance in HPC systems by utilizing key-value stores.
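The paper's predictive method is not reproduced here, but its shape can be sketched: estimate each system's time for a workload from a few measured parameters and pick the smaller. All structure and parameter names below are hypothetical:

    // Hypothetical workload/system cost model for choosing between a
    // key-value store and a parallel file system.
    struct Workload   { double bytes; long ops; double metadata_ops; };
    struct SystemModel { double seconds_per_op, bytes_per_second, seconds_per_md_op; };

    double predict(const Workload& w, const SystemModel& m) {
        return w.ops * m.seconds_per_op
             + w.bytes / m.bytes_per_second
             + w.metadata_ops * m.seconds_per_md_op;
    }

    bool prefer_kv_store(const Workload& w, const SystemModel& kv, const SystemModel& pfs) {
        return predict(w, kv) < predict(w, pfs);   // true: key-value store wins
    }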
Unstructured Adaptive Meshes: Bad for Your Memory?
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Feng, Hui-Yu; VanderWijngaart, Rob
2003-01-01
This viewgraph presentation explores the need for a NASA Advanced Supercomputing (NAS) parallel benchmark for problems with irregular dynamical memory access. This benchmark is important and necessary because: 1) Problems with localized error source benefit from adaptive nonuniform meshes; 2) Certain machines perform poorly on such problems; 3) Parallel implementation may provide further performance improvement but is difficult. Some examples of problems which use irregular dynamical memory access include: 1) Heat transfer problem; 2) Heat source term; 3) Spectral element method; 4) Base functions; 5) Elemental discrete equations; 6) Global discrete equations. Nonconforming Mesh and Mortar Element Method are covered in greater detail in this presentation.
Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Saini, Subhash
1997-01-01
Compilers supporting High Performance Fortran (HPF) features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR), Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/programming model (HPF and MPI) combinations will be compared, based on the latest NAS Parallel Benchmark results, thus providing a cross-machine and cross-model comparison. Specifically, HPF-based NPB results will be compared with MPI-based NPB results to provide perspective on the performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition, we present NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, and SGI Origin2000. We also present sustained performance per dollar for Class B LU, SP and BT benchmarks.
A Programming Model Performance Study Using the NAS Parallel Benchmarks
Shan, Hongzhang; Blagojević, Filip; Min, Seung-Jai; ...
2010-01-01
Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS, to understand their performance and memory usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other ways to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an InfiniBand cluster. Our results show that in general the three programming models exhibit very similar performance characteristics. In a few cases, OpenMP is significantly faster because it explicitly avoids communication. For these particular cases, we were able to re-write the UPC versions and achieve equal performance to OpenMP. Using OpenMP was also the most advantageous in terms of memory usage. We also compare performance differences between the two Cray systems, which have quad-core and hex-core processors. We show that at scale the performance is almost always slower on the hex-core system because of increased contention for network resources.
NAS Grid Benchmarks: A Tool for Grid Space Exploration
NASA Technical Reports Server (NTRS)
Frumkin, Michael; VanderWijngaart, Rob F.; Biegel, Bryan (Technical Monitor)
2001-01-01
We present an approach for benchmarking services provided by computational Grids. It is based on the NAS Parallel Benchmarks (NPB) and is called the NAS Grid Benchmark (NGB) in this paper. We present NGB as a data flow graph encapsulating an instance of an NPB code in each graph node, which communicates with other nodes by sending/receiving initialization data. These nodes may be mapped to the same or different Grid machines. Like NPB, NGB will specify several different classes (problem sizes). NGB also specifies the generic Grid services sufficient for running the benchmark. The implementor has the freedom to choose any specific Grid environment. However, we describe a reference implementation in Java, and present some scenarios for using NGB.
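A toy sketch of the NGB structure described above: nodes wrap benchmark instances, edges carry initialization data, and a node may launch once all of its inputs are complete. The types and the run() hook are illustrative, not the reference Java implementation:

    // Data-flow graph of benchmark nodes, launched in dependency order.
    #include <functional>
    #include <string>
    #include <vector>

    struct GridNode {
        std::string benchmark;        // e.g., "BT.A", "SP.W", "LU.A" (illustrative)
        std::vector<int> inputs;      // indices of upstream nodes
        std::function<void()> run;    // submit to some Grid machine
    };

    void execute(std::vector<GridNode>& dag) {
        std::vector<bool> done(dag.size(), false);
        bool progress = true;
        while (progress) {            // naive topological sweep over the DAG
            progress = false;
            for (size_t i = 0; i < dag.size(); ++i) {
                if (done[i]) continue;
                bool ready = true;
                for (int j : dag[i].inputs) ready = ready && done[j];
                if (ready) { dag[i].run(); done[i] = true; progress = true; }
            }
        }
    }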
Benchmarking high performance computing architectures with CMS’ skeleton framework
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
2017-11-23
Here, in 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many-core architectures; machines such as Cori Phase 1&2, Theta, Mira. Because of this we have revived the 2012 benchmark to test its performance and conclusions on these new architectures. This talk will discuss the results of this exercise.
Subgroup Benchmark Calculations for the Intra-Pellet Nonuniform Temperature Cases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Kang Seog; Jung, Yeon Sang; Liu, Yuxuan
A benchmark suite has been developed by Seoul National University (SNU) for intra-pellet nonuniform temperature distribution cases, based on practical temperature profiles according to the thermal power levels. Though a new subgroup capability for nonuniform temperature distributions was implemented in MPACT, no validation calculation had been performed for the new capability. This study focuses on benchmarking the new capability through a code-to-code comparison. Two continuous-energy Monte Carlo codes, McCARD and CE-KENO, are engaged in obtaining reference solutions, and the MPACT results are compared to those of the SNU nTRACER code, which uses a similar cross section library and subgroup method to obtain self-shielded cross sections.
Nema, Vijay; Pal, Sudhir Kumar
2013-01-01
This study was conducted to find the best-suited freely available software for modelling of proteins, by taking a few sample proteins. The proteins used ranged from small to big in size, with available crystal structures for the purpose of benchmarking. Key players like Phyre2, Swiss-Model, CPHmodels-3.0, Homer, (PS)2, (PS)(2)-V(2), and Modweb were used for the comparison and model generation. The benchmarking process was carried out for four proteins, Icl, InhA, and KatG of Mycobacterium tuberculosis and RpoB of Thermus thermophilus, to identify the most suitable software. Parameters compared during the analysis gave relatively better values for Phyre2 and Swiss-Model. This comparative study showed that Phyre2 and Swiss-Model make good models of small and large proteins compared to the other screened software. The other software was also good but often not very efficient in providing full-length and properly folded structures.
libvdwxc: a library for exchange-correlation functionals in the vdW-DF family
NASA Astrophysics Data System (ADS)
Hjorth Larsen, Ask; Kuisma, Mikael; Löfgren, Joakim; Pouillon, Yann; Erhart, Paul; Hyldgaard, Per
2017-09-01
We present libvdwxc, a general library for evaluating the energy and potential for the family of vdW-DF exchange-correlation functionals. libvdwxc is written in C, provides an efficient implementation of the vdW-DF method, and can be interfaced with various general-purpose DFT codes. Currently, the GPAW and Octopus codes implement interfaces to libvdwxc. The present implementation emphasizes scalability and parallel performance, and thereby enables ab initio calculations of nanometer-scale complexes. The numerical accuracy is benchmarked on the S22 test set, whereas parallel performance is benchmarked on ligand-protected gold nanoparticles (Au_144(SC_11NH_25)_60) up to 9696 atoms.
Test suite for image-based motion estimation of the brain and tongue
NASA Astrophysics Data System (ADS)
Ramsey, Jordan; Prince, Jerry L.; Gomez, Arnold D.
2017-03-01
Noninvasive analysis of motion has important uses as qualitative markers for organ function and to validate biomechanical computer simulations relative to experimental observations. Tagged MRI is considered the gold standard for noninvasive tissue motion estimation in the heart, and this has inspired multiple studies focusing on other organs, including the brain under mild acceleration and the tongue during speech. As with other motion estimation approaches, using tagged MRI to measure 3D motion includes several preprocessing steps that affect the quality and accuracy of estimation. Benchmarks, or test suites, are datasets of known geometries and displacements that act as tools to tune tracking parameters or to compare different motion estimation approaches. Because motion estimation was originally developed to study the heart, existing test suites focus on cardiac motion. However, many fundamental differences exist between the heart and other organs, such that parameter tuning (or other optimization) with respect to a cardiac database may not be appropriate. Therefore, the objective of this research was to design and construct motion benchmarks by adopting an "image synthesis" test suite to study brain deformation due to mild rotational accelerations, and a benchmark to model motion of the tongue during speech. To obtain a realistic representation of mechanical behavior, kinematics were obtained from finite-element (FE) models. These results were combined with an approximation of the acquisition process of tagged MRI (including tag generation, slice thickness, and inconsistent motion repetition). To demonstrate an application of the presented methodology, the effect of motion inconsistency on synthetic measurements of head-brain rotation and deformation was evaluated. The results indicated that acquisition inconsistency is roughly proportional to head rotation estimation error. Furthermore, when evaluating non-rigid deformation, the results suggest that inconsistent motion can yield "ghost" shear strains, which are a function of slice acquisition viability as opposed to a true physical deformation.
Test Suite for Image-Based Motion Estimation of the Brain and Tongue
Ramsey, Jordan; Prince, Jerry L.; Gomez, Arnold D.
2017-01-01
Noninvasive analysis of motion has important uses as qualitative markers for organ function and to validate biomechanical computer simulations relative to experimental observations. Tagged MRI is considered the gold standard for noninvasive tissue motion estimation in the heart, and this has inspired multiple studies focusing on other organs, including the brain under mild acceleration and the tongue during speech. As with other motion estimation approaches, using tagged MRI to measure 3D motion includes several preprocessing steps that affect the quality and accuracy of estimation. Benchmarks, or test suites, are datasets of known geometries and displacements that act as tools to tune tracking parameters or to compare different motion estimation approaches. Because motion estimation was originally developed to study the heart, existing test suites focus on cardiac motion. However, many fundamental differences exist between the heart and other organs, such that parameter tuning (or other optimization) with respect to a cardiac database may not be appropriate. Therefore, the objective of this research was to design and construct motion benchmarks by adopting an “image synthesis” test suite to study brain deformation due to mild rotational accelerations, and a benchmark to model motion of the tongue during speech. To obtain a realistic representation of mechanical behavior, kinematics were obtained from finite-element (FE) models. These results were combined with an approximation of the acquisition process of tagged MRI (including tag generation, slice thickness, and inconsistent motion repetition). To demonstrate an application of the presented methodology, the effect of motion inconsistency on synthetic measurements of head-brain rotation and deformation was evaluated. The results indicated that acquisition inconsistency is roughly proportional to head rotation estimation error. Furthermore, when evaluating non-rigid deformation, the results suggest that inconsistent motion can yield “ghost” shear strains, which are a function of slice acquisition viability as opposed to a true physical deformation. PMID:28781414
NASA Technical Reports Server (NTRS)
Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HPF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR), Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP/2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/programming model (HPF and MPI (Message Passing Interface)) combinations will be compared, based on the latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF-based NPB results will be compared with MPI-based NPB results to provide perspective on the performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition, we present NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz), NEC SX-4/32, SGI/CRAY T3E, and SGI Origin2000.
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Gaeke, Brian R.; Husbands, Parry; Li, Xiaoye S.; Oliker, Leonid; Yelick, Katherine A.; Biegel, Bryan (Technical Monitor)
2002-01-01
The increasing gap between processor and memory performance has led to new architectural models for memory-intensive applications. In this paper, we explore the performance of a set of memory-intensive benchmarks and use them to compare the performance of conventional cache-based microprocessors to a mixed logic and DRAM processor called VIRAM. The benchmarks are based on problem statements, rather than specific implementations, and in each case we explore the fundamental hardware requirements of the problem, as well as alternative algorithms and data structures that can help expose fine-grained parallelism or simplify memory access patterns. The benchmarks are characterized by their memory access patterns, their basic control structures, and the ratio of computation to memory operation.
Benchmarking short sequence mapping tools
2013-01-01
Background The development of next-generation sequencing instruments has led to the generation of millions of short sequences in a single run. The process of aligning these reads to a reference genome is time consuming and demands the development of fast and accurate alignment tools. However, the current proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked while comparing the performance of a newly developed tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison. Results We applied our benchmarking tests on 9 well known mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST) using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be further applied to others. Conclusion The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results. PMID:23758764
Progress on the Multiphysics Capabilities of the Parallel Electromagnetic ACE3P Simulation Suite
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kononenko, Oleksiy
2015-03-26
ACE3P is a 3D parallel simulation suite being developed at SLAC National Accelerator Laboratory. Effectively utilizing supercomputer resources, ACE3P has become a key tool for the coupled electromagnetic, thermal, and mechanical research and design of particle accelerators. Based on the existing finite-element infrastructure, a massively parallel eigensolver has been developed for modal analysis of mechanical structures. It complements the set of multiphysics tools in ACE3P and, in particular, can be used for the comprehensive study of microphonics in accelerating cavities, ensuring the operational reliability of a particle accelerator.
Hybrid Parallelization of Adaptive MHD-Kinetic Module in Multi-Scale Fluid-Kinetic Simulation Suite
Borovikov, Sergey; Heerikhuisen, Jacob; Pogorelov, Nikolai
2013-04-01
The Multi-Scale Fluid-Kinetic Simulation Suite has a computational tool set for solving partially ionized flows. In this paper we focus on recent developments of the kinetic module which solves the Boltzmann equation using the Monte-Carlo method. The module has been recently redesigned to utilize intra-node hybrid parallelization. We describe in detail the redesign process, implementation issues, and modifications made to the code. Finally, we conduct a performance analysis.
Roofline model toolkit: A practical tool for architectural and program analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lo, Yu Jung; Williams, Samuel; Van Straalen, Brian
We present preliminary results of the Roofline Toolkit for multicore, manycore, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented microbenchmarks implemented with the Message Passing Interface (MPI) and OpenMP (used to express thread-level parallelism). These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism, and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory management mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
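The bound that the measured ceilings feed into is the basic Roofline relation, attainable performance = min(peak compute, bandwidth x arithmetic intensity); in code:

    // Basic Roofline bound. Units: GFLOP/s, GB/s, FLOPs per byte.
    #include <algorithm>

    double roofline_gflops(double peak_gflops,
                           double stream_bandwidth_gbs,
                           double arithmetic_intensity) {
        // Below the ridge point the kernel is bandwidth-bound;
        // above it, compute-bound.
        return std::min(peak_gflops, stream_bandwidth_gbs * arithmetic_intensity);
    }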
A call for benchmarking transposable element annotation methods.
Hoen, Douglas R; Hickey, Glenn; Bourque, Guillaume; Casacuberta, Josep; Cordaux, Richard; Feschotte, Cédric; Fiston-Lavier, Anna-Sophie; Hua-Van, Aurélie; Hubley, Robert; Kapusta, Aurélie; Lerat, Emmanuelle; Maumus, Florian; Pollock, David D; Quesneville, Hadi; Smit, Arian; Wheeler, Travis J; Bureau, Thomas E; Blanchette, Mathieu
2015-01-01
DNA derived from transposable elements (TEs) constitutes large parts of the genomes of complex eukaryotes, with major impacts not only on genomic research but also on how organisms evolve and function. Although a variety of methods and tools have been developed to detect and annotate TEs, there are as yet no standard benchmarks; that is, there is no standard way to measure or compare their accuracy. This lack of accuracy assessment calls into question conclusions from a wide range of research that depends explicitly or implicitly on TE annotation. In the absence of standard benchmarks, toolmakers are impeded in improving their tools, annotators cannot properly assess which tools might best suit their needs, and downstream researchers cannot judge how accuracy limitations might impact their studies. We therefore propose that the TE research community create and adopt standard TE annotation benchmarks, and we call for other researchers to join the authors in making this long-overdue effort a success.
Implementation and performance of parallel Prolog interpreter
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wei, S.; Kale, L.V.; Balkrishna, R.
1988-01-01
In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE-OR process model, which exploits both AND and OR parallelism in logic programs. It is machine independent, as it runs on top of the Chare Kernel, a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark programs on parallel machines including shared memory systems (an Alliant FX/8, a Sequent, and a Multimax) and a non-shared memory system (an Intel iPSC/32 hypercube), in addition to its performance on a multiprocessor simulation system.
DE-NE0008277_PROTEUS final technical report 2018
DOE Office of Scientific and Technical Information (OSTI.GOV)
Enqvist, Andreas
This project details re-evaluations of experiments on gas-cooled fast reactor (GCFR) core designs performed in the 1970s at the PROTEUS reactor, and creates a series of International Reactor Physics Experiment Evaluation Project (IRPhEP) benchmarks. Currently there are no gas-cooled fast reactor experiments available in the International Handbook of Evaluated Reactor Physics Benchmark Experiments (IRPhEP Handbook). These experiments are excellent candidates for reanalysis and development of multiple benchmarks because they provide high-quality integral nuclear data relevant to the validation and refinement of thorium, neptunium, uranium, plutonium, iron, and graphite cross sections. It would be cost prohibitive to reproduce such a comprehensive suite of experimental data to support any future GCFR endeavors.
Finding undetected protein associations in cell signaling by belief propagation.
Bailly-Bechet, M; Borgs, C; Braunstein, A; Chayes, J; Dagkessamanskaia, A; François, J-M; Zecchina, R
2011-01-11
External information propagates in the cell mainly through signaling cascades and transcriptional activation, allowing it to react to a wide spectrum of environmental changes. High-throughput experiments identify numerous molecular components of such cascades that may, however, interact through unknown partners. Some of them may be detected using data coming from the integration of a protein-protein interaction network and mRNA expression profiles. This inference problem can be mapped onto the problem of finding appropriate optimal connected subgraphs of a network defined by these datasets. The optimization procedure turns out to be computationally intractable in general. Here we present a new distributed algorithm for this task, inspired by statistical physics, and apply this scheme to alpha factor and drug perturbations data in yeast. We identify the role of the COS8 protein, a member of a gene family of previously unknown function, and validate the results by genetic experiments. The algorithm we present is specially suited for very large datasets, can run in parallel, and can be adapted to other problems in systems biology. On renowned benchmarks it outperforms other algorithms in the field.
Fast Segmentation of Stained Nuclei in Terabyte-Scale, Time Resolved 3D Microscopy Image Stacks
Stegmaier, Johannes; Otte, Jens C.; Kobitski, Andrei; Bartschat, Andreas; Garcia, Ariel; Nienhaus, G. Ulrich; Strähle, Uwe; Mikut, Ralf
2014-01-01
Automated analysis of multi-dimensional microscopy images has become an integral part of modern research in life science. Most available algorithms that provide sufficient segmentation quality, however, are infeasible for a large amount of data due to their high complexity. In this contribution we present a fast parallelized segmentation method that is especially suited for the extraction of stained nuclei from microscopy images, e.g., of developing zebrafish embryos. The idea is to transform the input image based on gradient and normal directions in the proximity of detected seed points such that it can be handled by straightforward global thresholding like Otsu’s method. We evaluate the quality of the obtained segmentation results on a set of real and simulated benchmark images in 2D and 3D and show the algorithm’s superior performance compared to other state-of-the-art algorithms. We achieve an up to ten-fold decrease in processing times, allowing us to process large data sets while still providing reasonable segmentation results. PMID:24587204
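The key step the abstract describes is reducing segmentation to global thresholding after a gradient-based transform. The transform itself is specific to the paper, but the thresholding stage can be illustrated with a self-contained Otsu implementation; the sketch below, in Python with NumPy, runs on synthetic intensities and is only a schematic of that final stage.

    import numpy as np

    def otsu_threshold(img, nbins=256):
        """Return the threshold that maximizes between-class variance (Otsu)."""
        hist, edges = np.histogram(img, bins=nbins)
        p = hist / hist.sum()
        centers = 0.5 * (edges[:-1] + edges[1:])
        w0 = np.cumsum(p)                    # probability of class 0
        m = np.cumsum(p * centers)           # cumulative mean
        mg = m[-1]                           # global mean
        with np.errstate(divide="ignore", invalid="ignore"):
            var_between = (mg * w0 - m) ** 2 / (w0 * (1.0 - w0))
        return centers[np.nanargmax(var_between)]

    rng = np.random.default_rng(0)
    background = rng.normal(50, 10, 10000)   # dark background voxels
    nuclei = rng.normal(150, 10, 2000)       # bright stained nuclei
    img = np.concatenate([background, nuclei])
    t = otsu_threshold(img)
    print("threshold ~ %.1f, foreground fraction = %.3f" % (t, (img > t).mean()))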
Proof of Concept for the Rewrite Rule Machine: Interensemble Studies
1994-02-23
[Figure 1: Concurrent Rewriting of Fibonacci Expressions] ...exploit a problem's parallelism at several levels. We call this property multigrain concurrency; it makes the RRM very well suited for solving not only homogeneous problems, but also complex, locally homogeneous but... interprocessor message passing over a network is not well suited to data parallelism. A key goal of the RRM is to combine the best of these two approaches in a...
NASA Technical Reports Server (NTRS)
Feng, Hui-Yu; VanderWijngaart, Rob; Biswas, Rupak; Biegel, Bryan (Technical Monitor)
2001-01-01
We describe the design of a new method for the measurement of the performance of modern computer systems when solving scientific problems featuring irregular, dynamic memory accesses. The method involves the solution of a stylized heat transfer problem on an unstructured, adaptive grid. A Spectral Element Method (SEM) with an adaptive, nonconforming mesh is selected to discretize the transport equation. The relatively high order of the SEM lowers the fraction of wall clock time spent on inter-processor communication, which eases the load balancing task and allows us to concentrate on the memory accesses. The benchmark is designed to be three-dimensional. Parallelization and load balance issues of a reference implementation will be described in detail in future reports.
Investigating a method of producing "red and dead" galaxies
NASA Astrophysics Data System (ADS)
Skory, Stephen
2010-08-01
In optical wavelengths, galaxies are observed to be either red or blue. The overall color of a galaxy is due to the distribution of the ages of its stellar population. Galaxies with currently active star formation appear blue, while those with no recent star formation at all (greater than about a Gyr) have only old, red stars. This strong bimodality has led to the idea of star formation quenching, and various proposed physical mechanisms. In this dissertation, I attempt to reproduce with Enzo the results of Naab et al. (2007), in which red and dead galaxies are formed using gravitational quenching, rather than with one of the more typical methods of quenching. My initial attempts are unsuccessful, and I explore the reasons why I think they failed. Then using simpler methods better suited to Enzo + AMR, I am successful in producing a galaxy that appears to be similar in color and formation history to those in Naab et al. However, quenching is achieved using unphysically high star formation efficiencies, which is a different mechanism than Naab et al. suggests. Preliminary results of a much higher resolution, follow-on simulation of the above show some possible contradiction with the results of Naab et al. Cold gas is streaming into the galaxy to fuel starbursts, while at a similar epoch the galaxies in Naab et al. have largely already ceased forming stars in the galaxy. On the other hand, the results of the high resolution simulation are qualitatively similar to other works in the literature that show a somewhat different gravitational quenching mechanism than Naab et al. I also discuss my work using halo finders to analyze simulated cosmological data, and my work improving the Enzo/AMR analysis tool "yt". This includes two parallelizations of the halo finder HOP (Eisenstein and Hut, 1998), which allow analysis of very large cosmological datasets on parallel machines. The first version is "yt-HOP," which works well for datasets between about 256³ and 512³ particles, but has memory bottlenecks as the datasets get larger. These bottlenecks inspired the second version, "Parallel HOP," which is a fully parallelized method and implementation of HOP that has worked on datasets with more than 2048³ particles on hundreds of processing cores. Both methods are described in detail, as are the various effects of performance-related runtime options. Additionally, both halo finders are subjected to a full suite of performance benchmarks varying both dataset sizes and computational resources used. I conclude with descriptions of four new tools I added to yt. A Parallel Structure Function Generator allows analysis of two-point functions, such as correlation functions, using memory- and workload-parallelism. A Parallel Merger Tree Generator leverages the parallel halo finders in yt, such as Parallel HOP, to build the merger tree of halos in a cosmological simulation, and outputs the result to a SQLite database for simple and powerful data extraction. A Star Particle Analysis toolkit takes a group of star particles and can output the rate of formation as a function of time, and/or a synthetic Spectral Energy Distribution (S.E.D.) using the Bruzual and Charlot (2003) data tables. Finally, a Halo Mass Function toolkit takes as input a list of halo masses and can output the halo mass function for the halos, as well as an analytical fit for those halos using several previously published fits.
Implementation of BT, SP, LU, and FT of NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Schultz, Matthew; Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry
2000-01-01
A number of Java features make it an attractive but debatable choice for High Performance Computing. We have implemented the benchmarks that operate on a single structured grid (BT, SP, LU and FT) in Java. The performance and scalability of the Java code show that significant improvements in Java compiler technology and in Java thread implementation are necessary for Java to compete with Fortran in HPC applications.
Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.
Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal
2015-06-01
Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
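The quoted runtimes directly imply the speedup figures; a two-line check (with the hours taken from the abstract) makes the arithmetic explicit. Small differences from the quoted 11.8x and 17.6x presumably reflect rounding of the reported hours.

    # Speedups implied by the eFindSite runtimes quoted above (in hours).
    serial, cpu_parallel, cpu_plus_phi = 36.8, 3.1, 2.1
    print("CPU-only speedup: %.1fx" % (serial / cpu_parallel))    # ~11.9x
    print("CPU+Phi speedup:  %.1fx" % (serial / cpu_plus_phi))    # ~17.5x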
Methodology and issues of integral experiments selection for nuclear data validation
NASA Astrophysics Data System (ADS)
Tatiana, Ivanova; Ivanov, Evgeny; Hill, Ian
2017-09-01
Nuclear data validation involves a large suite of Integral Experiments (IEs) for criticality, reactor physics and dosimetry applications [1]. Often benchmarks are taken from international handbooks [2, 3]. Depending on the application, IEs have different degrees of usefulness in validation, and usually the use of a single benchmark is not advised; indeed, it may lead to erroneous interpretation and results [1]. This work aims at quantifying the importance of benchmarks used in application-dependent cross section validation. The approach is based on the well-known Generalized Linear Least Squares Method (GLLSM), extended to establish biases and uncertainties for given cross sections (within a given energy interval). The statistical treatment results in a vector of weighting factors for the integral benchmarks. These factors characterize the value added by a benchmark for nuclear data validation for the given application. The methodology is illustrated by one example, selecting benchmarks for 239Pu cross section validation. The studies were performed in the framework of Subgroup 39 (Methods and approaches to provide feedback from nuclear and covariance data adjustment for improvement of nuclear data files) established at the Working Party on International Nuclear Data Evaluation Cooperation (WPEC) of the Nuclear Science Committee under the Nuclear Energy Agency (NEA/OECD).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Peiyuan; Brown, Timothy; Fullmer, William D.
Five benchmark problems are developed and simulated with the computational fluid dynamics and discrete element model code MFiX. The benchmark problems span dilute and dense regimes, consider statistically homogeneous and inhomogeneous (both clusters and bubbles) particle concentrations and a range of particle and fluid dynamic computational loads. Several variations of the benchmark problems are also discussed to extend the computational phase space to cover granular (particles only), bidisperse and heat transfer cases. A weak scaling analysis is performed for each benchmark problem and, in most cases, the scalability of the code appears reasonable up to approximately 10³ cores. Profiling of the benchmark problems indicates that the most substantial computational time is being spent on particle-particle force calculations, drag force calculations and interpolating between discrete particle and continuum fields. Hardware performance analysis was also carried out, showing significant Level 2 cache miss ratios and a rather low degree of vectorization. These results are intended to serve as a baseline for future developments to the code as well as a preliminary indicator of where to best focus performance optimizations.
A large-scale benchmark of gene prioritization methods.
Guala, Dimitri; Sonnhammer, Erik L L
2017-04-21
In order to maximize the use of results from high-throughput experimental studies, e.g. GWAS, for identification and diagnostics of new disease-associated genes, it is important to have properly analyzed and benchmarked gene prioritization tools. While prospective benchmarks are underpowered to provide statistically significant results in their attempt to differentiate the performance of gene prioritization tools, a strategy for retrospective benchmarking has been missing, and new tools usually only provide internal validations. The Gene Ontology (GO) contains genes clustered around annotation terms. This intrinsic property of GO can be utilized in the construction of robust benchmarks that are objective with respect to the problem domain. We demonstrate how this can be achieved for network-based gene prioritization tools, utilizing the FunCoup network. We use cross-validation and a set of appropriate performance measures to compare state-of-the-art gene prioritization algorithms: three based on network diffusion (NetRank and two implementations of Random Walk with Restart), and MaxLink, which utilizes the network neighborhood. Our benchmark suite provides a systematic and objective way to compare the multitude of available and future gene prioritization tools, enabling researchers to select the best gene prioritization tool for the task at hand, and helping to guide the development of more accurate methods.
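Random Walk with Restart, one of the diffusion approaches compared above, iterates p ← (1 − r)·W·p + r·p0, where W is the column-normalized adjacency matrix and p0 concentrates probability on the seed genes. The sketch below is a minimal NumPy version on a toy five-gene network; the network and seeds are hypothetical, not FunCoup data.

    import numpy as np

    def rwr(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
        """Random Walk with Restart: score nodes by proximity to seed genes."""
        W = adj / adj.sum(axis=0, keepdims=True)        # column-normalize
        p0 = np.zeros(adj.shape[0])
        p0[seeds] = 1.0 / len(seeds)
        p = p0.copy()
        for _ in range(max_iter):
            p_next = (1.0 - restart) * W @ p + restart * p0
            if np.abs(p_next - p).sum() < tol:
                break
            p = p_next
        return p

    # Toy 5-gene network; genes 0 and 1 are the disease-associated seeds.
    A = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    scores = rwr(A, seeds=[0, 1])
    print("genes ranked by score:", np.argsort(-scores))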
NASA Astrophysics Data System (ADS)
Jacobs, Shane Earl
This dissertation presents the concept of a Morphing Upper Torso, an innovative pressure suit design that incorporates robotic elements to enable a resizable, highly mobile and easy to don/doff spacesuit. The torso is modeled as a system of interconnected, pressure-constrained, reduced-DOF, wire-actuated parallel manipulators that enable the dimensions of the suit to be reconfigured to match the wearer. The kinematics, dynamics and control of wire-actuated manipulators are derived and simulated, along with the Jacobian transforms, which relate the total twist vector of the system to the vector of actuator velocities. Tools are developed that allow calculation of the workspace for both single and interconnected reduced-DOF robots of this type, using knowledge of the link lengths. The forward kinematics and statics equations are combined and solved to produce the pose of the platforms along with the link tensions. These tools allow analysis of the full Morphing Upper Torso design, in which the back hatch of a rear-entry torso is interconnected with the waist ring, helmet ring and two scye bearings. Half-scale and full-scale experimental models are used along with analytical models to examine the feasibility of this novel space suit concept. The analytical and experimental results demonstrate that the torso could be expanded to facilitate donning and doffing, and then contracted to match different wearers' body dimensions. Using the system of interconnected parallel manipulators, suit components can be accurately repositioned to different desired configurations. The demonstrated feasibility of the Morphing Upper Torso concept makes it an exciting candidate for inclusion in a future planetary suit architecture.
EVA Suit R and D for Performance Optimization
NASA Technical Reports Server (NTRS)
Cowley, Matthew S.; Harvill, Lauren; Benson, Elizabeth; Rajulu, Sudhakar
2014-01-01
Designing a planetary suit is very complex and often requires difficult trade-offs between performance, cost, mass, and system complexity. To verify that new suit designs meet requirements, full prototypes must be built and tested with human subjects. However, numerous design iterations will occur before the hardware meets those requirements. Traditional draw-prototype-test paradigms for R&D are prohibitively expensive with today's shrinking Government budgets. Personnel at NASA are developing modern simulation techniques that focus on human-centric designs by creating virtual prototype simulations and fully adjustable physical prototypes of suit hardware. During the R&D design phase, these easily modifiable representations of an EVA suit's hard components will allow designers to think creatively and exhaust design possibilities before they build and test working prototypes with human subjects. They allow scientists to comprehensively benchmark current suit capabilities and limitations for existing suit sizes and sizes that do not exist. This is extremely advantageous: it enables comprehensive design down-selections to be made early in the design process, enables the use of human performance as design criteria, and enables designs to target specific populations.
Can Humans Fly? Action Understanding with Multiple Classes of Actors
2015-06-08
recognition using structure from motion point clouds. In European Conference on Computer Vision, 2008. [5] R. Caruana. Multitask learning. Machine Learning... tonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. [12] L. Gorelick, M. Blank
Nadkarni, P M; Miller, P L
1991-01-01
A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations.
Nema, Vijay; Pal, Sudhir Kumar
2013-01-01
Aim: This study was conducted to find the best-suited freely available software for modelling of proteins, by taking a few sample proteins. The proteins used ranged from small to big in size, with available crystal structures for the purpose of benchmarking. Key players like Phyre2, Swiss-Model, CPHmodels-3.0, Homer, (PS)2, (PS)2-V2 and Modweb were used for the comparison and model generation. Results: The benchmarking process was done for four proteins, Icl, InhA, and KatG of Mycobacterium tuberculosis and RpoB of Thermus thermophilus, to identify the most suitable software. Parameters compared during the analysis gave relatively better values for Phyre2 and Swiss-Model. Conclusion: This comparative study showed that Phyre2 and Swiss-Model make good models of small and large proteins as compared to the other screened software. The other software was also good but often not very efficient in providing a full-length and properly folded structure. PMID:24023424
Ó Conchúir, Shane; Barlow, Kyle A; Pache, Roland A; Ollikainen, Noah; Kundert, Kale; O'Meara, Matthew J; Smith, Colin A; Kortemme, Tanja
2015-01-01
The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.
Verification of MCNP6.2 for Nuclear Criticality Safety Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, Forrest B.; Rising, Michael Evan; Alwin, Jennifer Louise
2017-05-10
Several suites of verification/validation benchmark problems were run in early 2017 to verify that the new production release of MCNP6.2 performs correctly for nuclear criticality safety (NCS) applications. MCNP6.2 results for several NCS validation suites were compared to the results from MCNP6.1 [1] and MCNP6.1.1 [2]. MCNP6.1 is the production version of MCNP® released in 2013, and MCNP6.1.1 is the update released in 2014. MCNP6.2 includes all of the standard features for NCS calculations that have been available for the past 15 years, along with new features for sensitivity-uncertainty based methods for NCS validation [3]. Results from the benchmark suites were compared with results from previous verification testing [4-8]. Criticality safety analysts should consider testing MCNP6.2 on their particular problems and validation suites. No further development of MCNP5 is planned. MCNP6.1 is now 4 years old, and MCNP6.1.1 is now 3 years old. In general, released versions of MCNP are supported only for about 5 years, due to resource limitations. All future MCNP improvements, bug fixes, user support, and new capabilities are targeted only to MCNP6.2 and beyond.
Simulation of Benchmark Cases with the Terminal Area Simulation System (TASS)
NASA Technical Reports Server (NTRS)
Ahmad, Nashat N.; Proctor, Fred H.
2011-01-01
The hydrodynamic core of the Terminal Area Simulation System (TASS) is evaluated against different benchmark cases. In the absence of closed form solutions for the equations governing atmospheric flows, the models are usually evaluated against idealized test cases. Over the years, various authors have suggested a suite of these idealized cases which have become standards for testing and evaluating the dynamics and thermodynamics of atmospheric flow models. In this paper, simulations of three such cases are described. In addition, the TASS model is evaluated against a test case that uses an exact solution of the Navier-Stokes equations. The TASS results are compared against previously reported simulations of these benchmark cases in the literature. It is demonstrated that the TASS model is highly accurate, stable and robust.
MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program
NASA Astrophysics Data System (ADS)
Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.
2018-02-01
We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution; however, the efficiency in terms of computing resource usage decreases as the number of processors used in the parallel computation increases.
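The pattern MPI_XSTAR implements, many independent single-program runs distributed over MPI ranks, can be sketched compactly with mpi4py. The sketch below is schematic: the parameter grid and the xstar command line are placeholders (the real code is C++ and manages full XSTAR parameter sets), and echo stands in for the executable so the script runs anywhere.

    from mpi4py import MPI
    import subprocess

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Hypothetical grid: one independent XSTAR run per ionization parameter.
    log_xi_values = [0.5 * i for i in range(16)]

    # Round-robin assignment of independent runs to MPI ranks.
    for i, log_xi in enumerate(log_xi_values):
        if i % size == rank:
            # Placeholder invocation; a real run would call the xstar binary
            # with its full parameter list instead of echoing it.
            subprocess.run(["echo", "xstar", "rlogxi=%.2f" % log_xi], check=True)

    comm.Barrier()   # all runs finish before any post-processing

Launched with, e.g., mpiexec -n 4 python run_grid.py (a hypothetical script name): since each XSTAR run is independent, the achievable speedup is limited mainly by the longest single run, consistent with the efficiency behavior described above.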
The implementation of an aeronautical CFD flow code onto distributed memory parallel systems
NASA Astrophysics Data System (ADS)
Ierotheou, C. S.; Forsey, C. R.; Leatham, M.
2000-04-01
The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations.
An OpenACC-Based Unified Programming Model for Multi-accelerator Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S
2015-01-01
This paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
A conservative approach to parallelizing the Sharks World simulation
NASA Technical Reports Server (NTRS)
Nicol, David M.; Riffe, Scott E.
1990-01-01
Parallelizing a benchmark problem for parallel simulation, the Sharks World, is described. The described solution is conservative, in the sense that no state information is saved and no 'rollbacks' occur. The approach used illustrates both the principal advantage and principal disadvantage of conservative parallel simulation. The advantage is that by exploiting lookahead we found an approach that dramatically improves the serial execution time and also achieves excellent speedups. The disadvantage is that if the model rules are changed in such a way that the lookahead is destroyed, it is difficult to modify the solution to accommodate the changes.
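The lookahead exploited above can be stated in one line: a logical process may safely advance to the minimum, over its incoming channels, of the channel's clock plus that channel's lookahead, because no neighbor can emit an earlier event. A schematic Python helper (names are hypothetical, not from the paper):

    # Conservative synchronization bound: a logical process may safely
    # simulate up to min(channel clock + channel lookahead) over all
    # incoming channels, since no earlier event can arrive.
    def safe_time(channel_clocks, lookaheads):
        return min(t + la for t, la in zip(channel_clocks, lookaheads))

    # Two upstream neighbors at simulation times 10.0 and 12.5 whose model
    # rules guarantee lookaheads of 3.0 and 1.0 time units, respectively.
    print(safe_time([10.0, 12.5], [3.0, 1.0]))   # -> 13.0

If a rule change destroys the lookahead (drives it toward zero), this bound collapses to the slowest neighbor's clock, which is exactly the disadvantage noted above.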
NAVO MSRC Navigator. Spring 2006
2006-01-01
all of these upgrades are complete, the effective computing power of the NAVO MSRC will be essentially tripled, as measured by sustainable ... performance on the HPCMP benchmark suite. All four of these systems will be configured with two gigabytes of memory per processor, IBM’s “Federation” inter
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hornung, Richard D.; Hones, Holger E.
The RAJA Performance Suite is designed to evaluate performance of the RAJA performance portability library on a wide variety of important high performance computing (HPC) algorithmic kernels. These kernels assess compiler optimizations and various parallel programming model backends accessible through RAJA, such as OpenMP, CUDA, etc. The initial version of the suite contains 25 computational kernels, each of which appears in 6 variants: Baseline Sequential, RAJA Sequential, Baseline OpenMP, RAJA OpenMP, Baseline CUDA, RAJA CUDA. All variants of each kernel perform essentially the same mathematical operations and the loop body code for each kernel is identical across all variants. There are a few kernels, such as those that contain reduction operations, that require CUDA-specific coding for their CUDA variants. Actual computer instructions executed, and how they run in parallel, differ depending on the parallel programming model backend used and which optimizations are performed by the compiler used to build the Performance Suite executable. The Suite will be used primarily by RAJA developers to perform regular assessments of RAJA performance across a range of hardware platforms and compilers as RAJA features are being developed. It will also be used by LLNL hardware and software vendor partners for defining requirements for future computing platform procurements and acceptance testing. In particular, the RAJA Performance Suite will be used for compiler acceptance testing of the upcoming CORAL/Sierra machine (initial LLNL delivery expected in late 2017/early 2018) and the CORAL-2 procurement. The Suite will also be used to generate concise source code reproducers of compiler and runtime issues we uncover so that we may provide them to relevant vendors to be fixed.
Parallel tempering for the traveling salesman problem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Percus, Allon; Wang, Richard; Hyman, Jeffrey
We explore the potential of parallel tempering as a combinatorial optimization method, applying it to the traveling salesman problem. We compare simulation results of parallel tempering with a benchmark implementation of simulated annealing, and study how different choices of parameters affect the relative performance of the two methods. We find that a straightforward implementation of parallel tempering can outperform simulated annealing in several crucial respects. When parameters are chosen appropriately, both methods yield close approximation to the actual minimum distance for an instance with 200 nodes. However, parallel tempering yields more consistently accurate results when a series of independent simulations are performed. Our results suggest that parallel tempering might offer a simple but powerful alternative to simulated annealing for combinatorial optimization problems.
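A minimal sketch of the method on the TSP: several replicas of the tour evolve by Metropolis-accepted 2-opt moves at different temperatures, and neighboring replicas periodically swap configurations with the standard exchange probability min(1, exp((β_k − β_{k+1})(E_k − E_{k+1}))). The temperature ladder, sweep count and instance below are illustrative choices, not the paper's parameters.

    import math, random

    def tour_len(tour, d):
        return sum(d[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

    def two_opt_step(tour, d, T, rng):
        """One Metropolis-accepted 2-opt move at temperature T."""
        i, j = sorted(rng.sample(range(len(tour)), 2))
        cand = tour[:i] + tour[i:j][::-1] + tour[j:]
        dE = tour_len(cand, d) - tour_len(tour, d)
        return cand if dE < 0 or rng.random() < math.exp(-dE / T) else tour

    def parallel_tempering(d, temps, sweeps=2000, seed=1):
        rng = random.Random(seed)
        n = len(d)
        replicas = [rng.sample(range(n), n) for _ in temps]
        for _ in range(sweeps):
            replicas = [two_opt_step(t, d, T, rng) for t, T in zip(replicas, temps)]
            k = rng.randrange(len(temps) - 1)        # attempt one swap per sweep
            dE = tour_len(replicas[k], d) - tour_len(replicas[k + 1], d)
            dB = 1.0 / temps[k] - 1.0 / temps[k + 1]
            if rng.random() < math.exp(min(0.0, dB * dE)):
                replicas[k], replicas[k + 1] = replicas[k + 1], replicas[k]
        return min(replicas, key=lambda t: tour_len(t, d))

    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(20)]
    d = [[math.dist(p, q) for q in pts] for p in pts]
    best = parallel_tempering(d, temps=[0.01, 0.05, 0.2, 1.0])
    print("best tour length: %.3f" % tour_len(best, d))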
77 FR 38288 - Ocean Transportation Intermediary License; Applicants
Federal Register 2010, 2011, 2012, 2013, 2014
2012-06-27
...). Application Type: Name Change. Global Atlantic Logistics LLC (OFF), 1901 SW 31st Avenue, Pembroke Park, FL....gov . Bellcom, Inc. (NVO), 503 Commerce Park Drive, Suite E, Marietta, GA 30060. Officers: Cornelius U... License. Benchmark Worldwide Logistics, Inc. dba Star Ocean Lines (NVO & OFF), 24900 South Route 53...
A suite of exercises for verifying dynamic earthquake rupture codes
Harris, Ruth A.; Barall, Michael; Aagaard, Brad T.; Ma, Shuo; Roten, Daniel; Olsen, Kim B.; Duan, Benchun; Liu, Dunyu; Luo, Bin; Bai, Kangchen; Ampuero, Jean-Paul; Kaneko, Yoshihiro; Gabriel, Alice-Agnes; Duru, Kenneth; Ulrich, Thomas; Wollherr, Stephanie; Shi, Zheqiang; Dunham, Eric; Bydlon, Sam; Zhang, Zhenguo; Chen, Xiaofei; Somala, Surendra N.; Pelties, Christian; Tago, Josue; Cruz-Atienza, Victor Manuel; Kozdon, Jeremy; Daub, Eric; Aslam, Khurram; Kase, Yuko; Withers, Kyle; Dalguer, Luis
2018-01-01
We describe a set of benchmark exercises that are designed to test if computer codes that simulate dynamic earthquake rupture are working as intended. These types of computer codes are often used to understand how earthquakes operate, and they produce simulation results that include earthquake size, amounts of fault slip, and the patterns of ground shaking and crustal deformation. The benchmark exercises examine a range of features that scientists incorporate in their dynamic earthquake rupture simulations. These include implementations of simple or complex fault geometry, off‐fault rock response to an earthquake, stress conditions, and a variety of formulations for fault friction. Many of the benchmarks were designed to investigate scientific problems at the forefronts of earthquake physics and strong ground motions research. The exercises are freely available on our website for use by the scientific community.
Assessing and benchmarking multiphoton microscopes for biologists
Corbin, Kaitlin; Pinkard, Henry; Peck, Sebastian; Beemiller, Peter; Krummel, Matthew F.
2017-01-01
Multiphoton microscopy has become a staple tool for tracking cells within tissues and organs due to superior depth of penetration, low excitation volumes, and reduced phototoxicity. Many factors, ranging from laser pulse width to relay optics to detectors and electronics, contribute to the overall ability of these microscopes to excite and detect fluorescence deep within tissues. However, we have found that there are few standard ways described in the literature to distinguish between microscopes or to benchmark existing microscopes to measure their overall quality and efficiency. Here, we discuss some simple parameters and methods that can be used either within a multiphoton facility or by a prospective purchaser to benchmark performance. This can both assist in identifying decay in microscope performance and in choosing features of a scope that are suited to experimental needs. PMID:24974026
NASA Astrophysics Data System (ADS)
Kahler, A. C.; MacFarlane, R. E.; Mosteller, R. D.; Kiedrowski, B. C.; Frankle, S. C.; Chadwick, M. B.; McKnight, R. D.; Lell, R. M.; Palmiotti, G.; Hiruta, H.; Herman, M.; Arcilla, R.; Mughabghab, S. F.; Sublet, J. C.; Trkov, A.; Trumbull, T. H.; Dunn, M.
2011-12-01
The ENDF/B-VII.1 library is the latest revision to the United States' Evaluated Nuclear Data File (ENDF). The ENDF library is currently in its seventh generation, with ENDF/B-VII.0 being released in 2006. This revision expands upon that library, including the addition of new evaluated files (was 393 neutron files previously, now 423 including replacement of elemental vanadium and zinc evaluations with isotopic evaluations) and extension or updating of many existing neutron data files. Complete details are provided in the companion paper [M. B. Chadwick et al., "ENDF/B-VII.1 Nuclear Data for Science and Technology: Cross Sections, Covariances, Fission Product Yields and Decay Data," Nuclear Data Sheets, 112, 2887 (2011)]. This paper focuses on how accurately application libraries may be expected to perform in criticality calculations with these data. Continuous energy cross section libraries, suitable for use with the MCNP Monte Carlo transport code, have been generated and applied to a suite of nearly one thousand critical benchmark assemblies defined in the International Criticality Safety Benchmark Evaluation Project's International Handbook of Evaluated Criticality Safety Benchmark Experiments. This suite covers uranium and plutonium fuel systems in a variety of forms such as metallic, oxide or solution, and under a variety of spectral conditions, including unmoderated (i.e., bare), metal reflected and water or other light element reflected. Assembly eigenvalues that were accurately predicted with ENDF/B-VII.0 cross sections such as unmoderated and uranium reflected 235U and 239Pu assemblies, HEU solution systems and LEU oxide lattice systems that mimic commercial PWR configurations continue to be accurately calculated with ENDF/B-VII.1 cross sections, and deficiencies in predicted eigenvalues for assemblies containing selected materials, including titanium, manganese, cadmium and tungsten are greatly reduced. Improvements are also confirmed for selected actinide reaction rates such as 236U, 238,242Pu and 241,243Am capture in fast systems. Other deficiencies, such as the overprediction of Pu solution system critical eigenvalues and a decreasing trend in calculated eigenvalue for 233U fueled systems as a function of Above-Thermal Fission Fraction remain. The comprehensive nature of this critical benchmark suite and the generally accurate calculated eigenvalues obtained with ENDF/B-VII.1 neutron cross sections support the conclusion that this is the most accurate general purpose ENDF/B cross section library yet released to the technical community.
Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael
2000-01-01
The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress was made in hardware and software technologies, performance of parallel programs with compiler directives has demonstrated large improvement. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based, OpenMP, parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.
Automatic recognition of vector and parallel operations in a higher level language
NASA Technical Reports Server (NTRS)
Schneck, P. B.
1971-01-01
A compiler for recognizing statements of a FORTRAN program which are suited for fast execution on a parallel or pipeline machine such as Illiac-4, Star or ASC is described. The technique employs interval analysis to provide flow information to the vector/parallel recognizer. Where profitable the compiler changes scalar variables to subscripted variables. The output of the compiler is an extension to FORTRAN which shows parallel and vector operations explicitly.
Rubus: A compiler for seamless and extensible parallelism.
Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor
2017-01-01
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called the Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages, which were designed to work with machines having single-core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of it in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent on code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without requiring a programmer's expertise in parallel programming. For five different benchmarks, an average speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores, while for a matrix multiplication benchmark an average execution speedup of 84 times has been achieved on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes
NASA Technical Reports Server (NTRS)
Yan, Jerry; Jin, Haoqiang; Frumkin, Michael; Yan, Jerry (Technical Monitor)
2000-01-01
The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress was made in hardware and software technologies, performance of parallel programs with compiler directives has demonstrated large improvement. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate OpenMP-based parallel programs with nominal user assistance. We outline techniques used in the implementation of the tool and discuss the application of this tool on the NAS Parallel Benchmarks and several computational fluid dynamics codes. This work demonstrates the great potential of using the tool to quickly port parallel programs and also achieve good performance that exceeds some of the commercial tools.
AdiosStMan: Parallelizing Casacore Table Data System using Adaptive IO System
NASA Astrophysics Data System (ADS)
Wang, R.; Harris, C.; Wicenec, A.
2016-07-01
In this paper, we investigate the Casacore Table Data System (CTDS) used in the casacore and CASA libraries, and methods to parallelize it. CTDS provides a storage manager plugin mechanism for third-party developers to design and implement their own CTDS storage managers. Having this in mind, we looked into various storage backend techniques that could enable parallel I/O for CTDS by implementing new storage managers. After carrying out benchmarks showing the excellent parallel I/O throughput of the Adaptive IO System (ADIOS), we implemented an ADIOS-based parallel CTDS storage manager. We then applied the CASA MSTransform frequency split task to verify the ADIOS Storage Manager. We also ran a series of performance tests to examine the I/O throughput in a massively parallel scenario.
Benchmark Dose Software (BMDS) Development and ...
This report is intended to provide an overview of beta version 1.0 of the implementation of a model of repeated measures data referred to as the Toxicodiffusion model. The implementation described here represents the first steps towards integration of the Toxicodiffusion model into the EPA benchmark dose software (BMDS). This version runs from within BMDS 2.0 using an option screen for making model selection, as is done for other models in the BMDS 2.0 suite.
Nair, Pradeep S; John, Eugene B
2007-01-01
Aligning specific sequences against a very large number of other sequences is a central aspect of bioinformatics. With the widespread availability of personal computers in biology laboratories, sequence alignment is now often performed locally. This makes it necessary to analyse the performance of personal computers for sequence aligning bioinformatics benchmarks. In this paper, we analyse the performance of a personal computer for the popular BLAST and FASTA sequence alignment suites. Results indicate that these benchmarks have a large number of recurring operations and use memory operations extensively. It seems that the performance can be improved with a bigger L1-cache.
Development and application of a DNA microarray-based yeast two-hybrid system
Suter, Bernhard; Fontaine, Jean-Fred; Yildirimman, Reha; Raskó, Tamás; Schaefer, Martin H.; Rasche, Axel; Porras, Pablo; Vázquez-Álvarez, Blanca M.; Russ, Jenny; Rau, Kirstin; Foulle, Raphaele; Zenkner, Martina; Saar, Kathrin; Herwig, Ralf; Andrade-Navarro, Miguel A.; Wanker, Erich E.
2013-01-01
The yeast two-hybrid (Y2H) system is the most widely applied methodology for systematic protein–protein interaction (PPI) screening and the generation of comprehensive interaction networks. We developed a novel Y2H interaction screening procedure using DNA microarrays for high-throughput quantitative PPI detection. Applying a global pooling and selection scheme to a large collection of human open reading frames, proof-of-principle Y2H interaction screens were performed for the human neurodegenerative disease proteins huntingtin and ataxin-1. Using systematic controls for unspecific Y2H results and quantitative benchmarking, we identified and scored a large number of known and novel partner proteins for both huntingtin and ataxin-1. Moreover, we show that this parallelized screening procedure and the global inspection of Y2H interaction data are uniquely suited to define specific PPI patterns and their alteration by disease-causing mutations in huntingtin and ataxin-1. This approach takes advantage of the specificity and flexibility of DNA microarrays and of the existence of solid, related statistical methods for the analysis of DNA microarray data, and allows a quantitative approach toward interaction screens in human and in model organisms. PMID:23275563
Selecting a Relational Database Management System for Library Automation Systems.
ERIC Educational Resources Information Center
Shekhel, Alex; O'Brien, Mike
1989-01-01
Describes the evaluation of four relational database management systems (RDBMSs) (Informix Turbo, Oracle 6.0 TPS, Unify 2000 and Relational Technology's Ingres 5.0) to determine which is best suited for library automation. The evaluation criteria used to develop a benchmark specifically designed to test RDBMSs for libraries are discussed. (CLB)
The ADER-DG method for seismic wave propagation and earthquake rupture dynamics
NASA Astrophysics Data System (ADS)
Pelties, Christian; Gabriel, Alice; Ampuero, Jean-Paul; de la Puente, Josep; Käser, Martin
2013-04-01
We will present the Arbitrary high-order DERivatives Discontinuous Galerkin (ADER-DG) method for solving the combined elastodynamic wave propagation and dynamic rupture problem. The ADER-DG method enables high-order accuracy in space and time while being implemented on unstructured tetrahedral meshes. A tetrahedral element discretization provides rapid and automatized mesh generation as well as geometrical flexibility. Features such as mesh coarsening and local time stepping schemes can be applied to reduce computational efforts without introducing numerical artifacts. The method is well suited for parallelization and large-scale high-performance computing since only directly neighboring elements exchange information via numerical fluxes. The concept of fluxes is a key ingredient of the numerical scheme as it governs the numerical dispersion and diffusion properties and allows the scheme to accommodate boundary conditions, empirical friction laws of dynamic rupture processes, or the combination of different element types and non-conforming mesh transitions. After introducing fault dynamics into the ADER-DG framework, we will demonstrate its specific advantages in benchmarking test scenarios provided by the SCEC/USGS Spontaneous Rupture Code Verification Exercise. An important result of the benchmark is that the ADER-DG method avoids spurious high-frequency contributions in the slip rate spectra and therefore does not require artificial Kelvin-Voigt damping, filtering or other modifications of the produced synthetic seismograms. To demonstrate the capabilities of the proposed scheme we simulate an earthquake scenario, inspired by the 1992 Landers earthquake, that includes branching and curved fault segments. Furthermore, topography is respected in the discretized model to capture the surface waves correctly. The advanced geometrical flexibility combined with an enhanced accuracy will make the ADER-DG method a useful tool to study earthquake dynamics on complex fault systems in realistic rheologies.
Parallelization of an Object-Oriented Unstructured Aeroacoustics Solver
NASA Technical Reports Server (NTRS)
Baggag, Abdelkader; Atkins, Harold; Oezturan, Can; Keyes, David
1999-01-01
A computational aeroacoustics code based on the discontinuous Galerkin method is ported to several parallel platforms using MPI. The discontinuous Galerkin method is a compact high-order method that retains its accuracy and robustness on non-smooth unstructured meshes. In its semi-discrete form, the discontinuous Galerkin method can be combined with explicit time marching methods, making it well suited to time accurate computations. The compact nature of the discontinuous Galerkin method also makes it well suited for distributed memory parallel platforms. The original serial code was written using an object-oriented approach and was previously optimized for cache-based machines. The port to parallel platforms was achieved simply by treating partition boundaries as a type of boundary condition. Code modifications were minimal because boundary conditions were abstractions in the original program. Scalability results are presented for the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. Slightly superlinear speedup is achieved on a fixed-size problem on the Origin, due to cache effects.
An Open-Source Standard T-Wave Alternans Detector for Benchmarking.
Khaustov, A; Nemati, S; Clifford, Gd
2008-09-14
We describe an open source algorithm suite for T-Wave Alternans (TWA) detection and quantification. The software consists of Matlab implementations of the widely used Spectral Method and Modified Moving Average (MMA) method, with libraries to read both WFDB and ASCII data under Windows and Linux. The software suite can run in both batch mode and with a provided graphical user interface to aid waveform exploration. Our software suite was calibrated using an open source TWA model, described in a partner paper [1] by Clifford and Sameni. For the PhysioNet/CinC Challenge 2008 we obtained a score of 0.881 for the Spectral Method and 0.400 for the MMA method. However, our objective was not to provide the best TWA detector, but rather a basis for detailed discussion of algorithms.
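The Spectral Method quantifies TWA as excess power at exactly 0.5 cycles/beat in the beat-to-beat series of aligned ST-T samples, normalized by a nearby noise band (the k-score). The sketch below is a schematic NumPy rendering on synthetic beats, not the suite's Matlab code; the noise band limits and alternans amplitude are illustrative choices, and implementations differ in such details.

    import numpy as np

    def twa_kscore(beats):
        """Spectral TWA: beats is (n_beats, n_samples) of aligned ST-T segments."""
        n = beats.shape[0]                              # use an even n, e.g. 128
        spec = np.abs(np.fft.rfft(beats - beats.mean(axis=0), axis=0)) ** 2
        agg = spec.mean(axis=1)        # aggregate spectrum over sample offsets
        alternans = agg[-1]            # bin at 0.5 cycles/beat
        noise = agg[int(0.44 * n): int(0.49 * n)]       # conventional noise band
        return (alternans - noise.mean()) / noise.std()

    rng = np.random.default_rng(2)
    n_beats, n_samp, amp = 128, 60, 0.05
    st_t = np.sin(np.linspace(0.0, np.pi, n_samp))      # stylized ST-T template
    beats = (st_t
             + amp * (-1.0) ** np.arange(n_beats)[:, None]   # ABAB alternation
             + 0.02 * rng.standard_normal((n_beats, n_samp)))
    print("k-score: %.1f" % twa_kscore(beats))

A k-score of roughly 3 or more is the usual significance criterion; the alternating component injected above produces a score far beyond that.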
NASA Technical Reports Server (NTRS)
Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan
1994-01-01
A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS, with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes' working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also been used to run several benchmark parallel knowledge bases, such as one to set up a cafeteria. Results from running Parallel CLIPS with parallel knowledge-base partitions indicate that significant speed increases, including superlinear in some cases, are possible.
Nonlinear viscoplasticity in ASPECT: benchmarking and applications to subduction
NASA Astrophysics Data System (ADS)
Glerum, Anne; Thieulot, Cedric; Fraters, Menno; Blom, Constantijn; Spakman, Wim
2018-03-01
ASPECT (Advanced Solver for Problems in Earth's ConvecTion) is a massively parallel finite element code originally designed for modeling thermal convection in the mantle with a Newtonian rheology. The code is characterized by modern numerical methods, high-performance parallelism and extensibility. This last characteristic is illustrated in this work: we have extended the use of ASPECT from global thermal convection modeling to upper-mantle-scale applications of subduction. Subduction modeling generally requires the tracking of multiple materials with different properties and with nonlinear viscous and viscoplastic rheologies. To this end, we implemented a frictional plasticity criterion that is combined with a viscous diffusion and dislocation creep rheology. Because ASPECT uses compositional fields to represent different materials, all material parameters are made dependent on a user-specified number of fields. The goal of this paper is primarily to describe and verify our implementations of complex, multi-material rheology by reproducing the results of four well-known two-dimensional benchmarks: the indentor benchmark, the brick experiment, the sandbox experiment and the slab detachment benchmark. Furthermore, we aim to provide hands-on examples for prospective users by demonstrating the use of multi-material viscoplasticity with three-dimensional, thermomechanical models of oceanic subduction, putting ASPECT on the map as a community code for high-resolution, nonlinear rheology subduction modeling.
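The viscoplastic composition described above is commonly implemented as an effective viscosity: the creep viscosity from the flow law, capped by a Drucker-Prager yield stress divided by twice the strain rate. A minimal sketch of that composition follows, with illustrative material parameters (not ASPECT defaults):

    import math

    def dislocation_creep_viscosity(strain_rate, A, n, E, V, P, T, R=8.314):
        """eta = 0.5 * A**(-1/n) * edot**((1-n)/n) * exp((E + P*V) / (n*R*T))"""
        return (0.5 * A ** (-1.0 / n)
                * strain_rate ** ((1.0 - n) / n)
                * math.exp((E + P * V) / (n * R * T)))

    def viscoplastic_viscosity(strain_rate, P, T,
                               cohesion=20e6, phi=math.radians(30),
                               A=1e-15, n=3.5, E=530e3, V=1.4e-5):
        eta_creep = dislocation_creep_viscosity(strain_rate, A, n, E, V, P, T)
        yield_stress = cohesion * math.cos(phi) + P * math.sin(phi)  # Drucker-Prager
        eta_plastic = yield_stress / (2.0 * strain_rate)             # plastic cap
        return min(eta_creep, eta_plastic)

    # Effective viscosity at a representative strain rate, pressure, temperature.
    print("%.3e Pa s" % viscoplastic_viscosity(1e-15, P=1e9, T=1600.0))

In a multi-material setup, each compositional field carries its own parameter set, and an average of the fields' viscosities enters the Stokes solve.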
Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoshii, K.; Iskra, K.; Naik, H.
2011-05-01
We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solely for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.
ComprehensiveBench: a Benchmark for the Extensive Evaluation of Global Scheduling Algorithms
NASA Astrophysics Data System (ADS)
Pilla, Laércio L.; Bozzetti, Tiago C.; Castro, Márcio; Navaux, Philippe O. A.; Méhaut, Jean-François
2015-10-01
Parallel applications that present tasks with imbalanced loads or complex communication behavior usually do not exploit the underlying resources of parallel platforms to their full potential. In order to mitigate this issue, global scheduling algorithms are employed. As finding the optimal task distribution is an NP-Hard problem, identifying the most suitable algorithm for a specific scenario and comparing algorithms are not trivial tasks. In this context, this paper presents ComprehensiveBench, a benchmark for global scheduling algorithms that enables the variation of a vast range of parameters that affect performance. ComprehensiveBench can be used to assist in the development and evaluation of new scheduling algorithms, to help choose a specific algorithm for an arbitrary application, to emulate other applications, and to enable statistical tests. We illustrate its use in this paper with an evaluation of Charm++ periodic load balancers that stresses their characteristics.
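Since optimal task distribution is NP-hard, practical global schedulers are heuristics. The sketch below shows one classic baseline, greedy longest-processing-time (LPT) list scheduling; it is not ComprehensiveBench's own algorithm, merely the kind of scheduler such a benchmark is designed to evaluate.

```python
# Greedy LPT list scheduling: a baseline global scheduler, not
# ComprehensiveBench itself. Each task goes to the least-loaded processor.
import heapq

def lpt_schedule(task_loads, n_procs):
    heap = [(0.0, p) for p in range(n_procs)]        # (total load, processor)
    heapq.heapify(heap)
    assignment = {}
    for task, load in sorted(enumerate(task_loads), key=lambda t: -t[1]):
        total, p = heapq.heappop(heap)               # least-loaded processor
        assignment[task] = p
        heapq.heappush(heap, (total + load, p))
    return assignment
```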
NASA Astrophysics Data System (ADS)
Pernot, Pascal; Savin, Andreas
2018-06-01
Benchmarking studies in computational chemistry use reference datasets to assess the accuracy of a method through error statistics. The commonly used error statistics, such as the mean signed and mean unsigned errors, do not inform end-users on the expected amplitude of prediction errors attached to these methods. We show that, the distributions of model errors being neither normal nor zero-centered, these error statistics cannot be used to infer prediction error probabilities. To overcome this limitation, we advocate for the use of more informative statistics, based on the empirical cumulative distribution function of unsigned errors, namely, (1) the probability for a new calculation to have an absolute error below a chosen threshold and (2) the maximal amplitude of errors one can expect with a chosen high confidence level. Those statistics are also shown to be well suited for benchmarking and ranking studies. Moreover, the standard error on all benchmarking statistics depends on the size of the reference dataset. Systematic publication of these standard errors would be very helpful to assess the statistical reliability of benchmarking conclusions.
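A minimal sketch of the two advocated statistics, assuming nothing more than an array of signed reference errors; the bootstrap standard error at the end is one simple way to expose the dataset-size dependence the authors highlight.

```python
# Sketch of ECDF-based benchmarking statistics (illustrative, not the
# authors' code): (1) P(|error| < threshold), (2) the error amplitude
# not exceeded at a chosen confidence level, plus a bootstrap SE.
import numpy as np

def ecdf_stats(errors, threshold, confidence=0.95, n_boot=1000):
    abs_err = np.abs(np.asarray(errors, dtype=float))
    p_below = np.mean(abs_err < threshold)        # statistic (1)
    q_conf = np.quantile(abs_err, confidence)     # statistic (2)
    rng = np.random.default_rng(0)
    boot = rng.choice(abs_err, (n_boot, abs_err.size))   # resampled datasets
    se_p = np.std(np.mean(boot < threshold, axis=1))     # SE of statistic (1)
    return p_below, q_conf, se_p
```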
Verification and benchmark testing of the NUFT computer code
NASA Astrophysics Data System (ADS)
Lee, K. H.; Nitao, J. J.; Kulshrestha, A.
1993-10-01
This interim report presents results of work completed in the ongoing verification and benchmark testing of the NUFT (Nonisothermal Unsaturated-saturated Flow and Transport) computer code. NUFT is a suite of multiphase, multicomponent models for numerical solution of thermal and isothermal flow and transport in porous media, with application to subsurface contaminant transport problems. The code simulates the coupled transport of heat, fluids, and chemical components, including volatile organic compounds. Grid systems may be Cartesian or cylindrical, with one-, two-, or fully three-dimensional configurations possible. In this initial phase of testing, the NUFT code was used to solve seven one-dimensional unsaturated flow and heat transfer problems. Three verification and four benchmarking problems were solved. In the verification testing, excellent agreement was observed between NUFT results and the analytical or quasianalytical solutions. In the benchmark testing, results of code intercomparison were very satisfactory. From these testing results, it is concluded that the NUFT code is ready for application to field and laboratory problems similar to those addressed here. Multidimensional problems, including those dealing with chemical transport, will be addressed in a subsequent report.
Ground truth and benchmarks for performance evaluation
NASA Astrophysics Data System (ADS)
Takeuchi, Ayako; Shneier, Michael; Hong, Tsai Hong; Chang, Tommy; Scrapper, Christopher; Cheok, Geraldine S.
2003-09-01
Progress in algorithm development and transfer of results to practical applications such as military robotics requires the setup of standard tasks, of standard qualitative and quantitative measurements for performance evaluation and validation. Although the evaluation and validation of algorithms have been discussed for over a decade, the research community still faces a lack of well-defined and standardized methodology. The range of fundamental problems includes a lack of quantifiable measures of performance, a lack of data from state-of-the-art sensors in calibrated real-world environments, and a lack of facilities for conducting realistic experiments. In this research, we propose three methods for creating ground truth databases and benchmarks using multiple sensors. The databases and benchmarks will provide researchers with high quality data from suites of sensors operating in complex environments representing real problems of great relevance to the development of autonomous driving systems. At NIST, we have prototyped a High Mobility Multi-purpose Wheeled Vehicle (HMMWV) system with a suite of sensors including a Riegl ladar, GDRS ladar, stereo CCD, several color cameras, Global Position System (GPS), Inertial Navigation System (INS), pan/tilt encoders, and odometry. All sensors are calibrated with respect to each other in space and time. This allows a database of features and terrain elevation to be built. Ground truth for each sensor can then be extracted from the database. The main goal of this research is to provide ground truth databases for researchers and engineers to evaluate algorithms for effectiveness, efficiency, reliability, and robustness, thus advancing the development of algorithms.
NASA Technical Reports Server (NTRS)
1991-01-01
Various papers on supercomputing are presented. The general topics addressed include: program analysis/data dependence, memory access, distributed memory code generation, numerical algorithms, supercomputer benchmarks, latency tolerance, parallel programming, applications, processor design, networks, performance tools, mapping and scheduling, characterization affecting performance, parallelism packaging, computing climate change, combinatorial algorithms, hardware and software performance issues, system issues. (No individual items are abstracted in this volume)
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.
2003-01-01
Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Research on computer systems benchmarking
NASA Technical Reports Server (NTRS)
Smith, Alan Jay (Principal Investigator)
1996-01-01
This grant addresses the topic of research on computer systems benchmarking and is more generally concerned with performance issues in computer systems. This report reviews work in those areas during the period of NASA support under this grant. The bulk of the work performed concerned benchmarking and analysis of CPUs, compilers, caches, and benchmark programs. The first part of this work concerned the issue of benchmark performance prediction. A new approach to benchmarking and machine characterization was reported, using a machine characterizer that measures the performance of a given system in terms of a Fortran abstract machine. Another report focused on analyzing compiler performance. The performance impact of optimization in the context of our methodology for CPU performance characterization was based on the abstract machine model. Benchmark programs are analyzed in another paper. A machine-independent model of program execution was developed to characterize both machine performance and program execution. By merging these machine and program characterizations, execution time can be estimated for arbitrary machine/program combinations. The work was continued into the domain of parallel and vector machines, including the issue of caches in vector processors and multiprocessors. All of the afore-mentioned accomplishments are more specifically summarized in this report, as well as those smaller in magnitude supported by this grant.
Interfacing Computer Aided Parallelization and Performance Analysis
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)
2003-01-01
When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.
Scalability and Portability of Two Parallel Implementations of ADI
NASA Technical Reports Server (NTRS)
Phung, Thanh; VanderWijngaart, Rob F.
1994-01-01
Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.
Locating hardware faults in a data communications network of a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-01-12
Locating hardware faults in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute nodes as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.
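A compact sketch of the test logic claimed above, with run_suite as a hypothetical callback that runs the communications test suite on the test tree rooted at a given node:

```python
# Sketch of the fault-location rule (run_suite is a hypothetical callback
# returning True when the test suite passes on the tree rooted at a node).
def locate_defective_links(parent, children, run_suite):
    parent_ok = run_suite(parent)                      # parent test tree
    children_ok = {c: run_suite(c) for c in children}  # child test trees
    if not parent_ok and all(children_ok.values()):
        # Suite fails only when the parent's own links are exercised:
        # blame the parent-to-child link(s).
        return [(parent, child) for child in children]
    return []
```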
Experiences Using OpenMP Based on Compiler Directed Software DSM on a PC Cluster
NASA Technical Reports Server (NTRS)
Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland
2003-01-01
In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.
RISC Processors and High Performance Computing
NASA Technical Reports Server (NTRS)
Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)
1995-01-01
In this tutorial, we will discuss the top five current RISC microprocessors: the IBM Power2, which is used in the IBM RS6000/590 workstation and in the IBM SP2 parallel supercomputer; the DEC Alpha, which is used in the DEC Alpha workstation and in the Cray T3D; the MIPS R8000, which is used in the SGI Power Challenge; the HP PA-RISC 7100, which is used in the HP 700 series workstations and in the Convex Exemplar; and the Cray proprietary processor, which is used in the new Cray J916. The architecture of these microprocessors will first be presented. The effective performance of these processors will then be compared, both by citing standard benchmarks and in the context of implementing real applications. In the process, different programming models such as data parallel (CM Fortran and HPF) and message passing (PVM and MPI) will be introduced and compared. The latest NAS Parallel Benchmark (NPB) absolute performance and performance-per-dollar figures will be presented. The next generation of the NPB will also be described. The tutorial will conclude with a discussion of general trends in the field of high performance computing, including likely future developments in hardware and software technology, and the relative roles of vector supercomputers, tightly coupled parallel computers, and clusters of workstations. This tutorial will provide a unique cross-machine comparison not available elsewhere.
Parareal in time 3D numerical solver for the LWR Benchmark neutron diffusion transient model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baudron, Anne-Marie, E-mail: anne-marie.baudron@cea.fr; CEA-DRN/DMT/SERMA, CEN-Saclay, 91191 Gif sur Yvette Cedex; Lautard, Jean-Jacques, E-mail: jean-jacques.lautard@cea.fr
2014-12-15
In this paper we present a time-parallel algorithm for the 3D neutron calculation of a transient model in a nuclear reactor core. The neutron calculation consists of numerically solving the time-dependent diffusion approximation equation, which is a simplified transport equation. The numerical resolution is done with a finite element method based on a tetrahedral meshing of the computational domain, representing the reactor core, and time discretization is achieved using a θ-scheme. The transient model presents moving control rods during the time of the reaction. Therefore, cross-sections (piecewise constant) are taken into account by interpolations with respect to the velocity of the control rods. The parallelism across time is achieved by an adequate application of the parareal-in-time algorithm to the problem at hand. This parallel method is a predictor-corrector scheme that iteratively combines the use of two kinds of numerical propagators, one coarse and one fine. Our method is made efficient by means of a coarse solver defined with a large time step and a fixed-position control rods model, while the fine propagator is assumed to be a high order numerical approximation of the full model. The parallel implementation of our method provides good scalability of the algorithm. Numerical results show the efficiency of the parareal method on a large light water reactor transient model corresponding to the Langenbuch–Maurer–Werner benchmark.
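A minimal sketch of the parareal predictor-corrector iteration described above, with coarse and fine as hypothetical propagator callbacks of signature propagate(u, t0, t1); in a real implementation the fine solves are distributed across time slices, which is where the parallelism comes from.

```python
# Parareal sketch (illustrative; `coarse` and `fine` are hypothetical
# propagators). Correction: u[k+1] = G_new(u[k]) + F_old(u[k]) - G_old(u[k]).
import numpy as np

def parareal(u0, t_grid, coarse, fine, n_iter=5):
    n = len(t_grid) - 1
    u = [np.asarray(u0, dtype=float)] + [None] * n
    for k in range(n):                                # coarse prediction
        u[k + 1] = coarse(u[k], t_grid[k], t_grid[k + 1])
    for _ in range(n_iter):
        # Fine and coarse sweeps over the *previous* iterate; the fine
        # sweep is the embarrassingly parallel part.
        f_old = [fine(u[k], t_grid[k], t_grid[k + 1]) for k in range(n)]
        g_old = [coarse(u[k], t_grid[k], t_grid[k + 1]) for k in range(n)]
        for k in range(n):                            # sequential correction
            g_new = coarse(u[k], t_grid[k], t_grid[k + 1])
            u[k + 1] = g_new + f_old[k] - g_old[k]
    return u
```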
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ramos-Mendez, J; Faddegon, B; Perl, J
2015-06-15
Purpose: To develop and verify an extension to TOPAS for calculation of dose response models (TCP/NTCP). TOPAS wraps and extends Geant4. Methods: The TOPAS DICOM interface was extended to include structure contours, for subsequent calculation of DVHs and TCP/NTCP. The following dose response models were implemented: Lyman-Kutcher-Burman (LKB), critical element (CE), population based critical volume (CV), parallel-serial, a sigmoid-based model of Niemierko for NTCP and TCP, and a Poisson-based model for TCP. For verification, results for the parallel-serial and Poisson models, with 6 MV x-ray dose distributions calculated with TOPAS and Pinnacle v9.2, were compared to data from the benchmark configuration of the AAPM Task Group 166 (TG166). We provide a benchmark configuration suitable for proton therapy along with results for the implementation of the Niemierko, CV and CE models. Results: The maximum difference in DVH calculated with Pinnacle and TOPAS was 2%. Differences between TG166 data and Monte Carlo calculations of up to 4.2%±6.1% were found for the parallel-serial model and up to 1.0%±0.7% for the Poisson model (including the uncertainty due to lack of knowledge of the point spacing in TG166). For the CE, CV and Niemierko models, the discrepancies between the Pinnacle and TOPAS results are 74.5%, 34.8% and 52.1% when using 29.7 cGy point spacing, the differences being highly sensitive to dose spacing. On the other hand, with our proposed benchmark configuration, the largest differences were 12.05%±0.38%, 3.74%±1.6%, 1.57%±4.9% and 1.97%±4.6% for the CE, CV, Niemierko and LKB models, respectively. Conclusion: Several dose response models were successfully implemented with the extension module. Reference data were calculated for future benchmarking. Dose response calculated for the different models varied much more widely for the TG166 benchmark than for the proposed benchmark, which had much lower sensitivity to the choice of DVH dose points. This work was supported by National Cancer Institute Grant R01CA140735.
NASA Astrophysics Data System (ADS)
Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.
2018-05-01
Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi coprocessor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7×) and Xeon Phi coprocessor (4.7-4.9×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into ODE solver methods that are both SIMD-friendly and computationally efficient.
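To make the batching strategy concrete, the numpy sketch below advances many independent ODE systems in lock-step with a fixed-step RK4, the analogue of the SIMD approach; it deliberately omits the adaptive step-size control that the paper identifies as the source of divergence. It is not the paper's OpenCL code, and f is a hypothetical batched right-hand-side function.

```python
# Lock-step ("SIMD-style") integration of a batch of independent ODE
# systems. y0 has shape (n_systems, n_vars); f(t, y) returns dy/dt with
# the same shape. A fixed step avoids per-system divergence, unlike the
# adaptive solvers benchmarked in the paper.
import numpy as np

def rk4_batch(f, y0, t0, t1, n_steps):
    h = (t1 - t0) / n_steps
    y, t = np.array(y0, dtype=float), t0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y
```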
Application of high-performance computing to numerical simulation of human movement
NASA Technical Reports Server (NTRS)
Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
1995-01-01
We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
FX-87 performance measurements: data-flow implementation. Technical report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hammel, R.T.; Gifford, D.K.
1988-11-01
This report documents a series of experiments performed to explore the thesis that the FX-87 effect system permits a compiler to schedule imperative programs (i.e., programs that may contain side-effects) for execution on a parallel computer. The authors analyze how much the FX-87 static effect system can improve the execution times of five benchmark programs on a parallel graph interpreter. Three of the benchmark programs do not use side-effects (factorial, fibonacci, and polynomial division) and thus did not have any effect-induced constraints. Their FX-87 performance was comparable to their performance in a purely functional language. Two of the benchmark programs use side effects (DNA sequence matching and Scheme interpretation) and the compiler was able to use effect information to reduce their execution times by factors of 1.7 to 5.4 when compared with sequential execution times. These results support the thesis that a static effect system is a powerful tool for compilation to multiprocessor computers. However, the graph interpreter used was based on unrealistic assumptions, and thus these results may not accurately reflect the performance of a practical FX-87 implementation. The results also suggest that conventional loop analysis would complement the FX-87 effect system.
I/O-Efficient Scientific Computation Using TPIE
NASA Technical Reports Server (NTRS)
Vengroff, Darren Erik; Vitter, Jeffrey Scott
1996-01-01
In recent years, input/output (I/O)-efficient algorithms for a wide variety of problems have appeared in the literature. However, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to support I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only with explicit read and write calls, but also the complex memory management that must be performed for I/O-efficient computation. In this paper we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks while others are of our own creation. We demonstrate that the central processing unit (CPU) overhead required to manage I/O is small and that even with just a single disk, the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.
Parallelization of PANDA discrete ordinates code using spatial decomposition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humbert, P.
2006-07-01
We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by direction and octant pipelining. The implementation of the algorithm is straightforward using MPI blocking point-to-point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7 benchmark of the OECD/NEA.
PCTDSE: A parallel Cartesian-grid-based TDSE solver for modeling laser-atom interactions
NASA Astrophysics Data System (ADS)
Fu, Yongsheng; Zeng, Jiaolong; Yuan, Jianmin
2017-01-01
We present a parallel Cartesian-grid-based time-dependent Schrödinger equation (TDSE) solver for modeling laser-atom interactions. It can simulate the single-electron dynamics of atoms in arbitrary time-dependent vector potentials. We use a split-operator method combined with fast Fourier transforms (FFT) on a three-dimensional (3D) Cartesian grid. Parallelization is realized using a 2D decomposition strategy based on the Message Passing Interface (MPI) library, which results in good parallel scaling on modern supercomputers. We give simple applications for the hydrogen atom using benchmark problems drawn from the references and obtain reproducible results. Extensions to other laser-atom systems are straightforward with minimal modifications of the source code.
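A one-dimensional sketch of the split-operator/FFT step the solver is built on (PCTDSE itself works on a 3D Cartesian grid with a 2D MPI decomposition); atomic units and the symmetric potential-kinetic-potential splitting are assumptions made for the example.

```python
# One split-operator step for the 1D TDSE (illustrative, atomic units):
# half potential kick, full kinetic drift in momentum space, half kick.
import numpy as np

def split_step(psi, V, dt, dx, hbar=1.0, m=1.0):
    k = 2 * np.pi * np.fft.fftfreq(psi.size, d=dx)        # wavenumbers
    half_kick = np.exp(-0.5j * V * dt / hbar)             # exp(-i V dt / 2)
    psi = half_kick * psi
    # Kinetic propagator exp(-i hbar k^2 dt / 2m), applied in k-space
    psi = np.fft.ifft(np.exp(-0.5j * hbar * k**2 * dt / m) * np.fft.fft(psi))
    return half_kick * psi
```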
ERIC Educational Resources Information Center
National Center for Education Statistics, 2015
2015-01-01
The ED School Climate Surveys (EDSCLS) are a suite of survey instruments being developed for schools, school districts, and states by the U.S. Department of Education's National Center for Education Statistics (NCES). Through the EDSCLS, schools nationwide will have access to survey instruments and a survey platform that will allow for the…
A Diagnostic Assessment of Evolutionary Multiobjective Optimization for Water Resources Systems
NASA Astrophysics Data System (ADS)
Reed, P.; Hadka, D.; Herman, J.; Kasprzyk, J.; Kollat, J.
2012-04-01
This study contributes a rigorous diagnostic assessment of state-of-the-art multiobjective evolutionary algorithms (MOEAs) and highlights key advances that the water resources field can exploit to better discover the critical tradeoffs constraining our systems. This study provides the most comprehensive diagnostic assessment of MOEAs for water resources to date, exploiting more than 100,000 MOEA runs and trillions of design evaluations. The diagnostic assessment measures the effectiveness, efficiency, reliability, and controllability of ten benchmark MOEAs for a representative suite of water resources applications addressing rainfall-runoff calibration, long-term groundwater monitoring (LTM), and risk-based water supply portfolio planning. The suite of problems encompasses a range of challenging problem properties including (1) many-objective formulations with 4 or more objectives, (2) multi-modality (or false optima), (3) nonlinearity, (4) discreteness, (5) severe constraints, (6) stochastic objectives, and (7) non-separability (also called epistasis). The applications are representative of the dominant problem classes that have shaped the history of MOEAs in water resources and that will be dominant foci in the future. Recommendations are provided for which modern MOEAs should serve as tools and benchmarks in the future water resources literature.
Guturu, Parthasarathy; Dantu, Ram
2008-06-01
Many graph- and set-theoretic problems, because of their tremendous application potential and theoretical appeal, have been well investigated by the researchers in complexity theory and were found to be NP-hard. Since the combinatorial complexity of these problems does not permit exhaustive searches for optimal solutions, only near-optimal solutions can be explored using either various problem-specific heuristic strategies or metaheuristic global-optimization methods, such as simulated annealing, genetic algorithms, etc. In this paper, we propose a unified evolutionary algorithm (EA) to the problems of maximum clique finding, maximum independent set, minimum vertex cover, subgraph and double subgraph isomorphism, set packing, set partitioning, and set cover. In the proposed approach, we first map these problems onto the maximum clique-finding problem (MCP), which is later solved using an evolutionary strategy. The proposed impatient EA with probabilistic tabu search (IEA-PTS) for the MCP integrates the best features of earlier successful approaches with a number of new heuristics that we developed to yield a performance that advances the state of the art in EAs for the exploration of the maximum cliques in a graph. Results of experimentation with the 37 DIMACS benchmark graphs and comparative analyses with six state-of-the-art algorithms, including two from the smaller EA community and four from the larger metaheuristics community, indicate that the IEA-PTS outperforms the EAs with respect to a Pareto-lexicographic ranking criterion and offers competitive performance on some graph instances when individually compared to the other heuristic algorithms. It has also successfully set a new benchmark on one graph instance. On another benchmark suite called Benchmarks with Hidden Optimal Solutions, IEA-PTS ranks second, after a very recent algorithm called COVER, among its peers that have experimented with this suite.
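The reduction underlying the unified approach is the textbook one: a maximum independent set of a graph G is exactly a maximum clique of G's complement. The small sketch below shows the mapping, assuming the networkx library and using its exact clique routine (practical only for small graphs; the paper's evolutionary search replaces this exhaustive step).

```python
# Mapping maximum independent set onto maximum clique via the complement
# graph (assumes networkx; exact search, small graphs only).
import networkx as nx

def max_independent_set_via_clique(g):
    complement = nx.complement(g)
    clique, _ = nx.max_weight_clique(complement, weight=None)  # cardinality
    return set(clique)   # independent set in g == clique in complement(g)
```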
A high performance linear equation solver on the VPP500 parallel supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi
1994-12-31
This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of the VPP500: (1) scalability for an arbitrary number of processors, up to 222 processors; (2) flexible data transfer among processors provided by a crossbar interconnection network; (3) vector processing capability on each processor; and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LINPACK Highly Parallel Computing benchmark.
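A serial numpy sketch of the blocked (right-looking) LU decomposition named above, without pivoting and without the VPP500's data distribution, vectorization, or compute/transfer overlap; the block size and in-place L/U layout are illustrative.

```python
# Right-looking blocked LU without pivoting (a sketch of the method, not
# Fujitsu's solver). On return, `a` holds U on/above the diagonal and the
# unit-lower-triangular L below it.
import numpy as np

def blocked_lu(a, nb=64):
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        for j in range(k, e):              # unblocked LU of the panel
            a[j + 1:, j] /= a[j, j]
            a[j + 1:, j + 1:e] -= np.outer(a[j + 1:, j], a[j, j + 1:e])
        if e < n:
            l11 = np.tril(a[k:e, k:e], -1) + np.eye(e - k)
            a[k:e, e:] = np.linalg.solve(l11, a[k:e, e:])   # U12 block
            a[e:, e:] -= a[e:, k:e] @ a[k:e, e:]            # rank-nb update
    return a
```

The trailing update in the last line is a matrix-matrix multiply, which is what lets a blocked formulation exploit vector pipelines and overlap communication on a machine like the VPP500.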
Lightweight Specifications for Parallel Correctness
2012-12-05
Hierarchically Parallelized Constrained Nonlinear Solvers with Automated Substructuring
NASA Technical Reports Server (NTRS)
Padovan, Joe; Kwang, Abel
1994-01-01
This paper develops a parallelizable multilevel multiple constrained nonlinear equation solver. The substructuring process is automated to yield appropriately balanced partitioning of each succeeding level. Due to the generality of the procedure, sequential, as well as partially and fully parallel, environments can be handled. This includes both single and multiprocessor assignment per individual partition. Several benchmark examples are presented. These illustrate the robustness of the procedure as well as its capability to yield significant reductions in memory utilization and calculational effort, due both to updating and inversion.
Experiences Using OpenMP Based on Compiler Directed Software DSM on a PC Cluster
NASA Technical Reports Server (NTRS)
Hess, Matthias; Jost, Gabriele; Mueller, Matthias; Ruehle, Roland; Biegel, Bryan (Technical Monitor)
2002-01-01
In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs (personal computers) running a software distributed shared memory (DSM) system. We describe our test environment and report on the performance of a subset of the NAS (NASA Advanced Supercomputing) Parallel Benchmarks that have been automatically parallelized for OpenMP. We compare the performance of the OpenMP implementations with that of their message passing counterparts and discuss performance differences.
Distributed File System Utilities to Manage Large Datasets, Version 0.5
DOE Office of Scientific and Technical Information (OSTI.GOV)
2014-05-21
FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/O calls. The current suite consists of tools to copy, compare, remove, and list files. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.
Interactive visual optimization and analysis for RFID benchmarking.
Wu, Yingcai; Chung, Ka-Kei; Qu, Huamin; Yuan, Xiaoru; Cheung, S C
2009-01-01
Radio frequency identification (RFID) is a powerful automatic remote identification technique that has wide applications. To facilitate RFID deployment, an RFID benchmarking instrument called aGate has been invented to identify the strengths and weaknesses of different RFID technologies in various environments. However, the data acquired by aGate are usually complex time varying multidimensional 3D volumetric data, which are extremely challenging for engineers to analyze. In this paper, we introduce a set of visualization techniques, namely, parallel coordinate plots, orientation plots, a visual history mechanism, and a 3D spatial viewer, to help RFID engineers analyze benchmark data visually and intuitively. With the techniques, we further introduce two workflow procedures (a visual optimization procedure for finding the optimum reader antenna configuration and a visual analysis procedure for comparing the performance and identifying the flaws of RFID devices) for the RFID benchmarking, with focus on the performance analysis of the aGate system. The usefulness and usability of the system are demonstrated in the user evaluation.
NASA Astrophysics Data System (ADS)
Garnier, Valérie; Honnorat, Marc; Benshila, Rachid; Boutet, Martial; Cambon, Gildas; Chanut, Jérome; Couvelard, Xavier; Debreu, Laurent; Ducousso, Nicolas; Duhaut, Thomas; Dumas, Franck; Flavoni, Simona; Gouillon, Flavien; Lathuilière, Cyril; Le Boyer, Arnaud; Le Sommer, Julien; Lyard, Florent; Marsaleix, Patrick; Marchesiello, Patrick; Soufflet, Yves
2016-04-01
The COMODO group (http://www.comodo-ocean.fr) gathers developers of global and limited-area ocean models (NEMO, ROMS_AGRIF, S, MARS, HYCOM, S-TUGO) with the aim of addressing well-identified numerical issues. In order to evaluate existing models, to improve numerical approaches, methods, and concepts (such as effective resolution), to assess the behavior of numerical models in complex hydrodynamical regimes, and to propose guidelines for the development of future ocean models, a benchmark suite is proposed that covers both idealized test cases dedicated to targeted properties of numerical schemes and more complex test cases allowing the evaluation of kernel coherence. The benchmark suite is built to study separately, then together, the main components of an ocean model: the continuity and momentum equations, the advection-diffusion of tracers, the vertical coordinate design, and the time stepping algorithms. The test cases are chosen for their simplicity of implementation (analytic initial conditions), for their capacity to focus on a few schemes or parts of the kernel, for the availability of analytical solutions or accurate diagnoses, and lastly for their ability to simulate a key oceanic process in a controlled environment. Idealized test cases (advection-diffusion of tracers, upwelling, lock exchange, baroclinic vortex, adiabatic motion along bathymetry) allow properties of numerical schemes to be verified, and further cases (trajectory of a barotropic vortex, current-topography interaction) bring to light numerical issues that remain undetected in realistic configurations. When complexity in the simulated dynamics grows (internal waves, unstable baroclinic jet), the sharing of the same experimental designs by different existing models is useful to get a measure of the model sensitivity to numerical choices (Soufflet et al., 2016). Lastly, test cases help in understanding the submesoscale influence on the dynamics (Couvelard et al., 2015). Such a benchmark suite is a valuable testbed for continued research in numerical approaches as well as an efficient tool to maintain an ocean code and assure users of a validated model across a range of hydrodynamical regimes. Thanks to a common netCDF format, the suite is completed with a Python library that encompasses all the tools and metrics used to assess the efficiency of the numerical methods. References: Couvelard, X., F. Dumas, V. Garnier, A.L. Ponte, C. Talandier, A.M. Treguier (2015). Mixed layer formation and restratification in presence of mesoscale and submesoscale turbulence. Ocean Modelling, Vol. 96-2, p. 243-253. doi:10.1016/j.ocemod.2015.10.004. Soufflet, Y., P. Marchesiello, F. Lemarié, J. Jouanno, X. Capet, L. Debreu, R. Benshila (2016). On effective resolution in ocean models. Ocean Modelling, in press. doi:10.1016/j.ocemod.2015.12.004.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cohen, J; Dossa, D; Gokhale, M
Critical data science applications requiring frequent access to storage perform poorly on today's computing architectures. This project addresses efficient computation of data-intensive problems in national security and basic science by exploring, advancing, and applying a new form of computing called storage-intensive supercomputing (SISC). Our goal is to enable applications that simply cannot run on current systems, and, for a broad range of data-intensive problems, to deliver an order of magnitude improvement in price/performance over today's data-intensive architectures. This technical report documents much of the work done under LDRD 07-ERD-063 Storage Intensive Supercomputing during the period 05/07-09/07. The following chapters describe: (1) a new file I/O monitoring tool iotrace developed to capture the dynamic I/O profiles of Linux processes; (2) an out-of-core graph benchmark for level-set expansion of scale-free graphs; (3) an entity extraction benchmark consisting of a pipeline of eight components; and (4) an image resampling benchmark drawn from the SWarp program in the LSST data processing pipeline. The performance of the graph and entity extraction benchmarks was measured in three different scenarios: data sets residing on the NFS file server and accessed over the network; data sets stored on local disk; and data sets stored on the Fusion I/O parallel NAND Flash array. The image resampling benchmark compared the performance of software-only to GPU-accelerated implementations. In addition to the work reported here, an additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification. The n-gram application will be presented at SC07 at the High Performance Reconfigurable Computing Technologies and Applications Workshop. The graph and entity extraction benchmarks were run on a Supermicro server housing the NAND Flash 40GB parallel disk array, the Fusion-io. The Fusion system specs are as follows: SuperMicro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that showed greater than 50% of the time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit to boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolatile memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.
Soucek, Alexander; Ostkamp, Lutz; Paternesi, Roberta
2015-04-01
Space suit simulators are used for extravehicular activities (EVAs) during Mars analog missions. Flight planning and EVA productivity require accurate time estimates of activities to be performed with such simulators, such as experiment execution or traverse walking. We present a benchmarking methodology for the Aouda.X space suit simulator of the Austrian Space Forum. By measuring and comparing the times needed to perform a set of 10 test activities with and without Aouda.X, an average time delay was derived in the form of a multiplicative factor. This statistical value (a second-over-second time ratio) is 1.30 and shows that operations in Aouda.X take on average a third longer than the same operations without the suit. We also show that activities predominantly requiring fine motor skills are associated with larger time delays (between 1.17 and 1.59) than those requiring short-distance locomotion or short-term muscle strain (between 1.10 and 1.16). The results of the DELTA experiment performed during the MARS2013 field mission increase analog mission planning reliability and thus EVA efficiency and productivity when using Aouda.X.
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
NASA Astrophysics Data System (ADS)
Mawson, Mark J.; Revell, Alistair J.
2014-10-01
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming), involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of which make use of 'performance enhancing' features of the GPU; shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the more simple approach is most efficient; since the need for large numbers of registers per thread in LBM limits the block size and thus the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration-time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.
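As a minimal illustration of the streaming step these optimizations target, the numpy sketch below shifts each distribution function along its lattice velocity; velocities stands for the 19 D3Q19 lattice vectors (a hypothetical input here), and np.roll implies periodic boundaries.

```python
# Streaming step of a lattice Boltzmann solver: an illustrative CPU
# analogue of the GPU data shifts benchmarked in the paper. f has shape
# (19, nx, ny, nz); `velocities` is a list of 19 (cx, cy, cz) vectors.
import numpy as np

def stream(f, velocities):
    for i, (cx, cy, cz) in enumerate(velocities):
        f[i] = np.roll(f[i], shift=(cx, cy, cz), axis=(0, 1, 2))
    return f
```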
Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX, Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace-collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation-based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, which convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.
Memory Benchmarks for SMP-Based High Performance Parallel Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoo, A B; de Supinski, B; Mueller, F
2001-11-20
As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominate the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sublet, J.-Ch.; Koning, A.J.; Forrest, R.A.
The reasons for the conversion of the European Activation File, EAF, into ENDF-6 format are threefold. First, it significantly enhances the JEFF-3.0 release by the addition of an activation file. Second, it considerably increases its usage by adopting a recognized, official file format, allowing existing plug-in processes to be effective. Third, it moves towards a universal nuclear data file, in contrast to the current separate general- and special-purpose files. The format chosen for the JEFF-3.0/A file uses reaction cross sections (MF-3), cross sections (MF-10), and multiplicities (MF-9). Having the data in ENDF-6 format allows the ENDF suite of utilities and checker codes to be used alongside many other utility, visualizing, and processing codes. It is based on the EAF activation file used for many applications from fission to fusion, including dosimetry, inventories, depletion-transmutation, and geophysics. JEFF-3.0/A takes advantage of four generations of EAF files. Extensive benchmarking activities on these files provide feedback and validation with integral measurements. These, in parallel with a detailed graphical analysis based on EXFOR, have been applied, stimulating new measurements and significantly increasing the quality of this activation file. The next step is to include the EAF uncertainty data for all channels in JEFF-3.0/A.
Towards unbiased benchmarking of evolutionary and hybrid algorithms for real-valued optimisation
NASA Astrophysics Data System (ADS)
MacNish, Cara
2007-12-01
Randomised population-based algorithms, such as evolutionary, genetic and swarm-based algorithms, and their hybrids with traditional search techniques, have proven successful and robust on many difficult real-valued optimisation problems. This success, along with the readily applicable nature of these techniques, has led to an explosion in the number of algorithms and variants proposed. In order for the field to advance it is necessary to carry out effective comparative evaluations of these algorithms, and thereby better identify and understand those properties that lead to better performance. This paper discusses the difficulties of providing benchmarking of evolutionary and allied algorithms that is both meaningful and logistically viable. To be meaningful the benchmarking test must give a fair comparison that is free, as far as possible, from biases that favour one style of algorithm over another. To be logistically viable it must overcome the need for pairwise comparison between all the proposed algorithms. To address the first problem, we begin by attempting to identify the biases that are inherent in commonly used benchmarking functions. We then describe a suite of test problems, generated recursively as self-similar or fractal landscapes, designed to overcome these biases. For the second, we describe a server that uses web services to allow researchers to 'plug in' their algorithms, running on their local machines, to a central benchmarking repository.
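A tiny sketch of a recursively generated self-similar landscape in one dimension (midpoint displacement): each level adds detail at half the spatial scale with the amplitude damped by a roughness factor. The paper's generator and its bias-avoiding design are more elaborate, so the construction below is purely illustrative.

```python
# Recursive self-similar (fractal) test landscape via midpoint
# displacement. Illustrative only; not MacNish's benchmark generator.
import numpy as np

def fractal_landscape(depth, roughness=0.5, rng=None):
    rng = rng or np.random.default_rng()
    y, amp = np.zeros(2), 1.0
    for _ in range(depth):
        mid = (y[:-1] + y[1:]) / 2 + rng.uniform(-amp, amp, size=y.size - 1)
        out = np.empty(2 * y.size - 1)
        out[::2], out[1::2] = y, mid     # interleave old points and midpoints
        y, amp = out, amp * roughness    # self-similar: damp each level
    return y
```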
Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mishra, Alok; Li, Lingda; Kong, Martin
The latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for unified memory space. In such systems, CPU and GPU can access each other's memory transparently; that is, the data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform, and how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We also modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is oversubscribed for benchmarks with a large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
How do I know if my forecasts are better? Using benchmarks in hydrological ensemble prediction
NASA Astrophysics Data System (ADS)
Pappenberger, F.; Ramos, M. H.; Cloke, H. L.; Wetterhall, F.; Alfieri, L.; Bogner, K.; Mueller, A.; Salamon, P.
2015-03-01
The skill of a forecast can be assessed by comparing the relative proximity of both the forecast and a benchmark to the observations. Example benchmarks include climatology or a naïve forecast. Hydrological ensemble prediction systems (HEPS) are currently transforming the hydrological forecasting environment but in this new field there is little information to guide researchers and operational forecasters on how benchmarks can be best used to evaluate their probabilistic forecasts. In this study, it is identified that the forecast skill calculated can vary depending on the benchmark selected and that the selection of a benchmark for determining forecasting system skill is sensitive to a number of hydrological and system factors. A benchmark intercomparison experiment is then undertaken using the continuous ranked probability score (CRPS), a reference forecasting system and a suite of 23 different methods to derive benchmarks. The benchmarks are assessed within the operational set-up of the European Flood Awareness System (EFAS) to determine those that are 'toughest to beat' and so give the most robust discrimination of forecast skill, particularly for the spatial average fields that EFAS relies upon. Evaluating against an observed discharge proxy the benchmark that has most utility for EFAS and avoids the most naïve skill across different hydrological situations is found to be meteorological persistency. This benchmark uses the latest meteorological observations of precipitation and temperature to drive the hydrological model. Hydrological long term average benchmarks, which are currently used in EFAS, are very easily beaten by the forecasting system and the use of these produces much naïve skill. When decomposed into seasons, the advanced meteorological benchmarks, which make use of meteorological observations from the past 20 years at the same calendar date, have the most skill discrimination. They are also good at discriminating skill in low flows and for all catchment sizes. Simpler meteorological benchmarks are particularly useful for high flows. Recommendations for EFAS are to move to routine use of meteorological persistency, an advanced meteorological benchmark and a simple meteorological benchmark in order to provide a robust evaluation of forecast skill. This work provides the first comprehensive evidence on how benchmarks can be used in evaluation of skill in probabilistic hydrological forecasts and which benchmarks are most useful for skill discrimination and avoidance of naïve skill in a large scale HEPS. It is recommended that all HEPS use the evidence and methodology provided here to evaluate which benchmarks to employ; so forecasters can have trust in their skill evaluation and will have confidence that their forecasts are indeed better.
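A minimal sketch of the skill bookkeeping involved, assuming ensemble forecasts scored with the energy-score form of the ensemble CRPS; a skill score above zero means the forecasting system beats the chosen benchmark, and a "tough" benchmark is one that keeps this margin honest.

```python
# CRPS of an ensemble forecast and the CRPS skill score against a
# benchmark (illustrative; EFAS's operational scoring is more involved).
import numpy as np

def crps_ensemble(members, obs):
    m = np.asarray(members, dtype=float)
    # E|X - y| - 0.5 E|X - X'|, estimated over ensemble members
    return np.mean(np.abs(m - obs)) - 0.5 * np.mean(
        np.abs(m[:, None] - m[None, :]))

def crpss(crps_forecast, crps_benchmark):
    # 1 = perfect, 0 = no better than the benchmark, < 0 = worse
    return 1.0 - crps_forecast / crps_benchmark
```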
NASA Astrophysics Data System (ADS)
Zamani, K.; Bombardelli, F. A.
2014-12-01
Verification of geophysics codes is imperative to avoid serious academic as well as practical consequences. When access to a given source code is not possible, the Method of Manufactured Solutions (MMS) cannot be employed in code verification. In contrast, employing the Method of Exact Solutions (MES) has several practical advantages. In this research, we first provide four new one-dimensional analytical solutions designed for code verification; these solutions are able to uncover particular imperfections in solvers of the advection-diffusion-reaction (ADR) equation, such as nonlinear advection, diffusion or source terms, as well as non-constant coefficient equations. After that, we provide a solution of Burgers' equation in a novel setup. The proposed solutions satisfy continuity of mass for the ambient flow, which is a crucial factor for coupled hydrodynamics-transport solvers. We then use the derived analytical solutions for code verification. To clarify gray-literature issues in the verification of transport codes, we designed a comprehensive test suite that uncovers imperfections in transport solvers via a hierarchical increase in the level of test complexity. The test suite includes hundreds of unit tests and system tests that check individual portions of the code. The checks start with a simple case of unidirectional advection, proceed to bidirectional advection and tidal flow, and build up to nonlinear cases. We design tests to check nonlinearity in velocity, dispersivity and reactions. The concealing effect of scales (Peclet and Damkohler numbers) on the mesh-convergence study and appropriate remedies are also discussed. For cases in which appropriate benchmarks for a mesh-convergence study are not available, we utilize symmetry. Auxiliary subroutines for automation of the test suite and report generation are provided. All in all, the test package is not only a robust tool for code verification but also provides comprehensive insight into the capabilities of ADR solvers. Such information is essential for any rigorous computational modeling of the ADR equation for surface/subsurface pollution transport. We also convey our experiences in finding several errors that were not detectable with routine verification techniques.
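One step the abstract leans on implicitly is the mesh-convergence check: with an exact (MES) solution in hand, the error norm on successively refined grids yields an observed order of accuracy, which is compared against the scheme's formal order. In standard notation (ours, not the paper's), for a grid refined by a factor of two:

```latex
p_{\mathrm{obs}} = \frac{\ln\bigl(\lVert e_{2h}\rVert / \lVert e_{h}\rVert\bigr)}{\ln 2},
\qquad e_{h} = u_{h} - u_{\mathrm{exact}}
```

The concealing effect of Peclet and Damkohler numbers mentioned above shows up here: when advection or reaction dominates on coarse grids, the observed order can appear deceptively high or low until the mesh resolves the relevant scale.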
Synchronization Of Parallel Discrete Event Simulations
NASA Technical Reports Server (NTRS)
Steinman, Jeffrey S.
1992-01-01
Adaptive, parallel, discrete-event-simulation-synchronization algorithm, Breathing Time Buckets, developed in Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) operating system. Algorithm allows parallel simulations to process events optimistically in fluctuating time cycles that naturally adapt while simulation in progress. Combines best of optimistic and conservative synchronization strategies while avoiding major disadvantages. Well suited for modeling communication networks, for large-scale war games, for simulated flights of aircraft, for simulations of computer equipment, for mathematical modeling, for interactive engineering simulations, and for depictions of flows of information.
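A rough sketch of one synchronization cycle, reconstructed from the description above; all function names are hypothetical placeholders, not SPEEDES APIs.

```c
/* Hypothetical placeholders, not SPEEDES functions. */
double process_events_optimistically(double gvt);  /* returns min timestamp of msgs generated */
double global_min(double x);                       /* e.g., MPI_Allreduce with MPI_MIN */
void   commit_events_before(double t);
void   release_messages_before(double t);

/* One Breathing Time Buckets cycle: nodes process events optimistically,
   then commit only those events whose timestamps precede the earliest
   message generated anywhere during this cycle (the "event horizon"). */
double btb_cycle(double gvt) {
    double local_horizon = process_events_optimistically(gvt);
    double horizon = global_min(local_horizon);
    commit_events_before(horizon);    /* safe: no earlier message can arrive */
    release_messages_before(horizon); /* send messages those events generated */
    return horizon;                   /* lower bound for the next cycle */
}
```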
Implementations of BLAST for parallel computers.
Jülich, A
1995-02-01
The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
NASA Technical Reports Server (NTRS)
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel efficiency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
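The particle update at the heart of any PSO variant, synchronous or asynchronous, is the standard velocity/position rule sketched below; in the asynchronous scheme described above, the update fires as soon as a particle's evaluation returns rather than after a per-iteration barrier. Names and coefficients here are illustrative, not taken from the paper.

```c
#include <stdlib.h>

/* Standard PSO update for one particle.
   w, c1, c2 are the usual inertia and acceleration coefficients;
   pbest/gbest are the particle's and swarm's best known positions. */
void pso_update(double *x, double *v, const double *pbest,
                const double *gbest, int ndim,
                double w, double c1, double c2) {
    for (int d = 0; d < ndim; d++) {
        double r1 = rand() / (double)RAND_MAX;
        double r2 = rand() / (double)RAND_MAX;
        v[d] = w * v[d]
             + c1 * r1 * (pbest[d] - x[d])
             + c2 * r2 * (gbest[d] - x[d]);
        x[d] += v[d];
    }
    /* Asynchronous variant: call this immediately when this particle's
       fitness evaluation returns, using the gbest known at that moment,
       instead of waiting for every particle in the iteration to finish. */
}
```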
Biclustering as a method for RNA local multiple sequence alignment.
Wang, Shu; Gutell, Robin R; Miranker, Daniel P
2007-12-15
Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering is intended to address. We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension to that suite to larger problem sizes. Alignments of two large datasets of current biological interest, T box sequences and Group IC1 Introns, were also evaluated. The results were compared with alignments computed by the ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using Sum of Pairs (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or greater sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations; MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions. BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/
MPF: A portable message passing facility for shared memory multiprocessors
NASA Technical Reports Server (NTRS)
Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.
1987-01-01
The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested with it. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications: linear systems solution and iterative solution of partial differential equations.
Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)
2002-01-01
The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames, the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
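The MPI-2 side of the comparison uses the standard one-sided API; a minimal put-based exchange looks like the sketch below (a generic illustration, not the paper's CFD kernels).

```c
#include <mpi.h>

/* Each rank exposes a window over 'recvbuf' and writes its value into
   the buffer of rank (me+1), replacing a send/recv pair with MPI_Put. */
void rma_ring_exchange(double send, double *recvbuf, MPI_Comm comm) {
    int me, np;
    MPI_Comm_size(comm, &np);
    MPI_Comm_rank(comm, &me);

    MPI_Win win;
    MPI_Win_create(recvbuf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                     /* open access epoch */
    MPI_Put(&send, 1, MPI_DOUBLE, (me + 1) % np,
            0, 1, MPI_DOUBLE, win);            /* no matching receive */
    MPI_Win_fence(0, win);                     /* close epoch: data visible */

    MPI_Win_free(&win);
}
```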
A Robust and Scalable Software Library for Parallel Adaptive Refinement on Unstructured Meshes
NASA Technical Reports Server (NTRS)
Lou, John Z.; Norton, Charles D.; Cwik, Thomas A.
1999-01-01
The design and implementation of Pyramid, a software library for performing parallel adaptive mesh refinement (PAMR) on unstructured meshes, is described. This software library can be easily used in a variety of unstructured parallel computational applications, including parallel finite element, parallel finite volume, and parallel visualization applications using triangular or tetrahedral meshes. The library contains a suite of well-designed and efficiently implemented modules that perform operations in a typical PAMR process. Among these are mesh quality control during successive parallel adaptive refinement (typically guided by a local-error estimator), parallel load-balancing, and parallel mesh partitioning using the ParMeTiS partitioner. The Pyramid library is implemented in Fortran 90 with an interface to the Message-Passing Interface (MPI) library, supporting code efficiency, modularity, and portability. An EM waveguide filter application, adaptively refined using the Pyramid library, is illustrated.
Parallel Signal Processing and System Simulation using aCe
NASA Technical Reports Server (NTRS)
Dorband, John E.; Aburdene, Maurice F.
2003-01-01
Recently, networked and cluster computation have become very popular for both signal processing and system simulation. The new language aCe is well suited for parallel signal processing applications and system simulation, since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, this new C-based parallel language for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we focus on some fundamental features of aCe and present a signal processing application (FFT).
Parallel language constructs for tensor product computations on loosely coupled architectures
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1989-01-01
Distributed memory architectures offer high levels of performance and flexibility, but have proven awkward to program. Current languages for nonshared memory architectures provide a relatively low-level programming environment, and are poorly suited to modular programming and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The focus is on tensor product array computations, a simple but important class of numerical algorithms. The problem of programming 1-D kernel routines, such as parallel tridiagonal solvers, is considered first; then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
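The algebraic fact that lets 1-D kernels compose into tensor product algorithms is the standard Kronecker identity (our notation, not the paper's): for column-major vec,

```latex
(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{T})
```

so a 2-D operator A ⊗ B never needs to be formed explicitly: one applies B down the columns of X and A across the rows, each step being exactly the kind of 1-D kernel, such as a parallel tridiagonal solve, that the abstract describes.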
Parallel Implementation of the Discontinuous Galerkin Method
NASA Technical Reports Server (NTRS)
Baggag, Abdalkader; Atkins, Harold; Keyes, David
1999-01-01
This paper describes a parallel implementation of the discontinuous Galerkin method. Discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in an object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects.
Massively parallel implementation of 3D-RISM calculation with volumetric 3D-FFT.
Maruyama, Yutaka; Yoshida, Norio; Tadano, Hiroto; Takahashi, Daisuke; Sato, Mitsuhisa; Hirata, Fumio
2014-07-05
A new three-dimensional reference interaction site model (3D-RISM) program for massively parallel machines combined with the volumetric 3D fast Fourier transform (3D-FFT) was developed, and tested on the RIKEN K supercomputer. The ordinary parallel 3D-RISM program has a limitation on the number of parallelizations because of the limitations of the slab-type 3D-FFT. The volumetric 3D-FFT relieves this limitation drastically. We tested the 3D-RISM calculation on a large and fine calculation cell (2048^3 grid points) on 16,384 nodes, each having eight CPU cores. The new 3D-RISM program achieved excellent parallel scalability running on the RIKEN K supercomputer. As a benchmark application, we employed the program, combined with molecular dynamics simulation, to analyze the oligomerization process of the chymotrypsin inhibitor 2 mutant. The results demonstrate that the massively parallel 3D-RISM program is effective for analyzing the hydration properties of large biomolecular systems. Copyright © 2014 Wiley Periodicals, Inc.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)
2002-01-01
In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: the OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads, and the Paraver graphical user interface for inspection and analysis of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory architectures.
Understanding the Cray X1 System
NASA Technical Reports Server (NTRS)
Cheung, Samson
2004-01-01
This paper helps the reader understand the characteristics of the Cray X1 vector supercomputer system, and provides hints and information to enable the reader to port codes to the system. It provides a comparison between the basic performance of the X1 platform and other platforms that are available at NASA Ames Research Center. A set of codes, solving the Laplacian equation with different parallel paradigms, is used to understand some features of the X1 compiler. An example code from the NAS Parallel Benchmarks is used to demonstrate performance optimization on the X1 platform.
High Performance Programming Using Explicit Shared Memory Model on the Cray T3D
NASA Technical Reports Server (NTRS)
Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)
1994-01-01
The Cray T3D is the first-phase system in Cray Research Inc.'s (CRI) three-phase massively parallel processing program. In this report we describe the architecture of the T3D, as well as the CRAFT (Cray Research Adaptive Fortran) programming model, and contrast it with PVM, which is also supported on the T3D. We present some performance data based on the NAS Parallel Benchmarks to illustrate both architectural and software features of the T3D.
NASA Technical Reports Server (NTRS)
Pedretti, Kevin T.; Fineberg, Samuel A.; Kutler, Paul (Technical Monitor)
1997-01-01
A variety of different network technologies and topologies are currently being evaluated as part of the Whitney Project. This paper reports on the implementation and performance of a Fast Ethernet network configured in a 4x4 2D torus topology in a testbed cluster of 'commodity' Pentium Pro PCs. Several benchmarks were used for performance evaluation: an MPI point-to-point message passing benchmark, an MPI collective communication benchmark, and the NAS Parallel Benchmarks version 2.2 (NPB2). Our results show that for point-to-point communication on an unloaded network, the hub and 1-hop routes on the torus have about the same bandwidth and latency. However, the bandwidth decreases and the latency increases on the torus for each additional route hop. Collective communication benchmarks show that the torus provides roughly four times more aggregate bandwidth and eight times faster MPI barrier synchronizations than a hub-based network for 16-processor systems. Finally, the SOAPBOX benchmarks, which simulate real-world CFD applications, generally demonstrated substantially better performance on the torus than on the hub. In the few cases where the hub was faster, the difference was negligible. In total, our experimental results lead to the conclusion that for Fast Ethernet networks, the torus topology has better performance and scales better than a hub-based network.
Integrating the Nqueens Algorithm into a Parameterized Benchmark Suite
2016-02-01
FOB is a 64-node heterogeneous cluster consisting of 16 IBM dx360M4 nodes, each with one NVIDIA Kepler K20M GPU, and 48 IBM dx360M4 nodes, and each... nodes have 256 GB of memory and an NVIDIA Tesla K40 GPU. More details on Excalibur can be found on the US Army DSRC website. Figures 3 and 4 show the
Commercial Building Energy Saver, API
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Tianzhen; Piette, Mary; Lee, Sang Hoon
2015-08-27
The CBES API provides an Application Programming Interface to a suite of functions to improve the energy efficiency of buildings, including building energy benchmarking, preliminary retrofit analysis using the pre-simulation database DEEP, and detailed retrofit analysis using energy modeling with the EnergyPlus simulation engine. The CBES API is used to power the LBNL CBES Web App. It can be adopted by third-party developers and vendors into their software tools and platforms.
Simulation of Benchmark Cases with the Terminal Area Simulation System (TASS)
NASA Technical Reports Server (NTRS)
Ahmad, Nash'at; Proctor, Fred
2011-01-01
The hydrodynamic core of the Terminal Area Simulation System (TASS) is evaluated against different benchmark cases. In the absence of closed form solutions for the equations governing atmospheric flows, the models are usually evaluated against idealized test cases. Over the years, various authors have suggested a suite of these idealized cases which have become standards for testing and evaluating the dynamics and thermodynamics of atmospheric flow models. In this paper, simulations of three such cases are described. In addition, the TASS model is evaluated against a test case that uses an exact solution of the Navier-Stokes equations. The TASS results are compared against previously reported simulations of these benchmark cases in the literature. It is demonstrated that the TASS model is highly accurate, stable and robust.
The ab-initio density matrix renormalization group in practice.
Olivares-Amaya, Roberto; Hu, Weifeng; Nakatani, Naoki; Sharma, Sandeep; Yang, Jun; Chan, Garnet Kin-Lic
2015-01-21
The ab-initio density matrix renormalization group (DMRG) is a tool that can be applied to a wide variety of interesting problems in quantum chemistry. Here, we examine the density matrix renormalization group from the vantage point of the quantum chemistry user. What kinds of problems is the DMRG well-suited to? What are the largest systems that can be treated at practical cost? What sort of accuracies can be obtained, and how do we reason about the computational difficulty in different molecules? By examining a diverse benchmark set of molecules: π-electron systems, benchmark main-group and transition metal dimers, and the Mn-oxo-salen and Fe-porphine organometallic compounds, we provide some answers to these questions, and show how the density matrix renormalization group is used in practice.
de Vreede, Gert-Jan; Briggs, Robert O; Reiter-Palmon, Roni
2010-04-01
The aim of this study was to compare the results of two different modes of using multiple groups (instead of one large group) to identify problems and develop solutions. Many of the complex problems facing organizations today require the use of very large groups or collaborations of groups from multiple organizations. There are many logistical problems associated with the use of such large groups, including the ability to bring everyone together at the same time and location. A field study involving two different organizations compared the productivity and satisfaction of groups. The approaches included (a) multiple small groups, each completing the entire process from start to end and combining the results at the end (parallel mode); and (b) multiple subgroups, each building on the work provided by previous subgroups (serial mode). Groups using the serial mode produced more elaborations compared with parallel groups, whereas parallel groups produced more unique ideas compared with serial groups. No significant differences were found related to satisfaction with process and outcomes between the two modes. The preferred mode depends on the type of task facing the group. Parallel groups are more suited for tasks for which a variety of new ideas are needed, whereas serial groups are best suited when elaboration and in-depth thinking on the solution are required. Results of this research can guide the development of facilitated sessions of large groups or "teams of teams."
Benchmarking Memory Performance with the Data Cube Operator
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Shabanov, Leonid V.
2004-01-01
Data movement across a computer memory hierarchy and across computational grids is known to be a limiting factor for applications processing large data sets. We use the Data Cube Operator on an Arithmetic Data Set, called ADC, to benchmark the capabilities of computers and of computational grids to handle large distributed data sets. We present a prototype implementation of a parallel algorithm for computation of the operator. The algorithm follows a known approach for computing views from the smallest parent. The ADC stresses all levels of grid memory and storage by producing some of the 2^d views of an Arithmetic Data Set of d-tuples described by a small number of integers. We control the data intensity of the ADC by selecting the tuple parameters, the sizes of the views, and the number of realized views. Benchmarking results of memory performance of a number of computer architectures and of a small computational grid are presented.
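A sketch of the "smallest parent" rule the algorithm follows, with views (group-bys) encoded as attribute bitmasks; this is a generic illustration, not the benchmark's implementation.

```c
/* Pick the smallest already-computed view whose attribute set contains
   all attributes of view v. Views are bitmasks over d attributes;
   size[p] is the row count of view p; computed[p] flags availability
   (the full mask, i.e. the raw data set, is always available). */
unsigned smallest_parent(unsigned v, int d,
                         const long *size, const int *computed) {
    unsigned best = (1u << d) - 1;   /* fall back to the raw data set */
    for (unsigned p = 0; p < (1u << d); p++)
        if (p != v && computed[p] && (p & v) == v && size[p] < size[best])
            best = p;                /* p is a superset of v and smaller */
    return best;
}
```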
Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing
NASA Technical Reports Server (NTRS)
Fricker, David M.
1997-01-01
The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.
Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations
NASA Astrophysics Data System (ADS)
Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.
2016-07-01
Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.
Using a constraint on the parallel velocity when determining electric fields with EISCAT
NASA Technical Reports Server (NTRS)
Caudal, G.; Blanc, M.
1988-01-01
A method is proposed to determine the perpendicular components of the ion velocity vector (and hence the perpendicular electric field) from EISCAT tristatic measurements, in which one introduces an additional constraint on the parallel velocity, in order to take account of our knowledge that the parallel velocity of ions is small. This procedure removes some artificial features introduced when the tristatic geometry becomes too unfavorable. It is particularly well suited for the southernmost or northernmost positions of the tristatic measurements performed by meridian scan experiments (CP3 mode).
Munguia, Lluis-Miquel; Oxberry, Geoffrey; Rajan, Deepak
2016-05-01
Stochastic mixed-integer programs (SMIPs) deal with optimization under uncertainty at many levels of the decision-making process. When solved as extensive formulation mixed-integer programs, problem instances can exceed available memory on a single workstation. In order to overcome this limitation, we present PIPS-SBB: a distributed-memory parallel stochastic MIP solver that takes advantage of parallelism at multiple levels of the optimization process. We also show promising results on the SIPLIB benchmark by combining methods known for accelerating Branch and Bound (B&B) methods with new ideas that leverage the structure of SMIPs. Finally, we expect the performance of PIPS-SBB to improve further as more functionality is added in the future.
Gust Acoustics Computation with a Space-Time CE/SE Parallel 3D Solver
NASA Technical Reports Server (NTRS)
Wang, X. Y.; Himansu, A.; Chang, S. C.; Jorgenson, P. C. E.; Reddy, D. R. (Technical Monitor)
2002-01-01
The benchmark Problem 2 in Category 3 of the Third Computational Aero-Acoustics (CAA) Workshop is solved using the space-time conservation element and solution element (CE/SE) method. This problem concerns the unsteady response of an isolated finite-span swept flat-plate airfoil bounded by two parallel walls to an incident gust. The acoustic field generated by the interaction of the gust with the flat-plate airfoil is computed by solving the 3D (three-dimensional) Euler equations in the time domain using a parallel version of a 3D CE/SE solver. The effect of the gust orientation on the far-field directivity is studied. Numerical solutions are presented and compared with analytical solutions, showing a reasonable agreement.
Maximal clique enumeration with data-parallel primitives
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lessley, Brenton; Perciano, Talita; Mathai, Manish
The enumeration of all maximal cliques in an undirected graph is a fundamental problem arising in several research areas. We consider maximal clique enumeration on shared-memory, multi-core architectures and introduce an approach consisting entirely of data-parallel operations, in an effort to achieve efficient and portable performance across different architectures. We study the performance of the algorithm via experiments varying over benchmark graphs and architectures. Overall, we observe that our algorithm achieves up to a 33-fold speedup and a 9-fold speedup over state-of-the-art distributed and serial algorithms, respectively, for graphs with higher ratios of maximal cliques to total cliques. Further, we attain additional speedups on a GPU architecture, demonstrating the portable performance of our data-parallel design.
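For reference, the serial recursion that data-parallel and distributed maximal-clique enumerators are typically measured against is Bron-Kerbosch; below is a compact bitmask version for small graphs (our illustration, not the paper's data-parallel formulation).

```c
#include <stdint.h>
#include <stdio.h>

/* Bron-Kerbosch maximal clique enumeration for graphs of <= 64 vertices.
   adj[v] is the neighbor bitmask of vertex v; R, P, X are the usual
   current-clique, candidate, and excluded sets. Uses the GCC/Clang
   builtin __builtin_ctzll to extract the lowest set bit. */
static uint64_t adj[64];

void bron_kerbosch(uint64_t R, uint64_t P, uint64_t X) {
    if (P == 0 && X == 0) {                  /* R is maximal: report it */
        printf("maximal clique: 0x%llx\n", (unsigned long long)R);
        return;
    }
    while (P) {
        int v = __builtin_ctzll(P);          /* pick a candidate vertex */
        uint64_t bit = 1ULL << v;
        bron_kerbosch(R | bit, P & adj[v], X & adj[v]);
        P &= ~bit;                           /* move v from P to X */
        X |= bit;
    }
}
```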
Debugging and Analysis of Large-Scale Parallel Programs
1989-09-01
Code Parallelization with CAPO: A User Manual
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2001-01-01
A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools, developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software, in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This demonstrates the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.
Parallelism in Manipulator Dynamics. Revision.
1983-12-01
This paper addresses the problem of efficiently computing the motor torques required to drive a lower-pair kinematic chain (e.g., a typical manipulator arm in free motion, or a mechanical leg in the ...), and presents two "mathematically exact" formulations especially suited to high-speed, highly parallel implementations using special-purpose hardware.
dfnWorks: A discrete fracture network framework for modeling subsurface flow and transport
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hyman, Jeffrey D.; Karra, Satish; Makedonska, Nataliia
DFNWORKS is a parallelized computational suite to generate three-dimensional discrete fracture networks (DFN) and simulate flow and transport. Developed at Los Alamos National Laboratory over the past five years, it has been used to study flow and transport in fractured media at scales ranging from millimeters to kilometers. The networks are created and meshed using DFNGEN, which combines FRAM (the feature rejection algorithm for meshing) methodology to stochastically generate three-dimensional DFNs with the LaGriT meshing toolbox to create a high-quality computational mesh representation. The representation produces a conforming Delaunay triangulation suitable for high performance computing finite volume solvers in an intrinsically parallel fashion. Flow through the network is simulated in dfnFlow, which utilizes the massively parallel subsurface flow and reactive transport finite volume code PFLOTRAN. A Lagrangian approach to simulating transport through the DFN is adopted within DFNTRANS to determine pathlines and solute transport through the DFN. Example applications of this suite in the areas of nuclear waste repository science, hydraulic fracturing and CO2 sequestration are also included.
Reconstruction, Enhancement, Visualization, and Ergonomic Assessment for Laparoscopic Surgery
2007-02-01
(1) support and upgrade of the REVEAL display system and tool suite in the University of Maryland Medical Center's Simulation Center, (2) stereo video display... technology deployment, (3) stereo probe calibration benchmarks and support tools, (4) the production of research media, (5) baseline results from... endoscope can be used to generate a stereoscopic view for a surgeon, as with the DaVinci robot in use today.
Automated Generation of Message-Passing Programs: An Evaluation Using CAPTools
NASA Technical Reports Server (NTRS)
Hribar, Michelle R.; Jin, Haoqiang; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
Scientists at NASA Ames Research Center have been developing computational aeroscience applications on highly parallel architectures over the past ten years. During that same time period, a steady transition of hardware and system software also occurred, forcing us to expend great effort migrating and re-coding our applications. As applications and machine architectures become increasingly complex, the cost and time required for this process will become prohibitive. In this paper, we present the first set of results in our evaluation of interactive parallelization tools. In particular, we evaluate CAPTools' ability to parallelize computational aeroscience applications. CAPTools was tested on serial versions of the NAS Parallel Benchmarks and ARC3D, a computational fluid dynamics application, on two platforms: the SGI Origin 2000 and the Cray T3E. This evaluation includes performance, amount of user interaction required, limitations and portability. Based on these results, a discussion on the feasibility of computer-aided parallelization of aerospace applications is presented along with suggestions for future work.
NASA Astrophysics Data System (ADS)
Quan, Zhe; Wu, Lei
2017-09-01
This article investigates the use of parallel computing for solving the disjunctively constrained knapsack problem. The proposed parallel computing model can be viewed as a cooperative algorithm based on a multi-neighbourhood search. The cooperation system is composed of a team manager and a crowd of team members. The team members aim at applying their own search strategies to explore the solution space. The team manager collects the solutions from the members and shares the best one with them. The performance of the proposed method is evaluated on a group of benchmark data sets. The results obtained are compared to those reached by the best methods from the literature. The results show that the proposed method is able to provide the best solutions in most cases. In order to highlight the robustness of the proposed parallel computing model, a new set of large-scale instances is introduced. Encouraging results have been obtained.
Time-Dependent Simulations of Turbopump Flows
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Kwak, Dochan; Chan, William; Williams, Robert
2002-01-01
Unsteady flow simulations for the RLV (Reusable Launch Vehicle) 2nd Generation baseline turbopump, covering one and a half impeller rotations, have been completed using a 34.3 million grid point model. MLP (Multi-Level Parallelism) shared memory parallelism has been implemented in INS3D and benchmarked. Code optimization for cache-based platforms will be completed by the end of September 2001. Moving boundary capability is obtained by using the DCF module. Scripting capability from CAD (computer-aided design) geometry to solution has been developed. Data compression is applied to reduce data size in post-processing. Fluid/structure coupling has been initiated.
Microscope mode secondary ion mass spectrometry imaging with a Timepix detector.
Kiss, Andras; Jungmann, Julia H; Smith, Donald F; Heeren, Ron M A
2013-01-01
In-vacuum active pixel detectors enable high sensitivity, highly parallel time- and space-resolved detection of ions from complex surfaces. For the first time, a Timepix detector assembly was combined with a secondary ion mass spectrometer for microscope mode secondary ion mass spectrometry (SIMS) imaging. Time resolved images from various benchmark samples demonstrate the imaging capabilities of the detector system. The main advantages of the active pixel detector are the higher signal-to-noise ratio and parallel acquisition of arrival time and position. Microscope mode SIMS imaging of biomolecules is demonstrated from tissue sections with the Timepix detector.
3D hyperpolarized C-13 EPI with calibrationless parallel imaging
NASA Astrophysics Data System (ADS)
Gordon, Jeremy W.; Hansen, Rie B.; Shin, Peter J.; Feng, Yesu; Vigneron, Daniel B.; Larson, Peder E. Z.
2018-04-01
With the translation of metabolic MRI with hyperpolarized 13C agents into the clinic, imaging approaches will require large volumetric FOVs to support clinical applications. Parallel imaging techniques will be crucial to increasing volumetric scan coverage while minimizing RF requirements and temporal resolution. Calibrationless parallel imaging approaches are well-suited for this application because they eliminate the need to acquire coil profile maps or auto-calibration data. In this work, we explored the utility of a calibrationless parallel imaging method (SAKE) and corresponding sampling strategies to accelerate and undersample hyperpolarized 13C data using 3D blipped EPI acquisitions and multichannel receive coils, and demonstrated its application in a human study of [1-13C]pyruvate metabolism.
Visions of the Future: Hybrid Electric Aircraft Propulsion
NASA Technical Reports Server (NTRS)
Bowman, Cheryl L.
2016-01-01
The National Aeronautics and Space Administration (NASA) is investing continually in improving civil aviation. Hybridization of aircraft propulsion is one aspect of a technology suite which will transform future aircraft. In this context, hybrid propulsion is considered a combination of traditional gas turbine propulsion and electric drive enabled propulsion. This technology suite includes elements of propulsion and airframe integration, parallel hybrid shaft power, turbo-electric generation, electric drive systems, component development, materials development and system integration at multiple levels.
A Comparison of Methods for Assessing Space Suit Joint Ranges of Motion
NASA Technical Reports Server (NTRS)
Aitchison, Lindsay T.
2012-01-01
Through the Advanced Exploration Systems (AES) Program, NASA is attempting to use the vast collection of space suit mobility data from 50 years' worth of space suit testing to build predictive analysis tools to aid in early architecture decisions for future missions and exploration programs. However, the design engineers must first understand if and how data generated by different methodologies can be compared directly and used in an essentially interchangeable manner. To address this question, the isolated joint range of motion data from two different test series were compared. Both data sets were generated from participants wearing the Mark III Space Suit Technology Demonstrator (MK-III), Waist Entry I-suit (WEI), and minimal clothing. Additionally, the two tests shared a common test subject, which allowed for within-subject comparisons of the methods and greatly reduced the number of variables in play. The tests varied in their methodologies: the Space Suit Comparative Technologies Evaluation used 2-D photogrammetry to analyze isolated ranges of motion, while the Constellation space suit benchmarking and requirements development used 3-D motion capture to evaluate both isolated and functional joint ranges of motion. The isolated data from both test series were compared graphically, as percent differences, and by simple statistical analysis. The results indicated that while the methods generate results that are statistically the same (significance level p = 0.01), the differences are significant enough in the practical sense to make direct comparisons ill advised. The concluding recommendations propose direction for how to bridge the data gaps and address future mobility data collection to allow for backward compatibility.
Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal
Modern hardware contains parallel execution resources that are well suited for data-parallelism (vector units) and for task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thrower, A.W.; Patric, J.; Keister, M.
2008-07-01
The purpose of the Office of Civilian Radioactive Waste Management's (OCRWM) Logistics Benchmarking Project is to identify established government and industry practices for the safe transportation of hazardous materials which can serve as a yardstick for design and operation of OCRWM's national transportation system for shipping spent nuclear fuel and high-level radioactive waste to the proposed repository at Yucca Mountain, Nevada. The project will present logistics and transportation practices and develop implementation recommendations for adaptation by the national transportation system. This paper will describe the process used to perform the initial benchmarking study, highlight interim findings, and explain how these findings are being implemented. It will also provide an overview of the next phase of benchmarking studies. The benchmarking effort will remain a high-priority activity throughout the planning and operational phases of the transportation system. The initial phase of the project focused on government transportation programs to identify those practices which are most clearly applicable to OCRWM. These Federal programs have decades of safe transportation experience, strive for excellence in operations, and implement effective stakeholder involvement, all of which parallel OCRWM's transportation mission and vision. The initial benchmarking project focused on four business processes that are critical to OCRWM's mission success, and can be incorporated into OCRWM planning and preparation in the near term. The processes examined were: transportation business model, contract management/out-sourcing, stakeholder relations, and contingency planning. More recently, OCRWM examined logistics operations of AREVA NC's Business Unit Logistics in France. The next phase of benchmarking will focus on integrated domestic and international commercial radioactive logistic operations. The prospective companies represent large-scale shippers and have vast experience in safely and efficiently shipping spent nuclear fuel and other radioactive materials. Additional business processes may be examined in this phase. The findings of these benchmarking efforts will help determine the organizational structure and requirements of the national transportation system. (authors)
Telescience Resource Kit (TReK)
NASA Technical Reports Server (NTRS)
Lippincott, Jeff
2015-01-01
Telescience Resource Kit (TReK) is one of the Huntsville Operations Support Center (HOSC) remote operations solutions. It can be used to monitor and control International Space Station (ISS) payloads from anywhere in the world. It comprises a suite of software applications and libraries that provide generic data system capabilities and access to HOSC services. The TReK software has been operational since 2000. A new cross-platform version of TReK is under development. The new software is being released in phases during the 2014-2016 timeframe. The TReK Release 3.x series of software is the original TReK software that has been operational since 2000. This software runs on Windows. It contains capabilities to support traditional telemetry and commanding using CCSDS (Consultative Committee for Space Data Systems) packets. The TReK Release 4.x series of software is the new cross-platform software. It runs on Windows and Linux. The new TReK software will support communication using standard IP protocols and traditional telemetry and commanding. All the software listed above is compatible and can be installed and run together on Windows. The new TReK software contains a suite of software that can be used by payload developers on the ground and onboard (TReK Toolkit). TReK Toolkit is a suite of lightweight libraries and utility applications for use onboard and on the ground. TReK Desktop is the full suite of TReK software, most useful on the ground. When TReK Desktop is released, the TReK installation program will provide the option to choose just the TReK Toolkit portion of the software or the full TReK Desktop suite. The ISS program is providing the TReK Toolkit software as a generic flight software capability offered as a standard service to payloads. TReK software verification was conducted during the April/May 2015 timeframe. Payload teams using the TReK software onboard can reference the TReK software verification. TReK will be demonstrated on-orbit running on an ISS-provided T61p laptop. Target timeframe: September 2015-2016. The on-orbit demonstration will collect benchmark metrics, and will be used in the future to provide live demonstrations during ISS Payload Conferences. Benchmark metrics and demonstrations will address the protocols described in SSP 52050-0047 Ku Forward section 3.3.7. (Associated term: CCSDS File Delivery Protocol (CFDP).)
Comparing a Coevolutionary Genetic Algorithm for Multiobjective Optimization
NASA Technical Reports Server (NTRS)
Lohn, Jason D.; Kraus, William F.; Haith, Gary L.; Clancy, Daniel (Technical Monitor)
2002-01-01
We present results from a study comparing a recently developed coevolutionary genetic algorithm (CGA) against a set of evolutionary algorithms using a suite of multiobjective optimization benchmarks. The CGA embodies competitive coevolution and employs a simple, straightforward target population representation and fitness calculation based on developmental theory of learning. Because of these properties, setting up the additional population is trivial making implementation no more difficult than using a standard GA. Empirical results using a suite of two-objective test functions indicate that this CGA performs well at finding solutions on convex, nonconvex, discrete, and deceptive Pareto-optimal fronts, while giving respectable results on a nonuniform optimization. On a multimodal Pareto front, the CGA finds a solution that dominates solutions produced by eight other algorithms, yet the CGA has poor coverage across the Pareto front.
Morlock, Gertrud E; Meyer, Stephanie; Zimmermann, Benno F; Roussel, Jean-Marc
2014-07-11
A high-performance TLC (HPTLC) method was newly developed and validated for analysis of 7 steviol glycosides in 6 different types of food and Stevia formulations. After a minimized one-step sample preparation, 21 samples were developed in parallel, allowing an effective food screening. Depending on the sample application volume, the method was suited to analyze food sample concentrations in the mg/kg range. LOQs of stevioside in natural yoghurt matrix spiked at 0.02, 0.13 and 0.2% were determined by the calibration curve method to be 12 ng/band (peak height). ANOVA was successfully passed to prove data homogeneity in the working range (30-600 ng/band). The accuracy (recovery tolerance limit, 92-120%), repeatability (3.1-5.4%) and intermediate precision (4.0-8.4%) were determined for stevioside in milk-based matrix including sample preparation and recovery rates at 3 different concentration levels. For the first time, the recording of HPTLC-ESI-MS spectra via the TLC-MS Interface was demonstrated for rebaudioside A. HPTLC contents for rebaudioside A were compared with results of two (U)HPLC methods. The running costs and analysis time of the three different methods were discussed in detail with regard to screening of food products. Copyright © 2014 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Fakhari, Abbas; Mitchell, Travis; Leonardi, Christopher; Bolster, Diogo
2017-11-01
Based on phase-field theory, we introduce a robust lattice-Boltzmann equation for modeling immiscible multiphase flows at large density and viscosity contrasts. Our approach is built by modifying the method proposed by Zu and He [Phys. Rev. E 87, 043301 (2013), 10.1103/PhysRevE.87.043301] in such a way as to improve efficiency and numerical stability. In particular, we employ a different interface-tracking equation based on the so-called conservative phase-field model, a simplified equilibrium distribution that decouples pressure and velocity calculations, and a local scheme based on the hydrodynamic distribution functions for calculation of the stress tensor. In addition to two distribution functions for interface tracking and recovery of hydrodynamic properties, the only nonlocal variable in the proposed model is the phase field. Moreover, within our framework there is no need to use biased or mixed difference stencils for numerical stability and accuracy at high density ratios. This not only simplifies the implementation of the model and improves its efficiency, but also leads to a model that is better suited to parallel implementation on distributed-memory machines. Several benchmark cases are considered to assess the efficacy of the proposed model, including the layered Poiseuille flow in a rectangular channel, Rayleigh-Taylor instability, and the rise of a Taylor bubble in a duct. The numerical results are in good agreement with available numerical and experimental data.
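The conservative phase-field model referred to here is, in its commonly used form (e.g., Chiu and Lin), an Allen-Cahn-type interface-tracking equation; we quote the standard form as an assumption, since the abstract gives no equations:

```latex
\frac{\partial \phi}{\partial t} + \nabla \cdot (\phi \mathbf{u})
  = \nabla \cdot \left[ M \left( \nabla \phi
    - \frac{4\phi(1-\phi)}{\xi}\,\hat{\mathbf{n}} \right) \right],
\qquad
\hat{\mathbf{n}} = \frac{\nabla \phi}{|\nabla \phi|}
```

Here φ is the phase indicator, M a mobility, and ξ the interface thickness. Unlike the fourth-order Cahn-Hilliard equation, this form involves only second derivatives, which is consistent with the locality, and hence the suitability for distributed-memory parallelism, that the authors emphasize.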
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lell, R. M.; Schaefer, R. W.; McKnight, R. D.
Over a period of 30 years more than a hundred Zero Power Reactor (ZPR) critical assemblies were constructed at Argonne National Laboratory. The ZPR facilities, ZPR-3, ZPR-6, ZPR-9 and ZPPR, were all fast critical assembly facilities. The ZPR critical assemblies were constructed to support fast reactor development, but data from some of these assemblies are also well suited to form the basis for criticality safety benchmarks. Of the three classes of ZPR assemblies, engineering mockups, engineering benchmarks and physics benchmarks, the last group tends to be most useful for criticality safety. Because physics benchmarks were designed to test fast reactor physics data and methods, they were as simple as possible in geometry and composition. The principal fissile species was 235U or 239Pu. Fuel enrichments ranged from 9% to 95%. Often there were only one or two main core diluent materials, such as aluminum, graphite, iron, sodium or stainless steel. The cores were reflected (and insulated from room return effects) by one or two layers of materials such as depleted uranium, lead or stainless steel. Despite their more complex nature, a small number of assemblies from the other two classes would make useful criticality safety benchmarks because they have features related to criticality safety issues, such as reflection by soil-like material. The term 'benchmark' in a ZPR program connotes a particularly simple loading aimed at gaining basic reactor physics insight, as opposed to studying a reactor design. In fact, the ZPR-6/7 Benchmark Assembly (Reference 1) had a very simple core unit cell assembled from plates of depleted uranium, sodium, iron oxide, U3O8, and plutonium. The ZPR-6/7 core cell-average composition is typical of the interior region of liquid-metal fast breeder reactors (LMFBRs) of the era. It was one part of the Demonstration Reactor Benchmark Program, which provided integral experiments characterizing the important features of demonstration-size LMFBRs. As a benchmark, ZPR-6/7 was devoid of many 'real' reactor features, such as simulated control rods and multiple enrichment zones, in its reference form. Those kinds of features were investigated experimentally in variants of the reference ZPR-6/7 or in other critical assemblies in the Demonstration Reactor Benchmark Program.
Nuclide Depletion Capabilities in the Shift Monte Carlo Code
Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...
2017-12-21
A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.
Adding Fault Tolerance to NPB Benchmarks Using ULFM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Parchman, Zachary W; Vallee, Geoffroy R; Naughton III, Thomas J
2016-01-01
In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerance is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a modification of some of the benchmarks of the NAS parallel benchmark (NPB) to include support for the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to checkpoint and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault-tolerant strategies on the application execution.
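To make the ULFM recovery pattern concrete, the sketch below shows the shrink-and-rollback loop such an extension implies, in C. It assumes a ULFM-capable MPI (for example, Open MPI built with ULFM support, which exposes the MPIX_ calls through mpi-ext.h); checkpoint_save and checkpoint_load are hypothetical stand-ins for the paper's checkpoint/restore library, not its actual API.

```c
/* Sketch of application-level recovery with ULFM; assumptions noted above. */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM MPIX_ prototypes (Open MPI) */

static double state[1024];                           /* solver data to protect */
static void checkpoint_save(MPI_Comm c) { (void)c; } /* hypothetical */
static void checkpoint_load(MPI_Comm c) { (void)c; } /* hypothetical */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm work = MPI_COMM_WORLD;
    /* Return errors instead of aborting, so failures are survivable. */
    MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);

    for (int step = 0; step < 100; ++step) {
        checkpoint_save(work);
        int rc = MPI_Allreduce(MPI_IN_PLACE, state, 1024,
                               MPI_DOUBLE, MPI_SUM, work);
        if (rc != MPI_SUCCESS) {
            MPIX_Comm_revoke(work);           /* make all survivors see it */
            MPI_Comm shrunk;
            MPIX_Comm_shrink(work, &shrunk);  /* drop the failed ranks */
            if (work != MPI_COMM_WORLD) MPI_Comm_free(&work);
            work = shrunk;
            MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);
            checkpoint_load(work);            /* roll back, retry the step */
            --step;
        }
    }
    MPI_Finalize();
    return 0;
}
```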
Execution models for mapping programs onto distributed memory parallel computers
NASA Technical Reports Server (NTRS)
Sussman, Alan
1992-01-01
The problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine is addressed. The problem is discussed in the context of building a mapping compiler for a distributed memory parallel machine. The paper describes using execution models to drive the process of mapping a program in the most efficient way onto a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs, we show that the selection of the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from an implementation of a mapping compiler show that our execution models are accurate enough to select the best mapping technique for a given program.
Massively parallel quantum computer simulator
NASA Astrophysics Data System (ADS)
De Raedt, K.; Michielsen, K.; De Raedt, H.; Trieu, B.; Arnold, G.; Richter, M.; Lippert, Th.; Watanabe, H.; Ito, N.
2007-01-01
We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.
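The kernel such simulators distribute is the update of a 2^n-amplitude state vector under each gate; at 16 bytes per complex amplitude, 36 qubits indeed require about 1 TB, consistent with the figures above. A minimal serial sketch of that kernel in C follows (our illustration of the standard technique, not the authors' distributed code):

```c
/* Apply a single-qubit gate to an n-qubit state vector (serial sketch). */
#include <complex.h>
#include <stdio.h>

/* Apply the 2x2 unitary g to qubit t of the 2^n amplitude vector a:
 * amplitudes are updated in pairs whose indices differ in bit t. */
static void apply_gate(double complex *a, int n, int t,
                       const double complex g[2][2]) {
    size_t dim = (size_t)1 << n, stride = (size_t)1 << t;
    for (size_t i = 0; i < dim; i += 2 * stride)
        for (size_t j = i; j < i + stride; ++j) {
            double complex a0 = a[j], a1 = a[j + stride];
            a[j]          = g[0][0] * a0 + g[0][1] * a1;
            a[j + stride] = g[1][0] * a0 + g[1][1] * a1;
        }
}

int main(void) {
    double complex a[8] = { 1.0 };          /* 3 qubits, start in |000> */
    const double s = 0.7071067811865476;    /* 1/sqrt(2) */
    const double complex H[2][2] = { { s, s }, { s, -s } };
    apply_gate(a, 3, 0, H);                 /* Hadamard on qubit 0 */
    for (int i = 0; i < 8; ++i)
        printf("|%d>: %.3f%+.3fi\n", i, creal(a[i]), cimag(a[i]));
    return 0;
}
```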
A Flow Solver for Three-Dimensional DRAGON Grids
NASA Technical Reports Server (NTRS)
Liou, Meng-Sing; Zheng, Yao
2002-01-01
The DRAGONFLOW code has been developed to solve three-dimensional Navier-Stokes equations over a complex geometry whose flow domain is discretized with the DRAGON grid, a combination of a Chimera grid and a collection of unstructured grids. In the DRAGONFLOW suite, both OVERFLOW and USM3D are presented in the form of module libraries, and a master module controls the invoking of these individual modules. This report includes essential aspects, programming structures, benchmark tests and numerical simulations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xia, Yidong; Andrs, David; Martineau, Richard Charles
This document presents the theoretical background for a hybrid finite-element / finite-volume fluid flow solver, namely BIGHORN, based on the Multiphysics Object Oriented Simulation Environment (MOOSE) computational framework developed at the Idaho National Laboratory (INL). An overview of the numerical methods used in BIGHORN is given, followed by a presentation of the formulation details. The document begins with the governing equations for the compressible fluid flow, with an outline of the requisite constitutive relations. A second-order finite volume method used for solving the compressible fluid flow problems is presented next. A Pressure-Corrected Implicit Continuous-fluid Eulerian (PCICE) formulation for time integration is also presented. Although the multi-fluid formulation is still being developed and is not yet complete, BIGHORN has been designed to handle multi-fluid problems. Due to the flexibility in the underlying MOOSE framework, BIGHORN is quite extensible, and can accommodate both multi-species and multi-phase formulations. This document also presents a suite of verification & validation benchmark test problems for BIGHORN. The intent for this suite of problems is to provide baseline comparison data that demonstrates the performance of the BIGHORN solution methods on problems that vary in complexity from laminar to turbulent flows. Wherever possible, some form of solution verification has been attempted to identify sensitivities in the solution methods, and suggest best practices when using BIGHORN.
Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Dimpsey, Robert Tod
1992-01-01
In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion or response time of a given application. However, few evaluation efforts have addressed this issue.
The Metropolis Monte Carlo method with CUDA enabled Graphic Processing Units
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hall, Clifford; School of Physics, Astronomy, and Computational Sciences, George Mason University, 4400 University Dr., Fairfax, VA 22030; Ji, Weixiao
2014-02-01
We present a CPU–GPU system for runtime acceleration of large molecular simulations using GPU computation and memory swaps. The memory architecture of the GPU can be used both as container for simulation data stored on the graphics card and as floating-point code target, providing an effective means for the manipulation of atomistic or molecular data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform atomistic and molecular simulations are essential. Our system implements a versatile molecular engine, including inter-molecule interactions and orientational variables for performing the Metropolis Monte Carlo (MMC) algorithm, which is one type of Markov chain Monte Carlo. By combining memory objects with floating-point code fragments we have implemented an MMC parallel engine that entirely avoids the communication time of molecular data at runtime. Our runtime acceleration system is a forerunner of a new class of CPU–GPU algorithms exploiting memory concepts combined with threading for avoiding bus bandwidth and communication. The testbed molecular system used here is a condensed phase system of oligopyrrole chains. A benchmark shows a size scaling speedup of 60 for systems with 210,000 pyrrole monomers. Our implementation can easily be combined with MPI to connect in parallel several CPU–GPU duets. Highlights: (1) We parallelize the Metropolis Monte Carlo (MMC) algorithm on one CPU–GPU duet. (2) The Adaptive Tempering Monte Carlo employs MMC and profits from this CPU–GPU implementation. (3) Our benchmark shows a size scaling-up speedup of 62 for systems with 225,000 particles. (4) The testbed involves a polymeric system of oligopyrroles in the condensed phase. (5) The CPU–GPU parallelization includes dipole-dipole and Mie-Jones classic potentials.
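At the heart of MMC is the Metropolis acceptance rule; the paper's contribution is keeping this loop and its molecular data resident on the GPU, which the serial C sketch below does not attempt. The harmonic-well example in main is purely illustrative:

```c
/* Metropolis acceptance rule and a toy 1-D sampler (serial sketch). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Accept a trial move with energy change dE at temperature kT. */
static int metropolis_accept(double dE, double kT) {
    if (dE <= 0.0) return 1;                     /* downhill: always */
    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    return u < exp(-dE / kT);                    /* uphill: Boltzmann */
}

int main(void) {
    double x = 0.0, kT = 1.0;   /* sample p(x) ~ exp(-x*x / (2 kT)) */
    for (int i = 0; i < 100000; ++i) {
        double trial = x + 0.5 * (2.0 * rand() / RAND_MAX - 1.0);
        double dE = 0.5 * (trial * trial - x * x);
        if (metropolis_accept(dE, kT)) x = trial;
    }
    printf("final sample: %f\n", x);
    return 0;
}
```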
Automated Instrumentation, Monitoring and Visualization of PVM Programs Using AIMS
NASA Technical Reports Server (NTRS)
Mehra, Pankaj; VanVoorst, Brian; Yan, Jerry; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
We present views and analysis of the execution of several PVM (Parallel Virtual Machine) codes for Computational Fluid Dynamics on networks of Sparcstations, including: (1) NAS Parallel Benchmarks CG and MG; (2) a multi-partitioning algorithm for NAS Parallel Benchmark SP; and (3) an overset grid flowsolver. These views and analysis were obtained using our Automated Instrumentation and Monitoring System (AIMS) version 3.0, a toolkit for debugging the performance of PVM programs. We will describe the architecture, operation and application of AIMS. The AIMS toolkit contains: (1) Xinstrument, which can automatically instrument various computational and communication constructs in message-passing parallel programs; (2) Monitor, a library of runtime trace-collection routines; (3) VK (Visual Kernel), an execution-animation tool with source-code clickback; and (4) Tally, a tool for statistical analysis of execution profiles. Currently, Xinstrument can handle C and Fortran 77 programs using PVM 3.2.x; Monitor has been implemented and tested on Sun 4 systems running SunOS 4.1.2; and VK uses X11R5 and Motif 1.2. Data and views obtained using AIMS clearly illustrate several characteristic features of executing parallel programs on networked workstations: (1) the impact of long message latencies; (2) the impact of multiprogramming overheads and associated load imbalance; (3) cache and virtual-memory effects; and (4) significant skews between workstation clocks. Interestingly, AIMS can compensate for constant skew (zero drift) by calibrating the skew between a parent and its spawned children. In addition, AIMS' skew-compensation algorithm can adjust timestamps in a way that eliminates physically impossible communications (e.g., messages going backwards in time). Our current efforts are directed toward creating new views to explain the observed performance of PVM programs. Some of the features planned for the near future include: (1) ConfigView, showing the physical topology of the virtual machine, inferred using specially formatted IP (Internet Protocol) packets; and (2) LoadView, synchronous animation of PVM-program execution and resource-utilization patterns.
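The constant-skew compensation described above amounts to choosing a single clock offset for a child process such that no traced message appears to arrive before it was sent. The C sketch below illustrates one way to compute such an offset from matched send/receive timestamp pairs; the data structures and names are ours, not AIMS internals:

```c
/* Constant clock-skew compensation from traced message timestamps. */
#include <stdio.h>

typedef struct { double send_ts, recv_ts; } msg_t;

/* p2c: parent-to-child messages (send on the parent clock, receive on
 * the child clock); c2p: child-to-parent. Returns an offset to add to
 * all child timestamps: the midpoint of the feasible interval in which
 * every message has nonnegative apparent latency. */
static double skew_offset(const msg_t *p2c, int np,
                          const msg_t *c2p, int nc) {
    double lo = -1e300, hi = 1e300;
    for (int i = 0; i < np; ++i) {       /* need recv + off >= send */
        double b = p2c[i].send_ts - p2c[i].recv_ts;
        if (b > lo) lo = b;
    }
    for (int i = 0; i < nc; ++i) {       /* need recv >= send + off */
        double b = c2p[i].recv_ts - c2p[i].send_ts;
        if (b < hi) hi = b;
    }
    return 0.5 * (lo + hi);
}

int main(void) {
    msg_t down[] = { { 10.0, 9.2 }, { 20.0, 19.1 } };  /* child lags */
    msg_t up[]   = { { 12.0, 13.4 }, { 22.0, 23.6 } };
    printf("child clock offset = %.3f s\n", skew_offset(down, 2, up, 2));
    return 0;
}
```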
Enhanced Verification Test Suite for Physics Simulation Codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kamm, J R; Brock, J S; Brandon, S T
2008-10-10
This document discusses problems with which to augment, in quantity and in quality, the existing tri-laboratory suite of verification problems used by Los Alamos National Laboratory (LANL), Lawrence Livermore National Laboratory (LLNL), and Sandia National Laboratories (SNL). The purpose of verification analysis is to demonstrate whether the numerical results of the discretization algorithms in physics and engineering simulation codes provide correct solutions of the corresponding continuum equations. The key points of this document are: (1) Verification deals with mathematical correctness of the numerical algorithms in a code, while validation deals with physical correctness of a simulation in a regime of interest. This document is about verification. (2) The current seven-problem Tri-Laboratory Verification Test Suite, which has been used for approximately five years at the DOE WP laboratories, is limited. (3) Both the methodology for and technology used in verification analysis have evolved and been improved since the original test suite was proposed. (4) The proposed test problems are in three basic areas: (a) Hydrodynamics; (b) Transport processes; and (c) Dynamic strength-of-materials. (5) For several of the proposed problems we provide a 'strong sense verification benchmark', consisting of (i) a clear mathematical statement of the problem with sufficient information to run a computer simulation, (ii) an explanation of how the code result and benchmark solution are to be evaluated, and (iii) a description of the acceptance criterion for simulation code results. (6) It is proposed that the set of verification test problems with which any particular code be evaluated include some of the problems described in this document. Analysis of the proposed verification test problems constitutes part of a necessary--but not sufficient--step that builds confidence in physics and engineering simulation codes. More complicated test cases, including physics models of greater sophistication or other physics regimes (e.g., energetic material response, magneto-hydrodynamics), would represent a scientifically desirable complement to the fundamental test cases discussed in this report. The authors believe that this document can be used to enhance the verification analyses undertaken at the DOE WP Laboratories and, thus, to improve the quality, credibility, and usefulness of the simulation codes that are analyzed with these problems.
The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit
NASA Technical Reports Server (NTRS)
Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Mark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete;
1998-01-01
Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.
NASA Technical Reports Server (NTRS)
Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)
2001-01-01
The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the take-up of this programming model has been affected by the lack of a programming standard for using directives and by rather limited performance scalability. Significant progress has been made in hardware and software technologies, and as a result the performance of parallel programs with compiler directives has also improved. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich) to automatically generate OpenMP based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit on the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.
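The loop categories the toolkit must distinguish are easy to illustrate. The C fragment below (our hand-written example, compiled with -fopenmp; the toolkit itself targets Fortran) shows the two commonest cases: a loop whose temporary is private by scoping, and a reduction loop that would otherwise look like a serializing dependence:

```c
/* Two loop categories an auto-parallelizer must classify. */
#include <stdio.h>
#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Independent iterations; 'tmp' is private by scoping, so the
     * loop can be marked parallel with no extra clauses. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double tmp = (double)i / N;
        a[i] = tmp * tmp;
    }

    /* 'sum' carries a cross-iteration dependence that analysis must
     * recognize as a reduction rather than a serialization. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %f\n", sum);   /* approximately N/3 */
    return 0;
}
```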
DNA Assembly with De Bruijn Graphs Using an FPGA Platform.
Poirier, Carl; Gosselin, Benoit; Fortier, Paul
2018-01-01
This paper presents an FPGA implementation of a DNA assembly algorithm, called Ray, initially developed to run on parallel CPUs. The OpenCL language is used, and the focus is placed on modifying and optimizing the original algorithm to better suit the new parallelization tool and the radically different hardware architecture. The results show that the execution time is roughly one-fourth that of the CPU, and that factoring in energy consumption yields a tenfold savings.
Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael
2015-04-08
The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. As a result, application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, therefore enabling the applications to achieve better time to solution and scalability on Titan during production.
Benchmarking NWP Kernels on Multi- and Many-core Processors
NASA Astrophysics Data System (ADS)
Michalakes, J.; Vachharajani, M.
2008-12-01
Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc. (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
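A kernel benchmark of the sort described boils down to timing a representative loop and reporting flop rate and arithmetic intensity. The C harness below illustrates the pattern on a generic 4-neighbor stencil (our stand-in; the WRF kernels themselves are not reproduced). It assumes POSIX clock_gettime and an optimizing compile (e.g., -O2):

```c
/* Minimal kernel-timing harness: flop rate and arithmetic intensity. */
#include <stdio.h>
#include <time.h>

#define N 4096

static float a[N][N], b[N][N];

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = (float)(i + j);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)   /* 3 adds + 1 mult = 4 flops */
            b[i][j] = 0.25f * (a[i-1][j] + a[i+1][j]
                             + a[i][j-1] + a[i][j+1]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double flops = 4.0 * (N - 2) * (N - 2);
    /* Crude traffic estimate: one load plus one store per point. */
    double bytes = 2.0 * sizeof(float) * (N - 2) * (N - 2);
    printf("%.2f GF/s, intensity ~%.2f flop/byte\n",
           flops / sec / 1e9, flops / bytes);
    return (int)b[1][1];   /* keep the result live */
}
```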
High-Strength Composite Fabric Tested at Structural Benchmark Test Facility
NASA Technical Reports Server (NTRS)
Krause, David L.
2002-01-01
Large sheets of ultrahigh strength fabric were put to the test at NASA Glenn Research Center's Structural Benchmark Test Facility. The material was stretched like a snare drum head until the last ounce of strength was reached, when it burst with a cacophonous release of tension. Along the way, the 3-ft square samples were also pulled, warped, tweaked, pinched, and yanked to predict the material's physical reactions to the many loads that it will experience during its proposed use. The material tested was a unique multi-ply composite fabric, reinforced with fibers that had a tensile strength eight times that of common carbon steel. The fiber plies were oriented at 0° and 90° to provide great membrane stiffness, as well as at 45° to provide an unusually high resistance to shear distortion. The fabric's heritage is in astronaut space suits and other NASA programs.
Overkamp, Wout; Beilharz, Katrin; Detert Oude Weme, Ruud; Solopova, Ana; Karsens, Harma; Kovács, Ákos T.; Kok, Jan
2013-01-01
Green fluorescent protein (GFP) offers efficient ways of visualizing promoter activity and protein localization in vivo, and many different variants are currently available to study bacterial cell biology. Which of these variants is best suited for a certain bacterial strain, goal, or experimental condition is not clear. Here, we have designed and constructed two “superfolder” GFPs with codon adaptation specifically for Bacillus subtilis and Streptococcus pneumoniae and have benchmarked them against five other previously available variants of GFP in B. subtilis, S. pneumoniae, and Lactococcus lactis, using promoter-gfp fusions. Surprisingly, the best-performing GFP under our experimental conditions in B. subtilis was the one codon optimized for S. pneumoniae and vice versa. The data and tools described in this study will be useful for cell biology studies in low-GC-rich Gram-positive bacteria. PMID:23956387
Parallelization of Unsteady Adaptive Mesh Refinement for Unstructured Navier-Stokes Solvers
NASA Technical Reports Server (NTRS)
Schwing, Alan M.; Nompelis, Ioannis; Candler, Graham V.
2014-01-01
This paper explores the implementation of the MPI parallelization in a Navier-Stokes solver using adaptive mesh refinement. Viscous and inviscid test problems are considered for the purpose of benchmarking, as are implicit and explicit time advancement methods. The main test problem for comparison includes effects from boundary layers and other viscous features and requires a large number of grid points for accurate computation. Experimental validation against double cone experiments in hypersonic flow is shown. The adaptive mesh refinement shows promise for a staple test problem in the hypersonic community. Extension to more advanced techniques for more complicated flows is described.
NASA Astrophysics Data System (ADS)
Stupakov, Gennady; Zhou, Demin
2016-04-01
We develop a general model of coherent synchrotron radiation (CSR) impedance with shielding provided by two parallel conducting plates. This model allows us to easily reproduce all previously known analytical CSR wakes and to expand the analysis to situations not explored before. It reduces calculations of the impedance to taking integrals along the trajectory of the beam. New analytical results are derived for the radiation impedance with shielding for the following orbits: a kink, a bending magnet, a wiggler of finite length, and an infinitely long wiggler. All our formulas are benchmarked against numerical simulations with the CSRZ computer code.
Using domain decomposition in the multigrid NAS parallel benchmark on the Fujitsu VPP500
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, J.C.H.; Lung, H.; Katsumata, Y.
1995-12-01
In this paper, we demonstrate how domain decomposition can be applied to the multigrid algorithm to convert the code for MPP architectures. We also discuss the performance and scalability of this implementation on the new product line of Fujitsu's vector parallel computer, the VPP500. This computer has Fujitsu's well-known vector processor as the PE, each rated at 1.6 GFLOPS. The high speed crossbar network rated at 800 MB/s provides the inter-PE communication. The results show that physical domain decomposition is the best way to solve MG problems on the VPP500.
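Converting the multigrid benchmark to domain decomposition means each processing element smooths its own subdomain and exchanges a halo of boundary cells each sweep. The MPI sketch below shows that communication pattern for a 1-D slab decomposition of a Jacobi smoother; the grid size and boundary data are illustrative only (MG itself is 3-D with a full V-cycle):

```c
/* 1-D domain decomposition with halo exchange around a Jacobi sweep. */
#include <mpi.h>

#define NLOC 64                       /* interior cells per rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int lo = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
    int hi = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double u[NLOC + 2] = { 0.0 };     /* one ghost cell on each side */
    if (rank == 0) u[0] = 1.0;        /* fixed boundary value        */

    for (int iter = 0; iter < 100; ++iter) {
        /* Exchange halos with both neighbors... */
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, hi, 0,
                     &u[0],    1, MPI_DOUBLE, lo, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, lo, 1,
                     &u[NLOC + 1], 1, MPI_DOUBLE, hi, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ...then smooth the interior. */
        double v[NLOC + 2];
        for (int i = 1; i <= NLOC; ++i)
            v[i] = 0.5 * (u[i - 1] + u[i + 1]);
        for (int i = 1; i <= NLOC; ++i)
            u[i] = v[i];
    }
    MPI_Finalize();
    return 0;
}
```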
3D multiphysics modeling of superconducting cavities with a massively parallel simulation suite
NASA Astrophysics Data System (ADS)
Kononenko, Oleksiy; Adolphsen, Chris; Li, Zenghai; Ng, Cho-Kuen; Rivetta, Claudio
2017-10-01
Radiofrequency cavities based on superconducting technology are widely used in particle accelerators for various applications. The cavities usually have high quality factors and hence narrow bandwidths, so the field stability is sensitive to detuning from the Lorentz force and external loads, including vibrations and helium pressure variations. If not properly controlled, the detuning can result in a serious performance degradation of a superconducting accelerator, so an understanding of the underlying detuning mechanisms can be very helpful. Recent advances in the simulation suite ace3p have enabled realistic multiphysics characterization of such complex accelerator systems on supercomputers. In this paper, we present the new capabilities in ace3p for large-scale 3D multiphysics modeling of superconducting cavities, in particular, a parallel eigensolver for determining mechanical resonances, a parallel harmonic response solver to calculate the response of a cavity to external vibrations, and a numerical procedure to decompose mechanical loads, such as from the Lorentz force or piezoactuators, into the corresponding mechanical modes. These capabilities have been used to do an extensive rf-mechanical analysis of dressed TESLA-type superconducting cavities. The simulation results and their implications for the operational stability of the Linac Coherent Light Source-II are discussed.
Compute Server Performance Results
NASA Technical Reports Server (NTRS)
Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)
1994-01-01
Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single processor rate of any vendor. However, if the price-performance ratio (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPRs of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite.
Smith, M.P.; Donoghue, P.C.J.; Repetski, J.E.
2005-01-01
A clear distinction may be drawn between the perpendicular architecture of the feeding apparatus of ozarkodinid, prioniodontid and prioniodinid conodonts, in which the P elements are situated at a high angle to the M and S elements, and the parallel architecture of panderodontid and other coniform apparatuses, where two suites of coniform elements lie parallel to each other and oppose across the midline. The quest for homologies between the two architectures has been fraught with difficulty, at least in part because of the paucity of natural assemblages of coniform taxa. A diagenetically fused apparatus of Cordylodus lindstromi elements is here described that is made up of one rounded and two compressed element morphotypes. One of the compressed elements is bowed and asymmetrical and the other is unbowed and more symmetrical. These compressed elements are considered to be homologous with those of panderodontid apparatuses and would have lain at the caudal end of the parallel arrays, with the more symmetrical morphotypes located rostrally to the asymmetrical ones. The bowed and unbowed compressed elements of Cordylodus thus correspond, respectively, to the pt and pf positions of panderodontid apparatuses. In addition, the presence of symmetry transition within the rounded elements of Cordylodus, but not the compressed morphotypes, enables correlation of these with the S and M element locations of ozarkodinid apparatuses. By extension, the compressed elements must be homologues of the P elements. Specifically, the asymmetrical pt morphotype is homologous with the P1 of ozarkodinids and the more symmetrical and rostral pf morphotype is homologous with the P2 position. However, because of uncertainties over the nature of topological transformation of the rostral element array (the "rounded" or "costate" suites), it is not possible to recognize specific homologies between these elements and the M and S elements of ozarkodinids. Morphologic differentiation of P from M and S element suites thus preceded the topological transformation from parallel to perpendicular apparatus architectures.
ERIC Educational Resources Information Center
Janc, Helen; Appelbaum, Deborah
2004-01-01
This newsletter discusses how in many ways the development of the school reform concept parallels the evolution of thinking about principal leadership over the past decade. Under Comprehensive School Reform (CSR), schools are increasingly being asked to use data to plan for improvement and to fortify instruction and professional development…
2012-02-09
…benchmark problem we contacted Bertrand LeCun, who in their project CHOC from 2005-2008 had applied their parallel B&B framework BOB++ to the RLT1…
Advances in molecular quantum chemistry contained in the Q-Chem 4 program package
NASA Astrophysics Data System (ADS)
Shao, Yihan; Gan, Zhengting; Epifanovsky, Evgeny; Gilbert, Andrew T. B.; Wormit, Michael; Kussmann, Joerg; Lange, Adrian W.; Behn, Andrew; Deng, Jia; Feng, Xintian; Ghosh, Debashree; Goldey, Matthew; Horn, Paul R.; Jacobson, Leif D.; Kaliman, Ilya; Khaliullin, Rustam Z.; Kuś, Tomasz; Landau, Arie; Liu, Jie; Proynov, Emil I.; Rhee, Young Min; Richard, Ryan M.; Rohrdanz, Mary A.; Steele, Ryan P.; Sundstrom, Eric J.; Woodcock, H. Lee, III; Zimmerman, Paul M.; Zuev, Dmitry; Albrecht, Ben; Alguire, Ethan; Austin, Brian; Beran, Gregory J. O.; Bernard, Yves A.; Berquist, Eric; Brandhorst, Kai; Bravaya, Ksenia B.; Brown, Shawn T.; Casanova, David; Chang, Chun-Min; Chen, Yunqing; Chien, Siu Hung; Closser, Kristina D.; Crittenden, Deborah L.; Diedenhofen, Michael; DiStasio, Robert A., Jr.; Do, Hainam; Dutoi, Anthony D.; Edgar, Richard G.; Fatehi, Shervin; Fusti-Molnar, Laszlo; Ghysels, An; Golubeva-Zadorozhnaya, Anna; Gomes, Joseph; Hanson-Heine, Magnus W. D.; Harbach, Philipp H. P.; Hauser, Andreas W.; Hohenstein, Edward G.; Holden, Zachary C.; Jagau, Thomas-C.; Ji, Hyunjun; Kaduk, Benjamin; Khistyaev, Kirill; Kim, Jaehoon; Kim, Jihan; King, Rollin A.; Klunzinger, Phil; Kosenkov, Dmytro; Kowalczyk, Tim; Krauter, Caroline M.; Lao, Ka Un; Laurent, Adèle D.; Lawler, Keith V.; Levchenko, Sergey V.; Lin, Ching Yeh; Liu, Fenglai; Livshits, Ester; Lochan, Rohini C.; Luenser, Arne; Manohar, Prashant; Manzer, Samuel F.; Mao, Shan-Ping; Mardirossian, Narbe; Marenich, Aleksandr V.; Maurer, Simon A.; Mayhall, Nicholas J.; Neuscamman, Eric; Oana, C. Melania; Olivares-Amaya, Roberto; O'Neill, Darragh P.; Parkhill, John A.; Perrine, Trilisa M.; Peverati, Roberto; Prociuk, Alexander; Rehn, Dirk R.; Rosta, Edina; Russ, Nicholas J.; Sharada, Shaama M.; Sharma, Sandeep; Small, David W.; Sodt, Alexander; Stein, Tamar; Stück, David; Su, Yu-Chuan; Thom, Alex J. W.; Tsuchimochi, Takashi; Vanovschi, Vitalii; Vogt, Leslie; Vydrov, Oleg; Wang, Tao; Watson, Mark A.; Wenzel, Jan; White, Alec; Williams, Christopher F.; Yang, Jun; Yeganeh, Sina; Yost, Shane R.; You, Zhi-Qiang; Zhang, Igor Ying; Zhang, Xing; Zhao, Yan; Brooks, Bernard R.; Chan, Garnet K. L.; Chipman, Daniel M.; Cramer, Christopher J.; Goddard, William A., III; Gordon, Mark S.; Hehre, Warren J.; Klamt, Andreas; Schaefer, Henry F., III; Schmidt, Michael W.; Sherrill, C. David; Truhlar, Donald G.; Warshel, Arieh; Xu, Xin; Aspuru-Guzik, Alán; Baer, Roi; Bell, Alexis T.; Besley, Nicholas A.; Chai, Jeng-Da; Dreuw, Andreas; Dunietz, Barry D.; Furlani, Thomas R.; Gwaltney, Steven R.; Hsu, Chao-Ping; Jung, Yousung; Kong, Jing; Lambrecht, Daniel S.; Liang, WanZhen; Ochsenfeld, Christian; Rassolov, Vitaly A.; Slipchenko, Lyudmila V.; Subotnik, Joseph E.; Van Voorhis, Troy; Herbert, John M.; Krylov, Anna I.; Gill, Peter M. W.; Head-Gordon, Martin
2015-01-01
A summary of the technical advances that are incorporated in the fourth major release of the Q-Chem quantum chemistry program is provided, covering approximately the last seven years. These include developments in density functional theory methods and algorithms, nuclear magnetic resonance (NMR) property evaluation, coupled cluster and perturbation theories, methods for electronically excited and open-shell species, tools for treating extended environments, algorithms for walking on potential surfaces, analysis tools, energy and electron transfer modelling, parallel computing capabilities, and graphical user interfaces. In addition, a selection of example case studies that illustrate these capabilities is given. These include extensive benchmarks of the comparative accuracy of modern density functionals for bonded and non-bonded interactions, tests of attenuated second order Møller-Plesset (MP2) methods for intermolecular interactions, a variety of parallel performance benchmarks, and tests of the accuracy of implicit solvation models. Some specific chemical examples include calculations on the strongly correlated Cr2 dimer, exploring zeolite-catalysed ethane dehydrogenation, energy decomposition analysis of a charged termolecular complex arising from glycerol photoionisation, and natural transition orbitals for a Frenkel exciton state in a nine-unit model of a self-assembling nanotube.
Parallel processing considerations for image recognition tasks
NASA Astrophysics Data System (ADS)
Simske, Steven J.
2011-01-01
Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows (as diverse as optical character recognition [OCR], document classification and barcode reading) to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.
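Parallel processing by image region maps directly onto a tiled map-reduce loop. The OpenMP C sketch below tiles an image, runs a per-tile task on each tile, and reduces the results; the bright-pixel counter is a hypothetical stand-in for a real detector such as skew or face detection:

```c
/* "Parallel by image region": map a per-tile task, reduce the counts. */
#include <stdio.h>

#define W 1024
#define H 1024
#define TILE 128

static unsigned char img[H][W];

/* Hypothetical per-tile task: count bright pixels in one region. */
static long detect_in_tile(int ty, int tx) {
    long hits = 0;
    for (int y = ty; y < ty + TILE; ++y)
        for (int x = tx; x < tx + TILE; ++x)
            if (img[y][x] > 200) ++hits;
    return hits;
}

int main(void) {
    for (int i = 0; i < 64; ++i) img[i][i] = 255;  /* toy content */

    long total = 0;
    /* Map: tiles are independent; reduce: sum the per-tile results. */
    #pragma omp parallel for collapse(2) reduction(+:total)
    for (int ty = 0; ty < H; ty += TILE)
        for (int tx = 0; tx < W; tx += TILE)
            total += detect_in_tile(ty, tx);
    printf("bright pixels: %ld\n", total);
    return 0;
}
```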
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gerhard Strydom
2014-04-01
The INL PHISICS code system consists of three modules providing improved core simulation capability: INSTANT (performing 3D nodal transport core calculations), MRTAU (depletion and decay heat generation) and a perturbation/mixer module. Coupling of the PHISICS code suite to the thermal hydraulics system code RELAP5-3D has recently been finalized, and as part of the code verification and validation program the exercises defined for Phase I of the OECD/NEA MHTGR 350 MW Benchmark were completed. This paper provides an overview of the MHTGR Benchmark, and presents selected results of the three steady state exercises 1-3 defined for Phase I. For Exercise 1, a stand-alone steady-state neutronics solution for an End of Equilibrium Cycle Modular High Temperature Reactor (MHTGR) was calculated with INSTANT, using the provided geometry, material descriptions, and detailed cross-section libraries. Exercise 2 required the modeling of a stand-alone thermal fluids solution. The RELAP5-3D results of four sub-cases are discussed, consisting of various combinations of coolant bypass flows and material thermophysical properties. Exercise 3 combined the first two exercises in a coupled neutronics and thermal fluids solution, and the coupled code suite PHISICS/RELAP5-3D was used to calculate the results of two sub-cases. The main focus of the paper is a comparison of the traditional RELAP5-3D “ring” model approach vs. a much more detailed model that includes kinetics feedback on individual block level and thermal feedbacks on a triangular sub-mesh. The higher fidelity of the block model is illustrated with comparison results on the temperature, power density and flux distributions, and the typical under-predictions produced by the ring model approach are highlighted.
Summary of Documentation for DYNA3D-ParaDyn's Software Quality Assurance Regression Test Problems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zywicz, Edward
The Software Quality Assurance (SQA) regression test suite for DYNA3D (Zywicz and Lin, 2015) and ParaDyn (DeGroot, et al., 2015) currently contains approximately 600 problems divided into 21 suites, and is a required component of ParaDyn’s SQA plan (Ferencz and Oliver, 2013). The regression suite allows developers to ensure that software modifications do not unintentionally alter the code response. The entire regression suite is run prior to permanently incorporating any software modification or addition. When code modifications alter test problem results, the specific cause must be determined and fully understood before the software changes and revised test answers can be incorporated. The regression suite is executed on LLNL platforms using a Python script and an associated data file. The user specifies the DYNA3D or ParaDyn executable, number of processors to use, test problems to run, and other options to the script. The data file details how each problem and its answer extraction scripts are executed. For each problem in the regression suite there exists an input deck, an eight-processor partition file, an answer file, and various extraction scripts. These scripts assemble a temporary answer file in a specific format from the simulation results. The temporary and stored answer files are compared to a specific level of numerical precision, and when differences are detected the test problem is flagged as failed. Presently, numerical results are stored and compared to 16 digits. At this accuracy level different processor types, compilers, number of partitions, etc. impact the results to various degrees. Thus, for consistency purposes the regression suite is run with ParaDyn using 8 processors on machines with a specific processor type (currently the Intel Xeon E5530 processor). For non-parallel regression problems, i.e., the two XFEM problems, DYNA3D is used instead. When environments or platforms change, executables using the current source code and the new resource are created and the regression suite is run. If differences in answers arise, the new answers are retained provided that the differences are inconsequential. This bootstrap approach allows the test suite answers to evolve in a controlled manner with a high level of confidence. Developers also run the entire regression suite with (serial) DYNA3D. While these results normally differ from the stored (parallel) answers, abnormal termination or wildly different values are strong indicators of potential issues.
Reconstruction of coded aperture images
NASA Technical Reports Server (NTRS)
Bielefeld, Michael J.; Yin, Lo I.
1987-01-01
The balanced correlation method and the Maximum Entropy Method (MEM) were implemented to reconstruct a laboratory X-ray source as imaged by a Uniformly Redundant Array (URA) system. Although the MEM method has advantages over the balanced correlation method, it is computationally time consuming because of the iterative nature of its solution. The Massively Parallel Processor (MPP), with its parallel array structure, is ideally suited for such computations. These preliminary results indicate that it is possible to use the MEM method in future coded-aperture experiments with the help of the MPP.
Block-Parallel Data Analysis with DIY2
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morozov, Dmitriy; Peterka, Tom
DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.
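The essence of the block-structured model is that user code sees only per-block callbacks, so the runtime is free to decide which blocks are resident when. A language-neutral C sketch of the pattern follows (DIY2 itself is a C++ library; none of these names are its API):

```c
/* Block-parallel skeleton: the runtime cycles blocks through memory
 * and invokes a per-block callback; user code never sees placement. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int gid; double *data; size_t n; } block_t;

static void load(block_t *b)  { b->data = calloc(b->n, sizeof(double)); }
static void evict(block_t *b) { free(b->data); b->data = NULL; }

/* One round: apply fn to every locally owned block. A real runtime
 * could keep several blocks resident and run them on threads; here
 * they are simply cycled through memory one at a time (out-of-core). */
static void foreach_block(block_t *blk, int nblk, void (*fn)(block_t *)) {
    for (int i = 0; i < nblk; ++i) {
        load(&blk[i]);
        fn(&blk[i]);
        evict(&blk[i]);
    }
}

static void smooth(block_t *b) {          /* per-block computation */
    for (size_t j = 0; j < b->n; ++j)
        b->data[j] += 1.0;
}

int main(void) {
    block_t blocks[4] = { {0, NULL, 1024}, {1, NULL, 1024},
                          {2, NULL, 1024}, {3, NULL, 1024} };
    foreach_block(blocks, 4, smooth);
    puts("processed 4 blocks");
    return 0;
}
```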
Processing large remote sensing image data sets on Beowulf clusters
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Schmidt, Gail
2003-01-01
High-performance computing is often concerned with the speed at which floating-point calculations can be performed. The architectures of many parallel computers and/or their network topologies are based on these investigations. Often, benchmarks resulting from these investigations are compiled with little regard to how a large dataset would move about in these systems. This part of the Beowulf study addresses that concern by looking at specific applications software and system-level modifications. Applications include an implementation of a smoothing filter for time-series data, a parallel implementation of the decision tree algorithm used in the Landcover Characterization project, a parallel Kriging algorithm used to fit point data collected in the field on invasive species to a regular grid, and modifications to the Beowulf project's resampling algorithm to handle larger, higher resolution datasets at a national scale. Systems-level investigations include a feasibility study on Flat Neighborhood Networks and modifications of that concept with Parallel File Systems.
A Systems Approach to Scalable Transportation Network Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perumalla, Kalyan S
2006-01-01
Emerging needs in transportation network modeling and simulation are raising new challenges with respect to scalability of network size and vehicular traffic intensity, speed of simulation for simulation-based optimization, and fidelity of vehicular behavior for accurate capture of event phenomena. Parallel execution is warranted to sustain the required detail, size and speed. However, few parallel simulators exist for such applications, partly due to the challenges underlying their development. Moreover, many simulators are based on time-stepped models, which can be computationally inefficient for the purposes of modeling evacuation traffic. Here an approach is presented to designing a simulator with memory and speed efficiency as the goals from the outset, and, specifically, scalability via parallel execution. The design makes use of discrete event modeling techniques as well as parallel simulation methods. Our simulator, called SCATTER, is being developed, incorporating such design considerations. Preliminary performance results are presented on benchmark road networks, showing scalability to one million vehicles simulated on one processor.
Sequential Feedback Scheme Outperforms the Parallel Scheme for Hamiltonian Parameter Estimation.
Yuan, Haidong
2016-10-14
Measurement and estimation of parameters are essential for science and engineering, where the main quest is to find the highest achievable precision with the given resources and design schemes to attain it. Two schemes, the sequential feedback scheme and the parallel scheme, are usually studied in the quantum parameter estimation. While the sequential feedback scheme represents the most general scheme, it remains unknown whether it can outperform the parallel scheme for any quantum estimation tasks. In this Letter, we show that the sequential feedback scheme has a threefold improvement over the parallel scheme for Hamiltonian parameter estimations on two-dimensional systems, and an order of O(d+1) improvement for Hamiltonian parameter estimation on d-dimensional systems. We also show that, contrary to the conventional belief, it is possible to simultaneously achieve the highest precision for estimating all three components of a magnetic field, which sets a benchmark on the local precision limit for the estimation of a magnetic field.
The mass storage testing laboratory at GSFC
NASA Technical Reports Server (NTRS)
Venkataraman, Ravi; Williams, Joel; Michaud, David; Gu, Heng; Kalluri, Atri; Hariharan, P. C.; Kobler, Ben; Behnke, Jeanne; Peavey, Bernard
1998-01-01
Industry-wide benchmarks exist for measuring the performance of processors (SPECmarks), and of database systems (Transaction Processing Council). Despite storage having become the dominant item in computing and IT (Information Technology) budgets, no such common benchmark is available in the mass storage field. Vendors and consultants provide services and tools for capacity planning and sizing, but these do not account for the complete set of metrics needed in today's archives. The availability of automated tape libraries, high-capacity RAID systems, and high-bandwidth interconnectivity between processor and peripherals has led to demands for services which traditional file systems cannot provide. File Storage and Management Systems (FSMS), which began to be marketed in the late 80's, have helped to some extent with large tape libraries, but their use has introduced additional parameters affecting performance. The aim of the Mass Storage Test Laboratory (MSTL) at Goddard Space Flight Center is to develop a test suite that includes not only a comprehensive check list to document a mass storage environment but also benchmark code. Benchmark code is being tested that will provide measurements both for baseline systems, i.e., applications interacting with peripherals through operating system services, and for combinations involving an FSMS. The benchmarks are written in C, and are easily portable. They are initially being aimed at the UNIX Open Systems world. Measurements are being made using a Sun Ultra 170 Sparc with 256MB memory running Solaris 2.5.1 with the following configuration: 4mm tape stacker on SCSI 2 Fast/Wide; 4GB disk device on SCSI 2 Fast/Wide; and Sony Petaserve on Fast/Wide differential SCSI 2.
Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06
NASA Astrophysics Data System (ADS)
Charpentier, P.
2017-10-01
In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know its “power” with rather good precision. This allows, for example, pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always runs the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power has been estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite, which was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments have moved to using 64-bit applications, and they use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU-intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework, used by LHCb, for evaluating at run time the performance of the worker nodes. This contribution reports on the findings of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmark scales better but is less precise. One can also clearly see that some hardware or software features, when enabled on the worker nodes, may enhance their performance beyond expectation from either benchmark, depending on external factors.
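The matching arithmetic this describes is simple: a benchmarked node power (work units per second) times the remaining wall-clock budget gives the CPU-work available, which is then divided by the per-event work to size the job. A C sketch with illustrative numbers (none of these values or unit names are LHCb's actual normalization):

```c
/* Size a job in events from a benchmarked node power and slot budget. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double power          = 11.4;    /* benchmark score, work units/s   */
    double slot_seconds   = 36000.0; /* queue limit minus safety margin */
    double work_per_event = 25.0;    /* measured CPU-work per event     */

    double budget  = power * slot_seconds;          /* total CPU-work */
    long   nevents = (long)floor(budget / work_per_event);
    printf("schedule %ld events (budget %.0f work units)\n",
           nevents, budget);
    return 0;
}
```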
Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.
2014-01-01
Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
ERIC Educational Resources Information Center
Lin, Zeng; Gardner, Dianne
2006-01-01
The purpose of this study is to demonstrate the usefulness of the Schools and Staffing Survey (SASS) for the comparative analysis of alumni teachers. This article shows how SASS can be used as an evaluative tool by any institution that wants to appraise its alumni in comparison to those of its parallel institutions for the purposes of…
ERIC Educational Resources Information Center
Rice, Mabel L.; Redmond, Sean M.; Hoffman, Lesa
2006-01-01
Purpose: Although mean length of utterance (MLU) is a useful benchmark in studies of children with specific language impairment (SLI), some empirical and interpretive issues are unresolved. The authors report on 2 studies examining, respectively, the concurrent validity and temporal stability of MLU equivalency between children with SLI and…
Hagen, Espen; Ness, Torbjørn V; Khosrowshahi, Amir; Sørensen, Christina; Fyhn, Marianne; Hafting, Torkel; Franke, Felix; Einevoll, Gaute T
2015-04-30
New, silicon-based multielectrodes comprising hundreds or more electrode contacts offer the possibility to record spike trains from thousands of neurons simultaneously. This potential cannot be realized unless accurate, reliable automated methods for spike sorting are developed, in turn requiring benchmarking data sets with known ground-truth spike times. We here present a general simulation tool for computing benchmarking data for evaluation of spike-sorting algorithms entitled ViSAPy (Virtual Spiking Activity in Python). The tool is based on a well-established biophysical forward-modeling scheme and is implemented as a Python package built on top of the neuronal simulator NEURON and the Python tool LFPy. ViSAPy allows for arbitrary combinations of multicompartmental neuron models and geometries of recording multielectrodes. Three example benchmarking data sets are generated, i.e., tetrode and polytrode data mimicking in vivo cortical recordings and microelectrode array (MEA) recordings of in vitro activity in salamander retinas. The synthesized example benchmarking data mimics salient features of typical experimental recordings, for example, spike waveforms depending on interspike interval. ViSAPy goes beyond existing methods as it includes biologically realistic model noise, synaptic activation by recurrent spiking networks, finite-sized electrode contacts, and allows for inhomogeneous electrical conductivities. ViSAPy is optimized to allow for generation of long time series of benchmarking data, spanning minutes of biological time, by parallel execution on multi-core computers. ViSAPy is an open-ended tool as it can be generalized to produce benchmarking data for arbitrary recording-electrode geometries and with various levels of complexity. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
Vascular system modeling in parallel environment - distributed and shared memory approaches
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-01-01
The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, obtained on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Benchmarking hardware architecture candidates for the NFIRAOS real-time controller
NASA Astrophysics Data System (ADS)
Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre
2014-07-01
As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.
An Application-Based Performance Characterization of the Columbia Supercluster
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Djomehri, Jahed M.; Hood, Robert; Jin, Haoqiang; Kiris, Cetin; Saini, Subhash
2005-01-01
Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks, and some of the NAS Parallel Benchmarks including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA, one from molecular dynamics, and two from computational fluid dynamics. Our results show that both the NUMAlink4 and the InfiniBand hold promise for application scaling to a large number of processors.
Development and Testing of Neutron Cross Section Covariance Data for SCALE 6.2
DOE Office of Scientific and Technical Information (OSTI.GOV)
Marshall, William BJ J; Williams, Mark L; Wiarda, Dorothea
2015-01-01
Neutron cross-section covariance data are essential for many sensitivity/uncertainty and uncertainty quantification assessments performed both within the TSUNAMI suite and more broadly throughout the SCALE code system. The release of ENDF/B-VII.1 included a more complete set of neutron cross-section covariance data: these data form the basis for a new cross-section covariance library to be released in SCALE 6.2. A range of testing is conducted to investigate the properties of these covariance data and ensure that the data are reasonable. These tests include examination of the uncertainty in critical experiment benchmark model k-eff values due to nuclear data uncertainties, as well as similarity assessments of irradiated pressurized water reactor (PWR) and boiling water reactor (BWR) fuel with suites of critical experiments. The contents of the new covariance library, the testing performed, and the behavior of the new covariance data are described in this paper. The neutron cross-section covariances can be combined with a sensitivity data file generated using the TSUNAMI suite of codes within SCALE to determine the uncertainty in system k-eff caused by nuclear data uncertainties. The Verified, Archived Library of Inputs and Data (VALID) maintained at Oak Ridge National Laboratory (ORNL) contains over 400 critical experiment benchmark models, and sensitivity data are generated for each of these models. The nuclear data uncertainty in k-eff is generated for each experiment, and the resulting uncertainties are tabulated and compared to the differences in measured and calculated results. The magnitude of the uncertainty for categories of nuclides (such as actinides, fission products, and structural materials) is calculated for irradiated PWR and BWR fuel to quantify the effect of covariance library changes between the SCALE 6.1 and 6.2 libraries. One of the primary applications of sensitivity/uncertainty methods within SCALE is the assessment of similarities between benchmark experiments and safety applications. This is described by a c_k value for each experiment with each application. Several studies have analyzed typical c_k values for a range of critical experiments compared with hypothetical irradiated fuel applications. The c_k value is sensitive to the cross-section covariance data because the contribution of each nuclide is influenced by its uncertainty; large uncertainties indicate more likely bias sources and are thus given more weight. Changes in c_k values resulting from different covariance data can be used to examine and assess underlying data changes. These comparisons are performed for PWR and BWR fuel in storage and transportation systems.
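Both quantities mentioned here have compact linear-algebra forms: the nuclear-data-induced relative uncertainty in k-eff follows the "sandwich rule" sigma^2 = S^T C S, and c_k is the correlation coefficient between the data-induced uncertainties of two systems. The Python sketch below shows the arithmetic with made-up three-entry sensitivity vectors and covariances (purely illustrative values, not SCALE data):

```python
import numpy as np

# Toy illustration of the "sandwich rule": relative uncertainty in k-eff is
# sqrt(S^T C S), where S holds relative sensitivities and C relative
# covariances. The 3-entry vectors below are made up for illustration.
C = np.array([[4.0e-4, 1.0e-4, 0.0],
              [1.0e-4, 9.0e-4, 2.0e-4],
              [0.0,    2.0e-4, 1.6e-3]])   # relative covariance matrix

S_app = np.array([0.30, 0.15, 0.05])       # application sensitivities
S_exp = np.array([0.28, 0.18, 0.02])       # benchmark-experiment sensitivities

sigma_app = np.sqrt(S_app @ C @ S_app)
sigma_exp = np.sqrt(S_exp @ C @ S_exp)

# c_k: correlation of the nuclear-data-induced uncertainties of the two systems.
c_k = (S_app @ C @ S_exp) / (sigma_app * sigma_exp)
print(f"sigma(app) = {sigma_app:.5f}, sigma(exp) = {sigma_exp:.5f}, c_k = {c_k:.3f}")
```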
Synergia: an accelerator modeling tool with 3-D space charge
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amundson, James F.; Spentzouris, P.; /Fermilab
2004-07-01
High precision modeling of space-charge effects, together with accurate treatment of single-particle dynamics, is essential for designing future accelerators as well as optimizing the performance of existing machines. We describe Synergia, a high-fidelity parallel beam dynamics simulation package with fully three dimensional space-charge capabilities and a higher order optics implementation. We describe the computational techniques, the advanced human interface, and the parallel performance obtained using large numbers of macroparticles. We also perform code benchmarks comparing to semi-analytic results and other codes. Finally, we present initial results on particle tune spread, beam halo creation, and emittance growth in the Fermilab booster accelerator.
The development of a revised version of multi-center molecular Ornstein-Zernike equation
NASA Astrophysics Data System (ADS)
Kido, Kentaro; Yokogawa, Daisuke; Sato, Hirofumi
2012-04-01
Ornstein-Zernike (OZ)-type theory is a powerful tool to obtain the 3-dimensional solvent distribution around a solute molecule. Recently, we proposed the multi-center molecular OZ method, which is suitable for parallel computing of 3D solvation structure. The distribution function in this method consists of two components, namely reference and residue parts. Several types of the function were examined as the reference part to investigate the numerical robustness of the method. As benchmarks, the method is applied to water, benzene in aqueous solution, and a single-walled carbon nanotube in chloroform solution. The results indicate that full parallelization is achieved by utilizing the newly proposed reference functions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stupakov, Gennady; Zhou, Demin
2016-04-21
We develop a general model of coherent synchrotron radiation (CSR) impedance with shielding provided by two parallel conducting plates. This model allows us to easily reproduce all previously known analytical CSR wakes and to expand the analysis to situations not explored before. It reduces calculations of the impedance to taking integrals along the trajectory of the beam. New analytical results are derived for the radiation impedance with shielding for the following orbits: a kink, a bending magnet, a wiggler of finite length, and an infinitely long wiggler. Furthermore, all our formulas are benchmarked against numerical simulations with the CSRZ computer code.
NASA Astrophysics Data System (ADS)
Rebuffi, Luca; Sanchez del Rio, Manuel
2017-08-01
In the next years most of the major synchrotron radiation facilities around the world will upgrade to 4th-generation Diffraction Limited Storage Rings using multi-bend-achromat technology. Moreover, several Free Electron Lasers are ready to go or nearing completion. These events represent a huge challenge for the optics physicists responsible for designing and calculating optical systems capable of exploiting the revolutionary characteristics of the new photon beams. Reliable and robust beamline design is nowadays based on sophisticated computer simulations only possible by lumping together different simulation tools. The OASYS (OrAnge SYnchrotron Suite) suite drives several simulation tools, providing new mechanisms of interoperability and communication within the same software environment. OASYS has been successfully used during the conceptual design of many beamline and optical designs for the ESRF and Elettra-Sincrotrone Trieste upgrades. Some examples are presented showing comparisons and benchmarking of simulations against calculated and experimental data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lopes, M. L.
2014-07-01
SolCalc is a software suite that computes and displays magnetic fields generated by a three dimensional (3D) solenoid system. Examples of such systems are the Mu2e magnet system and Helical Solenoids for muon cooling systems. SolCalc was originally coded in Matlab, and later upgraded to a compiled version (called MEX) to improve solving speed. Matlab was chosen because its graphical capabilities represent an attractive feature over other computer languages. Solenoid geometries can be created using any text editor or spreadsheet and can be displayed dynamically in 3D. Fields are computed from any given list of coordinates. The field distribution on the surfaces of the coils can be displayed as well. SolCalc was benchmarked against a well-known commercial software package for speed and accuracy, and the results compared favorably.
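The core computation such a tool performs can be illustrated with the textbook on-axis formula for a circular current loop, Bz = mu0 I R^2 / (2 (R^2 + z^2)^(3/2)), superposed over the turns of a solenoid. The Python sketch below is a minimal stand-in (not SolCalc's algorithm, which handles off-axis points and full 3D geometry):

```python
import math

MU0 = 4.0e-7 * math.pi  # vacuum permeability, T*m/A

def loop_bz(z, z0, radius, current):
    """On-axis axial field of a single circular current loop centered at z0."""
    d = z - z0
    return MU0 * current * radius**2 / (2.0 * (radius**2 + d**2) ** 1.5)

def solenoid_bz(z, z_start, z_end, n_turns, radius, current):
    """Model a solenoid as n_turns discrete loops and superpose their fields."""
    dz = (z_end - z_start) / n_turns
    return sum(loop_bz(z, z_start + (i + 0.5) * dz, radius, current)
               for i in range(n_turns))

# Example: 1 m long, 0.1 m radius, 1000 turns, 100 A; field at the center.
print(f"Bz(center) = {solenoid_bz(0.5, 0.0, 1.0, 1000, 0.1, 100.0):.4f} T")
```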
Vener, David F; Gaies, Michael; Jacobs, Jeffrey P; Pasquali, Sara K
2017-01-01
The growth in large-scale data management capabilities and the successful care of patients with congenital heart defects have coincidentally paralleled each other for the last three decades, and participation in multicenter congenital heart disease databases and registries is now a fundamental component of cardiac care. This manuscript attempts for the first time to consolidate in one location all of the relevant databases worldwide, including target populations, specialties, Web sites, and participation information. Since at least 1992, cardiac surgeons and cardiologists have been leveraging this burgeoning technology to create multi-institutional data collections addressing a variety of specialties within this field. Pediatric heart diseases are particularly well suited to this methodology because each individual care location has access to only a relatively limited number of diagnoses and procedures in any given calendar year. Combining multiple institutions' data therefore allows for a far more accurate contemporaneous assessment of treatment modalities and adverse outcomes. Additionally, the data can be used to develop outcome benchmarks by which individual institutions can measure their progress against the field as a whole and focus quality improvement efforts in a more directed fashion, and clinical research efforts are increasingly being combined within existing data structures. Efforts are ongoing to support better collaboration and integration across data sets, to improve efficiency, further the utility of the data collection infrastructure and information collected, and to enhance return on investment for participating institutions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dokhane, A.; Canepa, S.; Ferroukhi, H.
For stability analyses of the Swiss operating Boiling-Water-Reactors (BWRs), the methodology employed and validated so far at the Paul Scherrer Institut (PSI) was based on the RAMONA-3 code with a hybrid upstream static lattice/core analysis approach using CASMO-4 and PRESTO-2. More recently, steps were undertaken towards a new methodology based on the SIMULATE-3K (S3K) code for the dynamical analyses, combined with the CMSYS system relying on the CASMO/SIMULATE-3 suite of codes, which was established at PSI to serve as a framework for the development and validation of reference core models of all the Swiss reactors and operated cycles. This paper presents a first validation of the new methodology on the basis of a benchmark recently organised by a Swiss utility and including the participation of several international organisations with various codes/methods. In parallel, a transition from CASMO-4E (C4E) to CASMO-5M (C5M) as the basis for the CMSYS core models was also recently initiated at PSI. Consequently, it was considered adequate to address the impact of this transition both for the steady-state core analyses as well as for the stability calculations, and to achieve thereby an integral approach for the validation of the new S3K methodology. Therefore, a comparative assessment of C4E versus C5M is also presented in this paper, with particular emphasis on the void coefficients and their impact on the downstream stability analysis results. (authors)
A GaAs vector processor based on parallel RISC microprocessors
NASA Astrophysics Data System (ADS)
Misko, Tim A.; Rasset, Terry L.
A vector processor architecture based on a 32-bit microprocessor implemented in gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and the necessary memory interface functions. This architecture has been simulated for several benchmark programs, including a complex fast Fourier transform (FFT), a complex inner product, trigonometric functions, and a sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT in 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
Automation of Data Traffic Control on DSM Architecture
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry
2001-01-01
The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, achieving good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestion. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition, and page size control, and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestion in Fortran array-oriented codes and advises the user on code transformations for improving data traffic in the application.
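Of the techniques listed, data blocking is the easiest to show in isolation. The sketch below (illustrative Python/NumPy, not the paper's Fortran tooling) transposes an array tile by tile so reads and writes stay within a cache- or page-sized working set; the same locality argument motivates blocking for page placement on DSM machines.

```python
import numpy as np

def blocked_transpose(a, block=64):
    """Transpose by tiles so each tile's reads and writes stay in cache;
    the same idea underlies blocking for DSM page locality."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            tile = a[i:i + block, j:j + block]   # slicing clips edge tiles
            out[j:j + block, i:i + block] = tile.T
    return out

a = np.arange(512 * 512, dtype=np.float64).reshape(512, 512)
assert np.array_equal(blocked_transpose(a), a.T)
```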
Computational Performance of a Parallelized Three-Dimensional High-Order Spectral Element Toolbox
NASA Astrophysics Data System (ADS)
Bosshard, Christoph; Bouffanais, Roland; Clémençon, Christian; Deville, Michel O.; Fiétier, Nicolas; Gruber, Ralf; Kehtari, Sohrab; Keller, Vincent; Latt, Jonas
In this paper, a comprehensive performance review of an MPI-based high-order three-dimensional spectral element method C++ toolbox is presented. The focus is put on the performance evaluation of several aspects, with a particular emphasis on the parallel efficiency. The performance evaluation is analyzed with the help of a time prediction model based on a parameterization of the application and the hardware resources. A tailor-made CFD computation benchmark case is introduced and used to carry out this review, stressing the particular interest for clusters with up to 8192 cores. Some problems in the parallel implementation have been detected and corrected. The theoretical complexities with respect to the number of elements, to the polynomial degree, and to communication needs are correctly reproduced. It is concluded that this type of code has a nearly perfect speedup on machines with thousands of cores, and is ready to make the step to next-generation petaflop machines.
Experimental Mapping and Benchmarking of Magnetic Field Codes on the LHD Ion Accelerator
NASA Astrophysics Data System (ADS)
Chitarin, G.; Agostinetti, P.; Gallo, A.; Marconato, N.; Nakano, H.; Serianni, G.; Takeiri, Y.; Tsumori, K.
2011-09-01
For the validation of the numerical models used for the design of the Neutral Beam Test Facility for ITER in Padua [1], an experimental benchmark against a full-size device has been sought. The LHD BL2 injector [2] has been chosen as a first benchmark, because the BL2 Negative Ion Source and Beam Accelerator are geometrically similar to SPIDER, even though BL2 does not include current bars and ferromagnetic materials. A comprehensive 3D magnetic field model of the LHD BL2 device has been developed based on the same assumptions used for SPIDER. In parallel, a detailed experimental magnetic map of the BL2 device has been obtained using a suitably designed 3D adjustable structure for the fine positioning of the magnetic sensors inside 27 of the 770 beamlet apertures. The calculated values have been compared to the experimental data. The work has confirmed the quality of the numerical model, and has also provided useful information on the magnetic non-uniformities due to the edge effects and to the tolerance on permanent magnet remanence.
OWL: A scalable Monte Carlo simulation suite for finite-temperature study of materials
NASA Astrophysics Data System (ADS)
Li, Ying Wai; Yuk, Simuck F.; Cooper, Valentino R.; Eisenbach, Markus; Odbadrakh, Khorgolkhuu
The OWL suite is a simulation package for performing large-scale Monte Carlo simulations. Its object-oriented, modular design enables it to interface with various external packages for energy evaluations. It is therefore applicable to studying the finite-temperature properties of a wide range of systems: from simple classical spin models to materials where the energy is evaluated by ab initio methods. This scheme not only allows for the study of thermodynamic properties based on first-principles statistical mechanics, but also provides a means for massive, multi-level parallelism to fully exploit the capacity of modern heterogeneous computer architectures. We will demonstrate how improved strong and weak scaling is achieved by employing novel, parallel and scalable Monte Carlo algorithms, as well as the applications of OWL to a few selected frontier materials research problems. This research was supported by the Office of Science of the Department of Energy under contract DE-AC05-00OR22725.
3D multiphysics modeling of superconducting cavities with a massively parallel simulation suite
Kononenko, Oleksiy; Adolphsen, Chris; Li, Zenghai; ...
2017-10-10
Radiofrequency cavities based on superconducting technology are widely used in particle accelerators for various applications. The cavities usually have high quality factors and hence narrow bandwidths, so the field stability is sensitive to detuning from the Lorentz force and external loads, including vibrations and helium pressure variations. If not properly controlled, the detuning can result in a serious performance degradation of a superconducting accelerator, so an understanding of the underlying detuning mechanisms can be very helpful. Recent advances in the simulation suite ace3p have enabled realistic multiphysics characterization of such complex accelerator systems on supercomputers. In this paper, we present the new capabilities in ace3p for large-scale 3D multiphysics modeling of superconducting cavities, in particular, a parallel eigensolver for determining mechanical resonances, a parallel harmonic response solver to calculate the response of a cavity to external vibrations, and a numerical procedure to decompose mechanical loads, such as from the Lorentz force or piezoactuators, into the corresponding mechanical modes. These capabilities have been used to do an extensive rf-mechanical analysis of dressed TESLA-type superconducting cavities. Furthermore, the simulation results and their implications for the operational stability of the Linac Coherent Light Source-II are discussed.
The factorization of large composite numbers on the MPP
NASA Technical Reports Server (NTRS)
Mckurdy, Kathy J.; Wunderlich, Marvin C.
1987-01-01
The continued fraction method for factoring large integers (CFRAC) was an ideal algorithm to be implemented on a massively parallel computer such as the Massively Parallel Processor (MPP). After much effort, the first 60-digit number was factored on the MPP using about 6 1/2 hours of array time. Although this result added about 10 digits to the size of number that could be factored using CFRAC on a serial machine, it was already badly beaten by the implementation of Davis and Holdridge on the CRAY-1 using the quadratic sieve, an algorithm which is clearly superior to CFRAC for large numbers. An algorithm ideally suited to the single instruction multiple data (SIMD) massively parallel architecture is illustrated, and some of the modifications needed to make the parallel implementation effective and efficient are described.
NASA Astrophysics Data System (ADS)
Wang, Yue; Yu, Jingjun; Pei, Xu
2018-06-01
A new forward kinematics algorithm for the mechanism of 3-RPS (R: Revolute; P: Prismatic; S: Spherical) parallel manipulators is proposed in this study. This algorithm is primarily based on the special geometric conditions of the 3-RPS parallel mechanism, and it eliminates the errors produced by parasitic motions to improve and ensure accuracy. Specifically, the errors can be less than 10⁻⁶. In this method, only the group of solutions that is consistent with the actual situation of the platform is obtained rapidly. This algorithm substantially improves calculation efficiency because the selected initial values are reasonable, and all the formulas in the calculation are analytical. This novel forward kinematics algorithm is well suited for real-time and high-precision control of the 3-RPS parallel mechanism.
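For contrast with the analytical approach described above, the standard iterative baseline it improves on is a Newton solve of the kinematic closure equations from an initial guess. The generic Python sketch below shows that baseline on a toy constraint system (it is not the paper's 3-RPS algorithm; the closure equations, initial guess, and tolerances are placeholders):

```python
import numpy as np

def newton_solve(f, x0, tol=1e-10, max_iter=50, h=1e-7):
    """Generic Newton iteration on a vector constraint f(x) = 0 with a
    finite-difference Jacobian; forward-kinematics closure equations of a
    parallel mechanism can be cast in this form (this is the standard
    iterative baseline, not the paper's analytical 3-RPS algorithm)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        J = np.empty((fx.size, x.size))
        for j in range(x.size):       # one-sided finite-difference Jacobian
            xp = x.copy()
            xp[j] += h
            J[:, j] = (f(xp) - fx) / h
        x = x - np.linalg.solve(J, fx)
    return x

# Toy closure equations with a known root near (1, 0.5).
f = lambda x: np.array([x[0]**2 + x[1]**2 - 1.25, x[0] - 2.0 * x[1]])
print(newton_solve(f, [0.9, 0.4]))  # -> approximately [1.0, 0.5]
```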
RELAP5-3D Results for Phase I (Exercise 2) of the OECD/NEA MHTGR-350 MW Benchmark
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gerhard Strydom
2012-06-01
The coupling of the PHISICS code suite to the thermal hydraulics system code RELAP5-3D has recently been initiated at the Idaho National Laboratory (INL) to provide a fully coupled prismatic Very High Temperature Reactor (VHTR) system modeling capability as part of the NGNP methods development program. The PHISICS code consists of three modules: INSTANT (performing 3D nodal transport core calculations), MRTAU (depletion and decay heat generation) and a perturbation/mixer module. As part of the verification and validation activities, steady state results have been obtained for Exercise 2 of Phase I of the newly-defined OECD/NEA MHTGR-350 MW Benchmark. This exercise requires participants to calculate a steady-state solution for an End of Equilibrium Cycle 350 MW Modular High Temperature Reactor (MHTGR), using the provided geometry, material, and coolant bypass flow description. The paper provides an overview of the MHTGR Benchmark and presents typical steady state results (e.g. solid and gas temperatures, thermal conductivities) for Phase I Exercise 2. Preliminary results are also provided for the early test phase of Exercise 3 using a two-group cross-section library and the Relap5-3D model developed for Exercise 2.
VMEbus based computer and real-time UNIX as infrastructure of DAQ
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yasu, Y.; Fujii, H.; Nomachi, M.
1994-12-31
This paper describes the infrastructure the authors have constructed for a data acquisition system (DAQ). It reports recent developments concerning an HP VME board computer running LynxOS (HP742rt/HP-RT) and an Alpha/OSF1 system with a VMEbus adapter. The paper also reports the current status of DAQBENCH, a benchmark suite for data acquisition, which measures not only the performance of VME/CAMAC access but also that of context switching, inter-process communication, and so on, for various computers, including workstation-based systems and VME board computers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
2017-05-17
PeleC is an adaptive-mesh compressible hydrodynamics code for reacting flows. It solves the compressible Navier-Stokes equations with multispecies transport in a block-structured framework. The resulting algorithm is well suited for flows with localized resolution requirements and robust to discontinuities. User-controllable refinement criteria have the potential to result in extremely small numerical dissipation and dispersion, making this code appropriate for both research and applied usage. The code is built on the AMReX library, which facilitates hierarchical parallelism and manages distributed memory parallelism. PeleC algorithms are implemented to express shared memory parallelism.
NASA Technical Reports Server (NTRS)
Shapiro, Linda G.; Tanimoto, Steven L.; Ahrens, James P.
1996-01-01
The goal of this task was to create a design and prototype implementation of a database environment that is particularly suited for handling the image, vision and scientific data associated with NASA's EOC Amazon project. The focus was on a data model and query facilities that are designed to execute efficiently on parallel computers. A key feature of the environment is an interface which allows a scientist to specify high-level directives about how query execution should occur.
The International Conference on Vector and Parallel Computing (2nd)
1989-01-17
Proceedings contents include "Computation of the SVD of Bidiagonal Matrices" and "Lattice QCD - As a Large Scale Scientific Computation"; the latter describes code vectorized for the IBM 3090 Vector Facility, with Lattice QCD benchmarked on a large number of computers, including the Cray X-MP and Cray 2, where most of the time comes from the wavefront solver routine.
High-Order Methods for Computational Physics
1999-03-01
Connectivity and communications are established by building a voxel database (VDB) of geometric positions in the mesh; a VDB maps each position to a processor (Fig. 4.19). Highly accurate stability computations help expand the database for this benchmark problem.
Hybrid parallel code acceleration methods in full-core reactor physics calculations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Courau, T.; Plagne, L.; Ponicot, A.
2012-07-01
When dealing with nuclear reactor calculation schemes, the need for three dimensional (3D) transport-based reference solutions is essential for both validation and optimization purposes. Considering a benchmark problem, this work investigates the potential of discrete ordinates (Sn) transport methods applied to 3D pressurized water reactor (PWR) full-core calculations. First, the benchmark problem is described. It involves a pin-by-pin description of a 3D PWR first core, and uses an 8-group cross-section library prepared with the DRAGON cell code. Then, a convergence analysis is performed using the PENTRAN parallel Sn Cartesian code. It discusses the spatial refinement and the associated angular quadrature required to properly describe the problem physics. It also shows that initializing the Sn solution with the EDF SPN solver COCAGNE reduces the number of iterations required to converge by nearly a factor of 6. Using a best estimate model, PENTRAN results are then compared to multigroup Monte Carlo results obtained with the MCNP5 code. Good consistency is observed between the two methods (Sn and Monte Carlo), with discrepancies that are less than 25 pcm for the k-eff, and less than 2.1% and 1.6% for the flux at the pin-cell level and for the pin-power distribution, respectively. (authors)
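The pcm ("per cent mille") unit used for these k-eff discrepancies is simply 10⁻⁵. A short Python illustration with made-up k-eff values shows the two common conventions (the abstract does not spell out which variant was used):

```python
# pcm ("per cent mille", 1e-5) conventions for comparing k-eff values.
# Numbers below are illustrative, not the benchmark results.
k_sn, k_mc = 1.00012, 1.00035

dk_pcm = (k_sn - k_mc) * 1e5                  # simple difference in k
drho_pcm = (1.0 / k_mc - 1.0 / k_sn) * 1e5    # difference in reactivity
print(f"delta-k = {dk_pcm:.1f} pcm, delta-rho = {drho_pcm:.1f} pcm")
```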
Scalable Metropolis Monte Carlo for simulation of hard shapes
NASA Astrophysics Data System (ADS)
Anderson, Joshua A.; Eric Irrgang, M.; Glotzer, Sharon C.
2016-07-01
We design and implement a scalable hard particle Monte Carlo simulation toolkit (HPMC), and release it open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. We employ BVH trees instead of cell lists on the CPU for fast performance, especially with large particle size disparity, and optimize inner loops with SIMD vector intrinsics on the CPU. Our GPU kernel proposes many trial moves in parallel on a checkerboard and uses a block-level queue to redistribute work among threads and avoid divergence. HPMC supports a wide variety of shape classes, including spheres/disks, unions of spheres, convex polygons, convex spheropolygons, concave polygons, ellipsoids/ellipses, convex polyhedra, convex spheropolyhedra, spheres cut by planes, and concave polyhedra. NVT and NPT ensembles can be run in 2D or 3D triclinic boxes. Additional integration schemes permit Frenkel-Ladd free energy computations and implicit depletant simulations. In a benchmark system of a fluid of 4096 pentagons, HPMC performs 10 million sweeps in 10 min on 96 CPU cores on XSEDE Comet. The same simulation would take 7.6 h in serial. HPMC also scales to large system sizes, and the same benchmark with 16.8 million particles runs in 1.4 h on 2048 GPUs on OLCF Titan.
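The acceptance rule that makes hard-particle Monte Carlo special is binary: a trial move is accepted if and only if it creates no overlap. The serial Python sketch below shows that kernel for hard disks in a periodic box (a toy illustration of the Metropolis step HPMC parallelizes; it is not HOOMD-blue code):

```python
import random

def overlaps(pos, i, trial, diameter, box):
    """Check the trial position of disk i against all others (periodic box)."""
    d2min = diameter * diameter
    for j, (x, y) in enumerate(pos):
        if j == i:
            continue
        dx = (trial[0] - x + box / 2) % box - box / 2   # minimum image
        dy = (trial[1] - y + box / 2) % box - box / 2
        if dx * dx + dy * dy < d2min:
            return True
    return False

def sweep(pos, diameter, box, max_disp=0.1):
    """One Metropolis sweep for hard disks: a move is accepted iff it creates
    no overlap (the Boltzmann factor is 0 or 1 for hard particles)."""
    for i in range(len(pos)):
        x, y = pos[i]
        trial = ((x + random.uniform(-max_disp, max_disp)) % box,
                 (y + random.uniform(-max_disp, max_disp)) % box)
        if not overlaps(pos, i, trial, diameter, box):
            pos[i] = trial

# Dilute example: 16 disks on a grid in a 10x10 periodic box.
box, diameter = 10.0, 1.0
pos = [(1.25 + 2.5 * (k % 4), 1.25 + 2.5 * (k // 4)) for k in range(16)]
for _ in range(100):
    sweep(pos, diameter, box)
```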
Multiprocessing the Sieve of Eratosthenes
NASA Technical Reports Server (NTRS)
Bokhari, S.
1986-01-01
The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
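A minimal parallel sieve along these lines finds the base primes up to sqrt(n) serially, splits the remaining range into fixed-size segments, and hands segments to workers; the fixed segment size also makes the load-balancing sensitivity easy to observe. A sketch in Python with multiprocessing (a process pool stands in for the Flex/32 processors; segment size and worker count are arbitrary):

```python
from multiprocessing import Pool
import math

def sieve_segment(args):
    """Mark composites in [lo, hi) using the precomputed base primes."""
    lo, hi, base = args
    flags = bytearray([1]) * (hi - lo)
    for p in base:
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = 0
    return [lo + i for i, f in enumerate(flags) if f]

def parallel_sieve(n, workers=4, seg=10000):
    root = int(math.isqrt(n)) + 1
    base = [p for p in range(2, root)
            if all(p % q for q in range(2, int(math.isqrt(p)) + 1))]
    tasks = [(lo, min(lo + seg, n), base) for lo in range(root, n, seg)]
    with Pool(workers) as pool:
        primes = base[:]
        for chunk in pool.map(sieve_segment, tasks):
            primes.extend(chunk)
    return primes

if __name__ == "__main__":
    print(len(parallel_sieve(100000)))  # 9592 primes below 100000
```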
Komarov, Ivan; D'Souza, Roshan M
2012-01-01
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even with very efficient variants of the GSSA, parameter sweeps remain prohibitively expensive to compute. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
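For reference, the exact direct-method GSSA that such GPU variants parallelize fits in a few lines: draw the time to the next reaction from an exponential with the total propensity, pick a reaction proportionally to its propensity, apply its stoichiometry, repeat. A minimal serial Python sketch with a toy reversible reaction (illustrative only; the paper's contribution is the warp-level parallel data structures, not this serial loop):

```python
import random

def gillespie_direct(x, rates, stoich, propensity, t_end):
    """Minimal serial direct-method SSA: sample the waiting time from an
    exponential with total rate a0, choose the reaction proportionally to
    its propensity, apply its stoichiometry, and repeat."""
    t, traj = 0.0, [(0.0, tuple(x))]
    while t < t_end:
        a = [propensity(j, x, rates) for j in range(len(stoich))]
        a0 = sum(a)
        if a0 == 0.0:
            break                      # no reaction can fire
        t += random.expovariate(a0)
        r, acc = random.random() * a0, 0.0
        for j, aj in enumerate(a):
            acc += aj
            if r < acc:
                for k, d in stoich[j]:
                    x[k] += d
                break
        traj.append((t, tuple(x)))
    return traj

# Toy model: A + B -> C (rate c1), C -> A + B (rate c2); x = [A, B, C].
stoich = [[(0, -1), (1, -1), (2, +1)], [(0, +1), (1, +1), (2, -1)]]
def propensity(j, x, c):
    return c[0] * x[0] * x[1] if j == 0 else c[1] * x[2]

traj = gillespie_direct([100, 100, 0], (0.001, 0.1), stoich, propensity, 10.0)
print(f"{len(traj)} events; final state {traj[-1][1]}")
```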
Parallel computation with molecular-motor-propelled agents in nanofabricated networks.
Nicolau, Dan V; Lard, Mercy; Korten, Till; van Delft, Falco C M J M; Persson, Malin; Bengtsson, Elina; Månsson, Alf; Diez, Stefan; Linke, Heiner; Nicolau, Dan V
2016-03-08
The combinatorial nature of many important mathematical problems, including nondeterministic-polynomial-time (NP)-complete problems, places a severe limitation on the problem size that can be solved with conventional, sequentially operating electronic computers. There have been significant efforts in conceiving parallel-computation approaches in the past, for example: DNA computation, quantum computation, and microfluidics-based computation. However, these approaches have not proven, so far, to be scalable and practical from a fabrication and operational perspective. Here, we report the foundations of an alternative parallel-computation system in which a given combinatorial problem is encoded into a graphical, modular network that is embedded in a nanofabricated planar device. Exploring the network in a parallel fashion using a large number of independent, molecular-motor-propelled agents then solves the mathematical problem. This approach uses orders of magnitude less energy than conventional computers, thus addressing issues related to power consumption and heat dissipation. We provide a proof-of-concept demonstration of such a device by solving, in a parallel fashion, the small instance {2, 5, 9} of the subset sum problem, which is a benchmark NP-complete problem. Finally, we discuss the technical advances necessary to make our system scalable with presently available technology.
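The benchmark instance is small enough to enumerate directly, which shows exactly what the network encodes: one path per subset. A brute-force check in Python:

```python
from itertools import combinations

# The benchmark instance from the paper: which sums are reachable from {2, 5, 9}?
s = (2, 5, 9)
sums = sorted({sum(c) for r in range(len(s) + 1) for c in combinations(s, r)})
print(sums)  # [0, 2, 5, 7, 9, 11, 14, 16] -- every subset corresponds to one path
```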
The Portals 4.0 network programming interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin
2012-11-01
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
NASA Astrophysics Data System (ADS)
Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.
2017-12-01
As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
Serial vs. parallel models of attention in visual search: accounting for benchmark RT-distributions.
Moran, Rani; Zehetleitner, Michael; Liesefeld, Heinrich René; Müller, Hermann J; Usher, Marius
2016-10-01
Visual search is central to the investigation of selective visual attention. Classical theories propose that items are identified by serially deploying focal attention to their locations. While this accounts for set-size effects over a continuum of task difficulties, it has been suggested that parallel models can account for such effects equally well. We compared the serial Competitive Guided Search model with a parallel model in their ability to account for RT distributions and error rates from a large visual search data-set featuring three classical search tasks: 1) a spatial configuration search (2 vs. 5); 2) a feature-conjunction search; and 3) a unique feature search (Wolfe, Palmer & Horowitz Vision Research, 50(14), 1304-1311, 2010). In the parallel model, each item is represented by a diffusion to two boundaries (target-present/absent); the search corresponds to a parallel race between these diffusors. The parallel model was highly flexible in that it allowed both for a parametric range of capacity-limitation and for set-size adjustments of identification boundaries. Furthermore, a quit unit allowed for a continuum of search-quitting policies when the target is not found, with "single-item inspection" and exhaustive searches comprising its extremes. The serial model was found to be superior to the parallel model, even before penalizing the parallel model for its increased complexity. We discuss the implications of the results and the need for future studies to resolve the debate.
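A toy version of the parallel model's core mechanics can be simulated directly: each display item runs an independent diffusion to a target/non-target boundary, "present" is declared as soon as any diffusor reaches the target boundary, and "absent" once every item has finished at the non-target boundary. The Python sketch below uses arbitrary parameters and omits the capacity limits, set-size boundary adjustments, and quit unit of the actual model:

```python
import random

def parallel_race_trial(n_items, target_present, drift=0.03, noise=1.0,
                        bound=30.0, dt=1.0):
    """Toy parallel race: every item runs its own diffusion to a
    'target'/'non-target' boundary; 'present' is declared when any diffusor
    hits the target boundary, 'absent' when all items finish at the other
    one. Parameters are arbitrary illustration values, not fits to data."""
    evidence = [0.0] * n_items
    done = [False] * n_items
    t = 0.0
    while True:
        t += dt
        for i in range(n_items):
            if done[i]:
                continue
            mu = drift if (target_present and i == 0) else -drift
            evidence[i] += mu * dt + noise * random.gauss(0.0, dt ** 0.5)
            if evidence[i] >= bound:
                return "present", t          # one diffusor reached 'target'
            if evidence[i] <= -bound:
                done[i] = True               # item identified as non-target
        if all(done):
            return "absent", t

random.seed(1)
print(parallel_race_trial(n_items=6, target_present=True))
```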
NASA Astrophysics Data System (ADS)
Rizki, Permata Nur Miftahur; Lee, Heezin; Lee, Minsu; Oh, Sangyoon
2017-01-01
With the rapid advance of remote sensing technology, the amount of three-dimensional point-cloud data has increased extraordinarily, requiring faster processing in the construction of digital elevation models. There have been several attempts to accelerate the computation using parallel methods; however, little attention has been given to investigating different approaches for selecting the parallel programming model best suited to a given computing environment. We present our findings and insights identified by implementing three popular high-performance parallel approaches (message passing interface, MapReduce, and GPGPU) on time-demanding but accurate kriging interpolation. The performances of the approaches are compared by varying the size of the grid and input data. In our empirical experiment, we demonstrate the significant acceleration by all three approaches compared to a C-implemented sequential-processing method. In addition, we also discuss the pros and cons of each method in terms of usability, complexity, infrastructure, and platform limitations to give readers a better understanding of utilizing those parallel approaches for gridding purposes.
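The interpolation kernel being parallelized is itself compact: ordinary kriging solves one small linear system per grid cell. The Python sketch below shows a serial single-point version with an exponential variogram (variogram parameters are illustrative; the paper's parallel versions distribute the grid cells across MPI ranks, MapReduce tasks, or GPU threads):

```python
import numpy as np

def ordinary_kriging(xy, z, query, sill=1.0, rng=500.0):
    """Minimal serial ordinary kriging with an exponential variogram; the
    parallel versions compared in the paper distribute the query points
    (grid cells) across workers. Variogram parameters are illustrative."""
    gamma = lambda h: sill * (1.0 - np.exp(-3.0 * h / rng))
    n = len(xy)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)
    A[n, n] = 0.0                     # Lagrange-multiplier row/column
    b = np.ones(n + 1)
    b[:n] = gamma(np.linalg.norm(xy - query, axis=1))
    w = np.linalg.solve(A, b)
    return w[:n] @ z                  # kriged estimate at the query point

xy = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [80.0, 90.0]])
z = np.array([10.0, 12.0, 11.0, 14.0])
print(f"z(50, 50) ~= {ordinary_kriging(xy, z, np.array([50.0, 50.0])):.2f}")
```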
Parallel VLSI architecture emulation and the organization of APSA/MPP
NASA Technical Reports Server (NTRS)
Odonnell, John T.
1987-01-01
The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.
Parallel architectures for iterative methods on adaptive, block structured grids
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1983-01-01
A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one-to-one mapping of grids to systolic-style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be no regular global structure to the grids constructed, there will be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.
Progress in Unsteady Turbopump Flow Simulations Using Overset Grid Systems
NASA Technical Reports Server (NTRS)
Kiris, Cetin C.; Chan, William; Kwak, Dochan
2002-01-01
This viewgraph presentation provides information on unsteady flow simulations for the Second Generation RLV (Reusable Launch Vehicle) baseline turbopump. Three impeller rotations were simulated using a model with 34.3 million grid points. MPI/OpenMP hybrid parallelism and MLP shared memory parallelism have been implemented and benchmarked in INS3D, an incompressible Navier-Stokes solver. For RLV turbopump simulations a speedup of more than 30 times has been obtained. Moving boundary capability is obtained by using the DCF module. Scripting capability from CAD geometry to solution is developed. Unsteady flow simulations for the advanced consortium impeller/diffuser using a model with 39 million grid points are currently underway. 1.2 impeller rotations are completed. The fluid/structure coupling is initiated.
Support of Multidimensional Parallelism in the OpenMP Programming Model
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele
2003-01-01
OpenMP is the current standard for shared-memory programming. While providing ease of parallel programming, the OpenMP programming model also has limitations which often affect the scalability of applications. Examples of these limitations are work distribution and point-to-point synchronization among threads. We propose extensions to the OpenMP programming model which allow the user to easily distribute the work in multiple dimensions and synchronize the workflow among the threads. The proposed extensions include four new constructs and the associated runtime library. They do not require changes to the source code and can be implemented based on the existing OpenMP standard. We illustrate the concept in a prototype translator and test with benchmark codes and a cloud modeling code.
Multi-Purpose, Application-Centric, Scalable I/O Proxy Application
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, M. C.
2015-06-15
MACSio is a Multi-purpose, Application-Centric, Scalable I/O proxy application. It is designed to support a number of goals with respect to parallel I/O performance testing and benchmarking, including the ability to test and compare various I/O libraries and I/O paradigms, to predict the scalable performance of real applications, and to help identify where improvements in I/O performance can be made within the HPC I/O software stack.
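At its simplest, the quantity such a proxy measures is bytes moved per unit time under a controlled dump pattern. The single-process Python sketch below times sequential block writes with an fsync (a toy stand-in to show the measurement, not MACSio's multi-purpose, parallel methodology; the path and sizes are arbitrary):

```python
import os
import time

def write_throughput(path, total_mb=256, block_kb=1024):
    """Time sequential block writes and report MB/s; a toy single-process
    stand-in for the dump patterns an I/O proxy application exercises
    (path and sizes here are arbitrary)."""
    block = os.urandom(block_kb * 1024)
    n_blocks = total_mb * 1024 // block_kb
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())          # include time to reach the device
    dt = time.perf_counter() - t0
    os.remove(path)
    return total_mb / dt

print(f"{write_throughput('/tmp/io_bench.dat'):.1f} MB/s")
```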
PeakRanger: A cloud-enabled peak caller for ChIP-seq data
2011-01-01
Background: Chromatin immunoprecipitation (ChIP), coupled with massively parallel short-read sequencing (seq), is used to probe chromatin dynamics. Although there are many algorithms to call peaks from ChIP-seq datasets, most are tuned either to handle punctate sites, such as transcriptional factor binding sites, or broad regions, such as histone modification marks; few can do both. Other algorithms are limited in their configurability, performance on large data sets, and ability to distinguish closely-spaced peaks. Results: In this paper, we introduce PeakRanger, a peak caller software package that works equally well on punctate and broad sites, can resolve closely-spaced peaks, has excellent performance, and is easily customized. In addition, PeakRanger can be run in a parallel cloud computing environment to obtain extremely high performance on very large data sets. We present a series of benchmarks to evaluate PeakRanger against 10 other peak callers, and demonstrate the performance of PeakRanger on both real and synthetic data sets. We also present real world usages of PeakRanger, including peak-calling in the modENCODE project. Conclusions: Compared to other peak callers tested, PeakRanger offers improved resolution in distinguishing extremely closely-spaced peaks. PeakRanger has above-average spatial accuracy in terms of identifying the precise location of binding events. PeakRanger also has excellent sensitivity and specificity in all benchmarks evaluated. In addition, PeakRanger offers significant improvements in run time when running on a single processor system, and very marked improvements when allowed to take advantage of the MapReduce parallel environment offered by a cloud computing resource. PeakRanger can be downloaded at the official site of the modENCODE project: http://www.modencode.org/software/ranger/ PMID:21554709
Strydom, G.; Epiney, A. S.; Alfonsi, Andrea; ...
2015-12-02
The PHISICS code system has been under development at INL since 2010. It consists of several modules providing improved coupled core simulation capability: INSTANT (3D nodal transport core calculations), MRTAU (depletion and decay heat generation) and modules performing criticality searches, fuel shuffling and generalized perturbation. Coupling of the PHISICS code suite to the thermal hydraulics system code RELAP5-3D was finalized in 2013, and as part of the verification and validation effort the first phase of the OECD/NEA MHTGR-350 Benchmark has now been completed. The theoretical basis and latest development status of the coupled PHISICS/RELAP5-3D tool are described in more detail in a concurrent paper. This paper provides an overview of the OECD/NEA MHTGR-350 Benchmark and presents the results of Exercises 2 and 3 defined for Phase I. Exercise 2 required the modelling of a stand-alone thermal fluids solution at End of Equilibrium Cycle for the Modular High Temperature Reactor (MHTGR). The RELAP5-3D results of four sub-cases are discussed, consisting of various combinations of coolant bypass flows and material thermophysical properties. Exercise 3 required a coupled neutronics and thermal fluids solution, and the PHISICS/RELAP5-3D code suite was used to calculate the results of two sub-cases. The main focus of the paper is a comparison of results obtained with the traditional RELAP5-3D “ring” model approach against a much more detailed model that includes kinetics feedback on individual block level and thermal feedbacks on a triangular sub-mesh. The higher fidelity that can be obtained by this “block” model is illustrated with comparison results on the temperature, power density and flux distributions. Furthermore, it is shown that the ring model leads to significantly lower fuel temperatures (up to 10%) when compared with the higher fidelity block model, and that the additional model development and run-time efforts are worth the gains obtained in the improved spatial temperature and flux distributions.
Benchmarking Evaluation Results for Prototype Extravehicular Activity Gloves
NASA Technical Reports Server (NTRS)
Aitchison, Lindsay; McFarland, Shane
2012-01-01
The Space Suit Assembly (SSA) Development Team at NASA Johnson Space Center has invested heavily in the advancement of rear-entry planetary exploration suit design but largely deferred development of extravehicular activity (EVA) glove designs, and accepted the risk of using the current flight gloves, Phase VI, for unique mission scenarios outside the Space Shuttle and International Space Station (ISS) Program realm of experience. However, as design reference missions mature, the risks of using heritage hardware have highlighted the need for developing robust new glove technologies. To address the technology gap, the NASA Game-Changing Technology group provided start-up funding for the High Performance EVA Glove (HPEG) Project in the spring of 2012. The overarching goal of the HPEG Project is to develop a robust glove design that increases human performance during EVA and creates a pathway for future implementation of emergent technologies, with specific aims of increasing pressurized mobility to 60% of barehanded capability, increasing durability by 100%, and decreasing the potential of gloves to cause injury during use. The HPEG Project focused initial efforts on identifying potential new technologies and benchmarking the performance of current state-of-the-art gloves to identify trends in design and fit, and to establish standards and metrics against which emerging technologies can be assessed at both the component and assembly levels. The first of the benchmarking tests evaluated the quantitative mobility performance and subjective fit of four prototype gloves developed by Flagsuit LLC, Final Frontier Designs, ILC Dover, and David Clark Company as compared to the Phase VI. All of the companies were asked to design and fabricate gloves to the same set of NASA-provided hand measurements (which corresponded to a single size of Phase VI glove) and to focus their efforts on improving mobility in the metacarpal phalangeal and carpometacarpal joints. Four test subjects representing the design-target hand anthropometry completed range of motion, grip/pinch strength, dexterity, and fit evaluations for each glove design in both the unpressurized and pressurized conditions. This paper provides a comparison of the test results along with a detailed description of the hardware and test methodologies used.
Structural tests using a MEMS acoustic emission sensor
NASA Astrophysics Data System (ADS)
Oppenheim, Irving J.; Greve, David W.; Ozevin, Didem; Hay, D. Robert; Hay, Thomas R.; Pessiki, Stephen P.; Tyson, Nathan L.
2006-03-01
In a collaborative project at Lehigh and Carnegie Mellon, a MEMS acoustic emission sensor was designed and fabricated as a suite of six resonant-type capacitive transducers in the frequency range between 100 and 500 kHz. Characterization studies showed good comparisons between predicted and experimental electro-mechanical behavior. Acoustic emission events, simulated experimentally in steel ball impact and in pencil lead break tests, were detected and source localization was demonstrated. In this paper we describe the application of the MEMS device in structural testing, both in laboratory and in field applications. We discuss our findings regarding housing and mounting (acoustic coupling) of the MEMS device with its supporting electronics, and we then report the results of structural testing. In all tests, the MEMS transducers were used in parallel with commercial acoustic emission sensors, which thereby serve as a benchmark and permit a direct observation of MEMS device functionality. All tests involved steel structures, with particular interest in propagation of existing cracks or flaws. A series of four laboratory tests were performed on beam specimens fabricated from two segments (Grade 50 steel) with a full penetration weld (E70T-4 electrode material) at midspan. That weld region was notched, an initial fatigue crack was induced, and the specimens were then instrumented with one commercial transducer and with one MEMS device; data was recorded from five individual transducers on the MEMS device. Under a four-point bending test, the beam displayed both inelastic behavior and crack propagation, including load drops associated with crack instability. The MEMS transducers detected all instability events as well as many or most of the acoustic emissions occurring during plasticity and stable crack growth. The MEMS transducers were less sensitive than the commercial transducer, and did not detect as many events, but the normalized cumulative burst count obtained from the MEMS transducers paralleled the count obtained from the commercial transducer. Waveform analysis of signals from the MEMS transducers provided additional information concerning arrivals of P-waves and S-waves. Similarly, the analysis provided additional confirmation that the acoustic emissions emanated from the damage zone near the crack tip, and were not spurious signals or artifacts. Subsequent tests were conducted in a field application where the MEMS transducers were redundant to a group of commercial transducers. The application example is a connection plate in truss bridge construction under passage of heavy traffic loads. The MEMS transducers were found to be functional, but were less sensitive in their present form than existing commercial transducers. We conclude that the transducers are usable in their current configuration and we outline applications for which they are presently suited, and then we discuss alternate MEMS structures that would provide greater sensitivity.
New features and improved uncertainty analysis in the NEA nuclear data sensitivity tool (NDaST)
NASA Astrophysics Data System (ADS)
Dyrda, J.; Soppera, N.; Hill, I.; Bossant, M.; Gulliford, J.
2017-09-01
Following the release and initial testing period of the NEA's Nuclear Data Sensitivity Tool [1], new features have been designed and implemented in order to expand its uncertainty analysis capabilities. The aim is to provide a free online tool for integral benchmark testing, that is both efficient and comprehensive, meeting the needs of the nuclear data and benchmark testing communities. New features include access to P1 sensitivities for the neutron scattering angular distribution [2] and constrained Chi sensitivities for the prompt fission neutron energy sampling. Both of these are compatible with covariance data accessed via the JANIS nuclear data software, enabling propagation of the resultant uncertainties in keff to a large series of integral experiment benchmarks. These capabilities are available using a number of different covariance libraries, e.g., ENDF/B, JEFF, JENDL and TENDL, allowing comparison of the broad range of results it is possible to obtain. The IRPhE database of reactor physics measurements is now also accessible within the tool in addition to the criticality benchmarks from ICSBEP. Other improvements include the ability to determine and visualise the energy dependence of a given calculated result in order to better identify specific regions of importance or high uncertainty contribution. Sorting and statistical analysis of the selected benchmark suite is now also provided. Examples of the plots generated by the software are included to illustrate such capabilities. Finally, a number of analytical expressions, for example Maxwellian and Watt fission spectra, will be included. This will allow the analyst to determine the impact of varying such distributions within the data evaluation, either through adjustment of parameters within the expressions, or by comparison to a more general probability distribution fitted to measured data. The impact of such changes is verified through calculations which are compared to a 'direct' measurement found by adjustment of the original ENDF format file.
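The keff uncertainty propagation described here follows the standard "sandwich" rule, sigma^2(keff) = S^T C S, with S the vector of keff sensitivities to a cross section by energy group and C the corresponding relative covariance matrix; the three-group numbers below are invented purely for illustration.

    import numpy as np

    # Sandwich rule: propagate a relative covariance matrix C through a
    # sensitivity vector S to get the variance of k-eff. All numbers here
    # are illustrative, not taken from any evaluated library.
    S = np.array([0.12, 0.30, 0.05])           # dk/k per unit dsigma/sigma
    C = 1e-4 * np.array([[4.0, 1.0, 0.0],
                         [1.0, 2.5, 0.5],
                         [0.0, 0.5, 1.0]])     # relative covariance

    var_k = S @ C @ S
    print(f"uncertainty in k-eff: {100 * np.sqrt(var_k):.3f} %")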
apGA: An adaptive parallel genetic algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liepins, G.E.; Baluja, S.
1991-01-01
We develop apGA, a parallel variant of the standard generational GA, that combines aggressive search with perpetual novelty, yet is able to preserve enough genetic structure to optimally solve variably scaled, non-uniform block deceptive and hierarchical deceptive problems. apGA combines elitism, adaptive mutation, adaptive exponential scaling, and temporal memory. We present empirical results for six classes of problems, including the DeJong test suite. Although we have not investigated hybrids, we note that apGA could be incorporated into other recent GA variants such as GENITOR, CHC, and the recombination stage of mGA. 12 refs., 2 figs., 2 tabs.
Molecular Symmetry in Ab Initio Calculations
NASA Astrophysics Data System (ADS)
Madhavan, P. V.; Whitten, J. L.
1987-05-01
A scheme is presented for the construction of the Fock matrix in LCAO-SCF calculations and for the transformation of basis integrals to LCAO-MO integrals that can utilize several symmetry unique lists of integrals corresponding to different symmetry groups. The algorithm is fully compatible with vector processing machines and is especially suited for parallel processing machines.
Magnetosphere simulations with a high-performance 3D AMR MHD Code
NASA Astrophysics Data System (ADS)
Gombosi, Tamas; Dezeeuw, Darren; Groth, Clinton; Powell, Kenneth; Song, Paul
1998-11-01
BATS-R-US is a high-performance 3D AMR MHD code for space physics applications running on massively parallel supercomputers. In BATS-R-US the electromagnetic and fluid equations are solved with a high-resolution upwind numerical scheme in a tightly coupled manner. The code is very robust and it is capable of spanning a wide range of plasma parameters (such as β, acoustic and Alfvénic Mach numbers). Our code is highly scalable: it achieved a sustained performance of 233 GFLOPS on a Cray T3E-1200 supercomputer with 1024 PEs. This talk reports results from the BATS-R-US code for the GGCM (Geospace General Circulation Model) Phase 1 Standard Model Suite. This model suite contains 10 different steady-state configurations: 5 IMF clock angles (north, south, and three equally spaced angles in between) with 2 IMF field strengths for each angle (5 nT and 10 nT). The other parameters are: solar wind speed = 400 km/sec; solar wind number density = 5 protons/cc; Hall conductance = 0; Pedersen conductance = 5 S; parallel conductivity = ∞.
Renner, Franziska
2016-09-01
Monte Carlo simulations are regarded as the most accurate method of solving complex problems in the field of dosimetry and radiation transport. In (external) radiation therapy they are increasingly used for the calculation of dose distributions during treatment planning. In comparison to other algorithms for the calculation of dose distributions, Monte Carlo methods have the capability of improving the accuracy of dose calculations - especially under complex circumstances (e.g. consideration of inhomogeneities). However, there is a lack of knowledge of how accurate the results of Monte Carlo calculations are on an absolute basis. A practical verification of the calculations can be performed by direct comparison with the results of a benchmark experiment. This work presents such a benchmark experiment and compares its results (with detailed consideration of measurement uncertainty) with the results of Monte Carlo calculations using the well-established Monte Carlo code EGSnrc. The experiment was designed to have parallels to external beam radiation therapy with respect to the type and energy of the radiation, the materials used and the kind of dose measurement. Because the properties of the beam have to be well known in order to compare the results of the experiment and the simulation on an absolute basis, the benchmark experiment was performed using the research electron accelerator of the Physikalisch-Technische Bundesanstalt (PTB), whose beam was accurately characterized in advance. The benchmark experiment and the corresponding Monte Carlo simulations were carried out for two different types of ionization chambers and the results were compared. Considering the uncertainty, which is about 0.7 % for the experimental values and about 1.0 % for the Monte Carlo simulation, the results of the simulation and the experiment coincide. Copyright © 2015. Published by Elsevier GmbH.
The portals 4.0.1 network programming interface.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin
2013-04-01
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
Control of parallel manipulators using force feedback
NASA Technical Reports Server (NTRS)
Nanua, Prabjot
1994-01-01
Two control schemes are compared for parallel robotic mechanisms actuated by hydraulic cylinders. One scheme, the 'rate based scheme', uses the position and rate information only for feedback. The second scheme, the 'force based scheme' feeds back the force information also. The force control scheme is shown to improve the response over the rate control one. It is a simple constant gain control scheme better suited to parallel mechanisms. The force control scheme can be easily modified for the dynamic forces on the end effector. This paper presents the results of a computer simulation of both the rate and force control schemes. The gains in the force based scheme can be individually adjusted in all three directions, whereas the adjustment in just one direction of the rate based scheme directly affects the other two directions.
NASA Technical Reports Server (NTRS)
Burleigh, Scott
2008-01-01
This slide presentation reviews the activity around the Asynchronous Message Service (AMS) prototype. An AMS reference implementation has been available since late 2005. It is aimed at supporting message exchange both in on-board environments and over space links. The implementation incorporates all mandatory elements of the draft recommendation from July 2007: (1) MAMS, AMS, and RAMS protocols; (2) failover, heartbeats, and resync; (3) "hooks" for security, but no cipher suites included in the distribution. The performance is reviewed, and a benchmark latency test over VxWorks message queues is shown as histograms of message count versus microseconds per 1000-byte message.
VHSIC Hardware Description Language (VHDL) Benchmark Suite
1990-10-01
[Only fragments of the report survive extraction: portions of Appendix B (Test Descriptions, Shell Code) listing benchmark test categories such as signals, access operations, file I/O (read, write), and label sites covering signals, architectures, blocks, ports, variables, and configuration specifications.]
A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG
NASA Astrophysics Data System (ADS)
Griffiths, M. K.; Fedun, V.; Erdélyi, R.
2015-03-01
Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Satake, Shin-ichi; Kanamori, Hiroyuki; Kunugi, Tomoaki
2007-02-01
We have developed a parallel algorithm for microdigital-holographic particle-tracking velocimetry. The algorithm is used in (1) numerical reconstruction of a particle image computed from a digital hologram, and (2) searching for particles. The numerical reconstruction from the digital hologram makes use of the Fresnel diffraction equation and the FFT (fast Fourier transform), whereas the particle search algorithm looks for local maximum gradation in a reconstruction field represented by a 3D matrix. To achieve high performance computing for both calculations (reconstruction and particle search), two memory partitions are allocated to the 3D matrix. In this matrix, the reconstruction part consists of horizontally placed 2D memory partitions on the x-y plane for the FFT, whereas the particle search part consists of vertically placed 2D memory partitions set along the z axes. Consequently, scalability is obtained in proportion to the number of processor elements, where the benchmarks are carried out for parallel computation on an SGI Altix machine.
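A serial sketch of the two kernels is given below, assuming a square hologram sampled at pitch dx: Fresnel propagation applied as a transfer function in Fourier space, and particles taken as local intensity maxima above a threshold. The function names, wavelength, depths and threshold are illustrative assumptions, not the paper's parameters.

    import numpy as np
    from scipy.ndimage import maximum_filter

    def fresnel_propagate(u0, wavelength, z, dx):
        # Reconstruct the complex field at depth z from hologram u0 by
        # multiplying its spectrum with the Fresnel transfer function.
        fx = np.fft.fftfreq(u0.shape[0], d=dx)
        FX, FY = np.meshgrid(fx, fx)
        H = np.exp(-1j * np.pi * wavelength * z * (FX**2 + FY**2))
        return np.fft.ifft2(np.fft.fft2(u0) * H)

    def find_particles(volume, size=3, threshold=0.5):
        # A voxel is a particle candidate if it is the maximum of its
        # neighbourhood and exceeds the intensity threshold.
        local_max = volume == maximum_filter(volume, size=size)
        return np.argwhere(local_max & (volume > threshold))

    # Build a 3D reconstruction field from one hologram, slice by slice.
    u0 = np.random.rand(64, 64)               # stand-in hologram
    depths = np.linspace(0.01, 0.05, 16)      # metres (illustrative)
    stack = np.stack([np.abs(fresnel_propagate(u0, 532e-9, z, 10e-6))**2
                      for z in depths])
    print(find_particles(stack, threshold=stack.mean() + 3 * stack.std()).shape)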
NASA Astrophysics Data System (ADS)
Wang, Ting; Plecháč, Petr
2017-12-01
Stochastic reaction networks that exhibit bistable behavior are common in systems biology, materials science, and catalysis. Sampling of stationary distributions is crucial for understanding and characterizing the long-time dynamics of bistable stochastic dynamical systems. However, simulations are often hindered by the insufficient sampling of rare transitions between the two metastable regions. In this paper, we apply the parallel replica method for a continuous time Markov chain in order to improve sampling of the stationary distribution in bistable stochastic reaction networks. The proposed method uses parallel computing to accelerate the sampling of rare transitions. Furthermore, it can be combined with the path-space information bounds for parametric sensitivity analysis. With the proposed methodology, we study three bistable biological networks: the Schlögl model, the genetic switch network, and the enzymatic futile cycle network. We demonstrate the algorithmic speedup achieved in these numerical benchmarks. More significant acceleration is expected when multi-core or graphics processing unit computer architectures and programming tools such as CUDA are employed.
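For reference, a plain serial Gillespie simulation of the first benchmark, the Schlögl model, is sketched below; the parallel replica machinery itself is omitted, and the rate constants are illustrative values for which the chain is bistable, not those used in the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    # Schlogl model: four reactions in a single species X, with the buffer
    # species B1 and B2 absorbed into the rate constants (values below are
    # illustrative ones for which the chain is bistable, not the paper's).
    def propensities(x):
        return np.array([
            1.5e-2 * x * (x - 1),                # B1 + 2X -> 3X
            (1e-4 / 6) * x * (x - 1) * (x - 2),  # 3X -> B1 + 2X
            200.0,                               # B2 -> X
            3.5 * x,                             # X -> B2
        ])

    stoichiometry = np.array([+1, -1, +1, -1])

    def ssa(x0, t_end):
        # Standard Gillespie stochastic simulation algorithm.
        t, x, trajectory = 0.0, x0, [(0.0, x0)]
        while t < t_end:
            a = propensities(x)
            total = a.sum()
            t += rng.exponential(1.0 / total)
            x += stoichiometry[rng.choice(4, p=a / total)]
            trajectory.append((t, x))
        return trajectory

    # Long runs hop rarely between the metastable states near x ~ 85 and
    # x ~ 565; sampling those rare hops is what the parallel replica
    # method accelerates.
    print(ssa(x0=250, t_end=10.0)[-1])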
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arnold H. Kritz
PTRANSP, which is the predictive version of the TRANSP code, was developed in a collaborative effort involving the Princeton Plasma Physics Laboratory, General Atomics Corporation, Lawrence Livermore National Laboratory, and Lehigh University. The PTRANSP/TRANSP suite of codes is the premier integrated tokamak modeling software in the United States. A production service for PTRANSP/TRANSP simulations is maintained at the Princeton Plasma Physics Laboratory; the server has a simple command line client interface and is subscribed to by about 100 researchers from tokamak projects in the US, Europe, and Asia. This service produced nearly 13,000 PTRANSP/TRANSP simulations in the four-year period FY 2005 through FY 2008. Major archives of TRANSP results are maintained at PPPL, MIT, General Atomics, and JET. Recent utilization, counting experimental analysis simulations as well as predictive simulations, more than doubled from slightly over 2000 simulations per year in FY 2005 and FY 2006 to over 4300 simulations per year in FY 2007 and FY 2008. PTRANSP predictive simulations applied to ITER increased eightfold from 30 simulations per year in FY 2005 and FY 2006 to 240 simulations per year in FY 2007 and FY 2008, accounting for more than half of combined PTRANSP/TRANSP service CPU resource utilization in FY 2008. PTRANSP studies focused on ITER played a key role in journal articles. Examples of validation studies carried out for momentum transport in PTRANSP simulations were presented at the 2008 IAEA conference. The increase in number of PTRANSP simulations has continued (more than 7000 TRANSP/PTRANSP simulations in 2010) and results of PTRANSP simulations appear in conference proceedings, for example the 2010 IAEA conference, and in peer reviewed papers. PTRANSP provides a bridge to the Fusion Simulation Program (FSP) and to the future of integrated modeling. Through years of widespread usage, each of the many parts of the PTRANSP suite of codes has been thoroughly validated against experimental data and benchmarked against other codes. At the same time, architectural modernizations are improving the modularity of the PTRANSP code base. The NUBEAM neutral beam and fusion products fast ion model, the Plasma State data repository (developed originally in the SWIM SciDAC project and adapted for use in PTRANSP), and other components are already shared with the SWIM, FACETS, and CPES SciDAC FSP prototype projects. Thus, the PTRANSP code is already serving as a bridge between our present integrated modeling capability and future capability. As the Fusion Simulation Program builds toward the facility currently available in the PTRANSP suite of codes, early versions of the FSP core plasma model will need to be benchmarked against the PTRANSP simulations. This will be necessary to build user confidence in FSP, but this benchmarking can only be done if PTRANSP itself is maintained and developed.
Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.
2017-01-01
Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools—we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines. PMID:29372115
Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S
2017-01-01
As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
NASA Astrophysics Data System (ADS)
Kim, Jae Wook
2013-05-01
This paper proposes a novel systematic approach for the parallelization of pentadiagonal compact finite-difference schemes and filters based on domain decomposition. The proposed approach allows a pentadiagonal banded matrix system to be split into quasi-disjoint subsystems by using a linear-algebraic transformation technique. As a result the inversion of pentadiagonal matrices can be implemented within each subdomain in an independent manner subject to a conventional halo-exchange process. The proposed matrix transformation leads to new subdomain boundary (SB) compact schemes and filters that require three halo terms to exchange with neighboring subdomains. The internode communication overhead in the present approach is equivalent to that of standard explicit schemes and filters based on seven-point discretization stencils. The new SB compact schemes and filters demand additional arithmetic operations compared to the original serial ones. However, it is shown that the additional cost becomes sufficiently low by choosing optimal sizes of their discretization stencils. Compared to earlier published results, the proposed SB compact schemes and filters successfully reduce parallelization artifacts arising from subdomain boundaries to a level sufficiently negligible for sophisticated aeroacoustic simulations without degrading parallel efficiency. The overall performance and parallel efficiency of the proposed approach are demonstrated by stringent benchmark tests.
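The serial building block being parallelized here, inversion of a pentadiagonal banded system, can be illustrated with SciPy's banded solver; the stencil weights below are placeholders rather than the coefficients of any compact scheme from the paper.

    import numpy as np
    from scipy.linalg import solve_banded

    n = 10
    # Placeholder symmetric pentadiagonal stencil (second, first and main
    # diagonals); not the weights of any particular compact scheme.
    d2, d1, d0 = 1.0, 4.0, 10.0

    # SciPy banded storage: row u + i - j holds A[i, j]; here (l, u) = (2, 2).
    ab = np.zeros((5, n))
    ab[0, 2:] = d2     # second superdiagonal
    ab[1, 1:] = d1     # first superdiagonal
    ab[2, :] = d0      # main diagonal
    ab[3, :-1] = d1    # first subdiagonal
    ab[4, :-2] = d2    # second subdiagonal

    x = solve_banded((2, 2), ab, np.ones(n))

    # Verify against the equivalent dense system.
    A = sum(np.eye(n, k=k) * c
            for k, c in [(-2, d2), (-1, d1), (0, d0), (1, d1), (2, d2)])
    print(np.allclose(A @ x, np.ones(n)))   # True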
Cavitation, Flow Structure and Turbulence in the Tip Region of a Rotor Blade
NASA Technical Reports Server (NTRS)
Wu, H.; Miorini, R.; Soranna, F.; Katz, J.; Michael, T.; Jessup, S.
2010-01-01
Objectives: Measure the flow structure and turbulence within a naval axial waterjet pump. Create a database for benchmarking and validation of parallel computational efforts. Address flow and turbulence modeling issues that are unique to this complex environment. Measure and model flow phenomena affecting cavitation within the pump and its effect on pump performance. This presentation focuses on cavitation phenomena and associated flow structure in the tip region of a rotor blade.
Time-Dependent Simulations of Incompressible Flow in a Turbopump Using Overset Grid Approach
NASA Technical Reports Server (NTRS)
Kiris, Cetin; Kwak, Dochan
2001-01-01
This viewgraph presentation provides information on mathematical modelling of the SSME (space shuttle main engine). The unsteady SSME-rig1 start-up procedure from the pump at rest has been initiated by using 34.3 million grid points. The computational model for the SSME-rig1 has been completed. Moving boundary capability is obtained by using DCF module in OVERFLOW-D. MPI (Message Passing Interface)/OpenMP hybrid parallel code has been benchmarked.
Hamdy, M; Hamdan, I
2015-07-01
In this paper, a robust H∞ fuzzy output feedback controller is designed for a class of affine nonlinear systems with disturbance via a Takagi-Sugeno (T-S) fuzzy bilinear model. The parallel distributed compensation (PDC) technique is utilized to design the fuzzy controller. The stability conditions of the overall closed-loop T-S fuzzy bilinear model are formulated in terms of a Lyapunov function via linear matrix inequalities (LMIs). The control law is robustified in the H∞ sense to attenuate external disturbance. Moreover, the desired controller gains can be obtained by solving a set of LMIs. A continuous stirred tank reactor (CSTR), which is a benchmark problem in nonlinear process control, is discussed in detail to verify the effectiveness of the proposed approach with a comparative study. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.
Running Neuroimaging Applications on Amazon Web Services: How, When, and at What Cost?
Madhyastha, Tara M; Koh, Natalie; Day, Trevor K M; Hernández-Fernández, Moises; Kelley, Austin; Peterson, Daniel J; Rajan, Sabreena; Woelfer, Karl A; Wolf, Jonathan; Grabowski, Thomas J
2017-01-01
The contribution of this paper is to identify and describe current best practices for using Amazon Web Services (AWS) to execute neuroimaging workflows "in the cloud." Neuroimaging offers a vast set of techniques by which to interrogate the structure and function of the living brain. However, many of the scientists for whom neuroimaging is an extremely important tool have limited training in parallel computation. At the same time, the field is experiencing a surge in computational demands, driven by a combination of data-sharing efforts, improvements in scanner technology that allow acquisition of images with higher image resolution, and by the desire to use statistical techniques that stress processing requirements. Most neuroimaging workflows can be executed as independent parallel jobs and are therefore excellent candidates for running on AWS, but the overhead of learning to do so and determining whether it is worth the cost can be prohibitive. In this paper we describe how to identify neuroimaging workloads that are appropriate for running on AWS, how to benchmark execution time, and how to estimate cost of running on AWS. By benchmarking common neuroimaging applications, we show that cloud computing can be a viable alternative to on-premises hardware. We present guidelines that neuroimaging labs can use to provide a cluster-on-demand type of service that should be familiar to users, and scripts to estimate cost and create such a cluster.
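For independent parallel jobs, the cost estimate the authors describe reduces to instance-hours multiplied by the hourly price; the helper below is a back-of-envelope sketch with entirely hypothetical prices and runtimes, not figures from the paper.

    def estimate_cost(n_subjects, hours_per_subject, price_per_hour,
                      subjects_per_instance=1):
        # Embarrassingly parallel batch: each subject's pipeline runs as an
        # independent job, so cost scales linearly with instance-hours.
        instance_hours = n_subjects * hours_per_subject / subjects_per_instance
        return instance_hours * price_per_hour

    # e.g. 100 subjects, 2.5 h each, at a hypothetical $0.34/h on-demand rate:
    print(f"${estimate_cost(100, 2.5, 0.34):.2f}")   # -> $85.00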
pyRMSD: a Python package for efficient pairwise RMSD matrix calculation and handling.
Gil, Víctor A; Guallar, Víctor
2013-09-15
We introduce pyRMSD, an open source standalone Python package that aims at offering an integrative and efficient way of performing Root Mean Square Deviation (RMSD)-related calculations of large sets of structures. It is specially tuned to do fast collective RMSD calculations, such as pairwise RMSD matrices, implementing up to three well-known superposition algorithms. pyRMSD provides its own symmetric distance matrix class that, besides the fact that it can be used as a regular matrix, helps to save memory and increases memory access speed. This last feature can dramatically improve the overall performance of any Python algorithm using it. In addition, its extensibility, testing suites and documentation make it a good choice to those in need of a workbench for developing or testing new algorithms. The source code (under MIT license), installer, test suites and benchmarks can be found at https://pele.bsc.es/ under the tools section. Contact: victor.guallar@bsc.es Supplementary data are available at Bioinformatics online.
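The memory saving of such a symmetric distance matrix class comes from storing only the upper triangle in a flat array. A pure-Python sketch of the idea follows, using the same condensed layout as SciPy's squareform; pyRMSD's actual (C-backed) class is not reproduced here.

    import numpy as np

    class CondensedMatrix:
        # Symmetric n x n matrix with zero diagonal stored as the flat
        # upper triangle, halving memory relative to a full array.
        def __init__(self, n):
            self.n = n
            self.data = np.zeros(n * (n - 1) // 2)

        def _index(self, i, j):
            if i > j:
                i, j = j, i
            return i * self.n - i * (i + 1) // 2 + (j - i - 1)

        def __getitem__(self, ij):
            i, j = ij
            return 0.0 if i == j else self.data[self._index(i, j)]

        def __setitem__(self, ij, value):
            i, j = ij
            self.data[self._index(i, j)] = value

    m = CondensedMatrix(4)
    m[1, 3] = 2.5
    print(m[3, 1], m.data.size)   # 2.5 6  (vs. 16 entries for a full 4x4)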
Analyzing large-scale spiking neural data with HRLAnalysis™
Thibeault, Corey M.; O'Brien, Michael J.; Srinivasa, Narayan
2014-01-01
The additional capabilities provided by high-performance neural simulation environments and modern computing hardware has allowed for the modeling of increasingly larger spiking neural networks. This is important for exploring more anatomically detailed networks but the corresponding accumulation in data can make analyzing the results of these simulations difficult. This is further compounded by the fact that many existing analysis packages were not developed with large spiking data sets in mind. Presented here is a software suite developed to not only process the increased amount of spike-train data in a reasonable amount of time, but also provide a user friendly Python interface. We describe the design considerations, implementation and features of the HRLAnalysis™ suite. In addition, performance benchmarks demonstrating the speedup of this design compared to a published Python implementation are also presented. The result is a high-performance analysis toolkit that is not only usable and readily extensible, but also straightforward to interface with existing Python modules. PMID:24634655
Goodbred, Steven L.; Smith, Stephen B.; Greene, Patricia S.; Rauschenberger, Richard H.; Bartish, Timothy M.
2007-01-01
A nationwide reconnaissance investigation was initiated in 1994 to develop and evaluate a suite of reproductive and endocrine biomarkers for their potential to assess reproductive health and status in teleost (bony) fish. Fish collections were made at 119 sites, representing many regions of the country and land- and water-use settings. Collectively, this report will provide a national and regional benchmark and a basis for evaluating biomarkers of endocrine and reproductive function. Approximately 2,200 common carp (Cyprinus carpio) and 650 largemouth bass (Micropterus salmoides) were collected from 1994 through 1997. The suite of biomarkers used for these studies included: the plasma sex-steroid hormones, 17β-estradiol (E2) and 11-ketotestosterone (11KT); the ratio of E2 to 11KT (E2:11KT); plasma vitellogenin (VTG); and stage of gonadal development. This data report provides fish size, stage and reproductive biomarker data for individual fish, and site and regional summaries of these variables.
Egea, Jose A; Henriques, David; Cokelaer, Thomas; Villaverde, Alejandro F; MacNamara, Aidan; Danciu, Diana-Patricia; Banga, Julio R; Saez-Rodriguez, Julio
2014-05-10
Optimization is the key to solving many problems in computational biology. Global optimization methods, which provide a robust methodology, and metaheuristics in particular have proven to be the most efficient methods for many applications. Despite their utility, there is a limited availability of metaheuristic tools. We present MEIGO, an R and Matlab optimization toolbox (also available in Python via a wrapper of the R version), that implements metaheuristics capable of solving diverse problems arising in systems biology and bioinformatics. The toolbox includes the enhanced scatter search method (eSS) for continuous nonlinear programming (cNLP) and mixed-integer programming (MINLP) problems, and variable neighborhood search (VNS) for Integer Programming (IP) problems. Additionally, the R version includes BayesFit for parameter estimation by Bayesian inference. The eSS and VNS methods can be run on a single-thread or in parallel using a cooperative strategy. The code is supplied under GPLv3 and is available at http://www.iim.csic.es/~gingproc/meigo.html. Documentation and examples are included. The R package has been submitted to BioConductor. We evaluate MEIGO against optimization benchmarks, and illustrate its applicability to a series of case studies in bioinformatics and systems biology where it outperforms other state-of-the-art methods. MEIGO provides a free, open-source platform for optimization that can be applied to multiple domains of systems biology and bioinformatics. It includes efficient state of the art metaheuristics, and its open and modular structure allows the addition of further methods.
2014-01-01
Background Optimization is the key to solving many problems in computational biology. Global optimization methods, which provide a robust methodology, and metaheuristics in particular have proven to be the most efficient methods for many applications. Despite their utility, there is a limited availability of metaheuristic tools. Results We present MEIGO, an R and Matlab optimization toolbox (also available in Python via a wrapper of the R version), that implements metaheuristics capable of solving diverse problems arising in systems biology and bioinformatics. The toolbox includes the enhanced scatter search method (eSS) for continuous nonlinear programming (cNLP) and mixed-integer programming (MINLP) problems, and variable neighborhood search (VNS) for Integer Programming (IP) problems. Additionally, the R version includes BayesFit for parameter estimation by Bayesian inference. The eSS and VNS methods can be run on a single-thread or in parallel using a cooperative strategy. The code is supplied under GPLv3 and is available at http://www.iim.csic.es/~gingproc/meigo.html. Documentation and examples are included. The R package has been submitted to BioConductor. We evaluate MEIGO against optimization benchmarks, and illustrate its applicability to a series of case studies in bioinformatics and systems biology where it outperforms other state-of-the-art methods. Conclusions MEIGO provides a free, open-source platform for optimization that can be applied to multiple domains of systems biology and bioinformatics. It includes efficient state of the art metaheuristics, and its open and modular structure allows the addition of further methods. PMID:24885957
Dockins, James; Abuzahrieh, Ramzi; Stack, Martin
2015-01-01
Objective: To translate and adapt an effective, validated, benchmarked, and widely used patient satisfaction measurement tool for use with an Arabic-speaking population. Design: Translation of the survey's items, development of a survey administration process, evaluation of reliability, and international benchmarking. Setting: A three-hundred-bed tertiary care hospital in Jeddah, Saudi Arabia. Participants: 645 patients discharged during 2011 from the hospital's inpatient care units. Interventions: The Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) instrument was translated into Arabic, a randomized weekly sample of patients was selected, and the survey was administered via telephone during 2011 to patients or their relatives. Scores were compiled for each of the HCAHPS questions and then for each of the six HCAHPS clinical composites, two non-clinical items, and two global items. Results: Clinical composite scores, as well as the two non-clinical and two global items, were analyzed for the 645 respondents. Clinical composites were analyzed using Spearman's correlation coefficient and Cronbach's alpha, and the scales demonstrated acceptable internal consistency for the clinical composites (Spearman's correlation coefficient = 0.327-0.750, P < 0.01; Cronbach's alpha = 0.516-0.851). All ten HCAHPS measures were compared quarterly to US national averages, with results that closely paralleled the US benchmarks. Conclusion: The Arabic translation and adaptation of the HCAHPS is a valid, reliable, and feasible tool for evaluation and benchmarking of inpatient satisfaction in Arabic-speaking populations.
The Scalable Checkpoint/Restart Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moody, A.
The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to write out and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes. It has been benchmarked as high as 720 GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster than the parallel file system.
GLAD: a system for developing and deploying large-scale bioinformatics grid.
Teo, Yong-Meng; Wang, Xianbing; Ng, Yew-Kwong
2005-03-01
Grid computing is used to solve large-scale bioinformatics problems with gigabyte-scale databases by distributing the computation across multiple platforms. Until now, in developing bioinformatics grid applications, it has been extremely tedious to design and implement the component algorithms and parallelization techniques for different classes of problems, and to access remotely located sequence database files of varying formats across the grid. In this study, we propose a grid programming toolkit, GLAD (Grid Life sciences Applications Developer), which facilitates the development and deployment of bioinformatics applications on a grid. GLAD has been developed using ALiCE (Adaptive scaLable Internet-based Computing Engine), a Java-based grid middleware, which exploits task-based parallelism. Two bioinformatics benchmark applications, distributed sequence comparison and distributed progressive multiple sequence alignment, have been developed using GLAD.
Data Acquisition and Linguistic Resources
NASA Astrophysics Data System (ADS)
Strassel, Stephanie; Christianson, Caitlin; McCary, John; Staderman, William; Olive, Joseph
All human language technology demands substantial quantities of data for system training and development, plus stable benchmark data to measure ongoing progress. While creation of high quality linguistic resources is both costly and time consuming, such data has the potential to profoundly impact not just a single evaluation program but language technology research in general. GALE's challenging performance targets demand linguistic data on a scale and complexity never before encountered. Resources cover multiple languages (Arabic, Chinese, and English) and multiple genres -- both structured (newswire and broadcast news) and unstructured (web text, including blogs and newsgroups, and broadcast conversation). These resources include significant volumes of monolingual text and speech, parallel text, and transcribed audio combined with multiple layers of linguistic annotation, ranging from word aligned parallel text and Treebanks to rich semantic annotation.
Parallel 3D-TLM algorithm for simulation of the Earth-ionosphere cavity
NASA Astrophysics Data System (ADS)
Toledo-Redondo, Sergio; Salinas, Alfonso; Morente-Molinera, Juan Antonio; Méndez, Antonio; Fornieles, Jesús; Portí, Jorge; Morente, Juan Antonio
2013-03-01
A parallel 3D algorithm for solving time-domain electromagnetic problems with arbitrary geometries is presented. The technique employed is the Transmission Line Modeling (TLM) method implemented in Shared Memory (SM) environments. The benchmarking performed reveals that the maximum speedup depends on the memory size of the problem as well as multiple hardware factors, like the disposition of CPUs, cache, or memory. A maximum speedup of 15 has been measured for the largest problem. In certain circumstances of low memory requirements, superlinear speedup is achieved using our algorithm. The model is employed to model the Earth-ionosphere cavity, thus enabling a study of the natural electromagnetic phenomena that occur in it. The algorithm allows complete 3D simulations of the cavity with a resolution of 10 km, within a reasonable timescale.
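The quantities reported in such benchmarking are speedup S = T1/Tp and parallel efficiency E = S/p; superlinear speedup corresponds to E > 1, for instance when each core's working set begins to fit in cache. A minimal sketch with illustrative timings (not the paper's measurements):

    def speedup(t_serial, t_parallel):
        return t_serial / t_parallel

    def efficiency(t_serial, t_parallel, n_procs):
        return speedup(t_serial, t_parallel) / n_procs

    # Efficiency above 1.0 flags superlinear speedup, e.g. when the
    # per-core working set starts fitting in cache.
    print(efficiency(t_serial=100.0, t_parallel=5.5, n_procs=16))  # ~1.14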
Long-range interactions and parallel scalability in molecular simulations
NASA Astrophysics Data System (ADS)
Patra, Michael; Hyvönen, Marja T.; Falck, Emma; Sabouri-Ghomi, Mohsen; Vattulainen, Ilpo; Karttunen, Mikko
2007-01-01
Typical biomolecular systems such as cellular membranes, DNA, and protein complexes are highly charged. Thus, efficient and accurate treatment of electrostatic interactions is of great importance in computational modeling of such systems. We have employed the GROMACS simulation package to perform extensive benchmarking of different commonly used electrostatic schemes on a range of computer architectures (Pentium-4, IBM Power 4, and Apple/IBM G5) for single processor and parallel performance up to 8 nodes—we have also tested the scalability on four different networks, namely Infiniband, GigaBit Ethernet, Fast Ethernet, and a non-uniform memory architecture (NUMA), i.e., communication between CPUs is possible by directly reading from or writing to other CPUs' local memory. It turns out that the particle-mesh Ewald method (PME) performs surprisingly well and offers competitive performance unless parallel runs on PC hardware with older network infrastructure are needed. Lipid bilayers of sizes 128, 512 and 2048 lipid molecules were used as the test systems representing typical cases encountered in biomolecular simulations. Our results enable an accurate prediction of computational speed on most current computing systems, both for serial and parallel runs. These results should be helpful in, for example, choosing the most suitable configuration for a small departmental computer cluster.
High-performance computational fluid dynamics: a custom-code approach
NASA Astrophysics Data System (ADS)
Fannon, James; Loiseau, Jean-Christophe; Valluri, Prashant; Bethune, Iain; Náraigh, Lennon Ó.
2016-07-01
We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier-Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computational fluid dynamics (CFD) and parallel computing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallel computing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFD, while also providing insight for those interested in more general aspects of high-performance computing.
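A standard validation case for such a solver is laminar pressure-driven channel flow, whose closed-form Poiseuille profile is u(y) = (G/(2*mu)) * y * (H - y) with G = -dp/dx. The short check below uses unit parameters, which are illustrative rather than the paper's setup.

    import numpy as np

    def poiseuille(y, G, mu, H):
        # Analytic laminar velocity profile for pressure gradient G = -dp/dx
        # in a channel of height H with dynamic viscosity mu.
        return G / (2.0 * mu) * y * (H - y)

    H, G, mu = 1.0, 1.0, 1.0
    y = np.linspace(0.0, H, 11)
    u = poiseuille(y, G, mu, H)
    # Centreline velocity should equal G*H^2/(8*mu).
    print(np.isclose(u.max(), G * H**2 / (8 * mu)))   # True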
GRAMM-X public web server for protein–protein docking
Tovchigrechko, Andrey; Vakser, Ilya A.
2006-01-01
Protein docking software GRAMM-X and its web interface extend the original GRAMM Fast Fourier Transformation methodology by employing smoothed potentials, a refinement stage, and knowledge-based scoring. The web server frees users from the complex installation of database-dependent parallel software and from maintaining the large hardware resources needed for protein docking simulations. Docking problems submitted to the GRAMM-X server are processed by a 320-processor Linux cluster. The server was extensively tested by benchmarking, several months of public use, and participation in the CAPRI server track. PMID:16845016
NASA Technical Reports Server (NTRS)
Mattmann, Chris A.; Kelly, Sean; Crichton, Daniel J.; Hughes, J. Steven; Hardman, Sean; Ramirez, Paul; Joyner, Ron
2006-01-01
In this paper, we present a preliminary study of several different electronic data movement technologies. We detail our approach to classifying the technologies included in our study and present the preliminary results of some initial performance benchmarking. Our studies suggest that highly parallel TCP/IP streaming technologies, such as GridFTP and bbFTP, outperform commercial and open-source UDP-bursting technologies in several of the key data movement dimensions that we studied.
Personal supercomputing by using transputer and Intel 80860 in plasma engineering
NASA Astrophysics Data System (ADS)
Ido, S.; Aoki, K.; Ishine, M.; Kubota, M.
1992-09-01
A transputer (T800) or the 64-bit RISC Intel 80860 (i860) added to a personal computer can be used as an accelerator. When 32-bit T800s in a parallel system or 64-bit i860s are used, scientific calculations are carried out several tens of times faster than on commonly used 32-bit personal computers or UNIX workstations. Benchmark tests and examples of physical simulations using T800s and the i860 are reported.
1996-01-01
Figure 3-1 shows pseudo-code for a test bench with two application nodes. The outer test bench wrapper consists of three functions: pipeline_init, pipeline, and exit_func. The application wrapper is contained in the pipeline routine and similarly consists of [remainder lost in extraction; the surviving fragments are table-of-contents residue listing a Real-Time section, a Conclusion, a List of References, and figures 3-1 (Test Bench Pseudo Code) and 3-2 (Fast Convolution)].
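Only the three wrapper names survive in the fragment above, so the runnable sketch below is a conjectural reconstruction: pipeline_init sets up the stages, pipeline pushes a work item through each application node in turn, and exit_func tears the pipeline down; everything beyond the three names is assumed.

    # Hypothetical reconstruction; only pipeline_init, pipeline and
    # exit_func are attested in the surviving text.
    def pipeline_init(stages):
        # Prepare the application nodes that make up the pipeline.
        print(f"initialising a {len(stages)}-node pipeline")
        return list(stages)

    def pipeline(stages, item):
        # Pass the work item through each application node in order.
        for stage in stages:
            item = stage(item)
        return item

    def exit_func(stages):
        # Tear the pipeline down.
        print("pipeline shut down")

    nodes = pipeline_init([lambda x: x + 1, lambda x: 2 * x])  # two nodes
    print(pipeline(nodes, 1))   # -> 4
    exit_func(nodes)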
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
NASA Astrophysics Data System (ADS)
Simunovic, S.; Zacharia, T.; Baltas, N.; Spalding, D. B.
1995-03-01
PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simunovic, S.; Zacharia, T.; Baltas, N.
1995-04-01
PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using the Message Passing Interface (MPI) standard. Implementation of the MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
Parallelized implicit propagators for the finite-difference Schrödinger equation
NASA Astrophysics Data System (ADS)
Parker, Jonathan; Taylor, K. T.
1995-08-01
We describe the application of block Gauss-Seidel and block Jacobi iterative methods to the design of implicit propagators for finite-difference models of the time-dependent Schrödinger equation. The block-wise iterative methods discussed here are mixed direct-iterative methods for solving simultaneous equations, in the sense that direct methods (e.g. LU decomposition) are used to invert certain block sub-matrices, and iterative methods are used to complete the solution. We describe parallel variants of the basic algorithm that are well suited to the medium- to coarse-grained parallelism of workstation clusters and MIMD supercomputers, and we show that under a wide range of conditions, fine-grained parallelism of the computation can be achieved. Numerical tests are conducted on a typical one-electron atom Hamiltonian. The methods converge robustly to machine precision (15 significant figures), in some cases in as few as 6 or 7 iterations. The rate of convergence is nearly independent of the finite-difference grid-point separations.
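The mixed direct-iterative structure described above, direct LU factorization of the diagonal blocks combined with iteration over the off-diagonal coupling, can be sketched for a generic dense system; this toy version is illustrative rather than the paper's finite-difference propagator, and it converges when the matrix is sufficiently diagonally dominant.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def block_jacobi(A, b, block, tol=1e-12, max_iter=500):
        # Factor each diagonal block once (direct step), then iterate on
        # the off-diagonal coupling (iterative step).
        n = len(b)
        slices = [slice(s, min(s + block, n)) for s in range(0, n, block)]
        lu = [lu_factor(A[s, s]) for s in slices]
        x = np.zeros_like(b)
        for _ in range(max_iter):
            x_new = np.empty_like(x)
            for s, f in zip(slices, lu):
                # b_i - sum over j != i of A_ij x_j, without forming blocks.
                r = b[s] - A[s] @ x + A[s, s] @ x[s]
                x_new[s] = lu_solve(f, r)
            if np.linalg.norm(x_new - x) < tol * np.linalg.norm(b):
                return x_new
            x = x_new
        return x

    rng = np.random.default_rng(0)
    n = 64
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
    b = rng.standard_normal(n)
    x = block_jacobi(A, b, block=8)
    print(np.allclose(A @ x, b))   # True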
The cost of conservative synchronization in parallel discrete event simulations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1990-01-01
The performance of a synchronous conservative parallel discrete-event simulation protocol is analyzed. The class of simulation models considered is oriented around a physical domain and possesses a limited ability to predict future behavior. A stochastic model is used to show that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approach the complexity of the average per-event overhead of a serial simulation. The method is therefore within a constant factor of optimal. The analysis demonstrates that on large problems--those for which parallel processing is ideally suited--there is often enough parallel workload so that processors are not usually idle. The viability of the method is also demonstrated empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed memory multiprocessor.
Short-Term Solar Forecasting Performance of Popular Machine Learning Algorithms: Preprint
DOE Office of Scientific and Technical Information (OSTI.GOV)
Florita, Anthony R; Elgindy, Tarek; Hodge, Brian S
A framework for assessing the performance of short-term solar forecasting is presented in conjunction with a range of numerical results using global horizontal irradiation (GHI) from the open-source Surface Radiation Budget (SURFRAD) data network. A suite of popular machine learning algorithms is compared according to a set of statistically distinct metrics and benchmarked against the persistence-of-cloudiness forecast and a cloud motion forecast. Results show significant improvement compared to the benchmarks, with trade-offs among the machine learning algorithms depending on the desired error metric. Training inputs include time series observations of GHI for a history of years, historical weather and atmospheric measurements, and corresponding date and time stamps such that training sensitivities might be inferred. Prediction outputs are GHI forecasts for 1, 2, 3, and 4 hours ahead of the issue time, and they are made for every month of the year for 7 locations. Photovoltaic power and energy outputs can then be made using the solar forecasts to better understand power system impacts.
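For reference, a sketch of the persistence-of-cloudiness benchmark that the machine learning algorithms are compared against: the current clearness index (measured GHI over clear-sky GHI) is assumed to persist over the forecast horizon. The function and values below are illustrative assumptions, not SURFRAD specifics.

```python
# Persistence-of-cloudiness baseline: persist kt = GHI / GHI_clear forward.
# Inputs and the clear-sky values are placeholders, not data from the paper.
import numpy as np

def persistence_of_cloudiness(ghi_now, ghi_clear_now, ghi_clear_future):
    """Forecast GHI h hours ahead by persisting the current clearness index."""
    kt = np.where(ghi_clear_now > 0,
                  ghi_now / np.maximum(ghi_clear_now, 1e-6), 0.0)
    return kt * ghi_clear_future

# Example: 80% of clear-sky now, applied to the clear-sky value 2 h ahead.
print(persistence_of_cloudiness(400.0, 500.0, 300.0))  # -> 240.0
```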
Automated Instrumentation, Monitoring and Visualization of PVM Programs Using AIMS
NASA Technical Reports Server (NTRS)
Mehra, Pankaj; VanVoorst, Brian; Yan, Jerry; Tucker, Deanne (Technical Monitor)
1994-01-01
We present views and analysis of the execution of several PVM codes for Computational Fluid Dynamics on a network of Sparcstations, including (a) NAS Parallel benchmarks CG and MG (White, Alund and Sunderam 1993); (b) a multi-partitioning algorithm for NAS Parallel Benchmark SP (Wijngaart 1993); and (c) an overset grid flowsolver (Smith 1993). These views and analysis were obtained using our Automated Instrumentation and Monitoring System (AIMS) version 3.0, a toolkit for debugging the performance of PVM programs. We will describe the architecture, operation and application of AIMS. The AIMS toolkit contains (a) Xinstrument, which can automatically instrument various computational and communication constructs in message-passing parallel programs; (b) Monitor, a library of run-time trace-collection routines; (c) VK (Visual Kernel), an execution-animation tool with source-code clickback; and (d) Tally, a tool for statistical analysis of execution profiles. Currently, Xinstrument can handle C and Fortran77 programs using PVM 3.2.x; Monitor has been implemented and tested on Sun 4 systems running SunOS 4.1.2; and VK uses X11R5 and Motif 1.2. Data and views obtained using AIMS clearly illustrate several characteristic features of executing parallel programs on networked workstations: (a) the impact of long message latencies; (b) the impact of multiprogramming overheads and associated load imbalance; (c) cache and virtual-memory effects; and (d) significant skews between workstation clocks. Interestingly, AIMS can compensate for constant skew (zero drift) by calibrating the skew between a parent and its spawned children. In addition, AIMS' skew-compensation algorithm can adjust timestamps in a way that eliminates physically impossible communications (e.g., messages going backwards in time). Our current efforts are directed toward creating new views to explain the observed performance of PVM programs. Some of the features planned for the near future include: (a) ConfigView, showing the physical topology of the virtual machine, inferred using specially formatted IP (Internet Protocol) packets; and (b) LoadView, synchronous animation of PVM-program execution and resource-utilization patterns.
Population annealing with weighted averages: A Monte Carlo method for rough free-energy landscapes
NASA Astrophysics Data System (ADS)
Machta, J.
2010-08-01
The population annealing algorithm introduced by Hukushima and Iba is described. Population annealing combines simulated annealing and Boltzmann weighted differential reproduction within a population of replicas to sample equilibrium states. Population annealing gives direct access to the free energy. It is shown that unbiased measurements of observables can be obtained by weighted averages over many runs with weight factors related to the free-energy estimate from the run. Population annealing is well suited to parallelization and may be a useful alternative to parallel tempering for systems with rough free-energy landscapes such as spin glasses. The method is demonstrated for spin glasses.
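A minimal sketch of the scheme described above, assuming a toy one-dimensional double-well energy and simple Metropolis moves: each run accumulates the log of its resampling normalizations as a free-energy estimate, and observables from independent runs are combined with weights proportional to exp(-beta F).

```python
# Population annealing with weighted averages over runs -- an illustrative
# sketch; the energy landscape, schedule, and move sizes are invented.
import numpy as np

rng = np.random.default_rng(1)
energy = lambda x: (x**2 - 1.0)**2  # double-well "rough" landscape

def pa_run(n_pop=500, betas=np.linspace(0.1, 5.0, 40)):
    x = rng.uniform(-2, 2, n_pop)
    log_norm = 0.0  # accumulates ln Q_k, the free-energy estimator
    for b0, b1 in zip(betas[:-1], betas[1:]):
        w = np.exp(-(b1 - b0) * energy(x))
        log_norm += np.log(w.mean())
        # Boltzmann-weighted differential reproduction (resampling).
        x = rng.choice(x, size=n_pop, p=w / w.sum())
        # One Metropolis sweep at the new inverse temperature.
        prop = x + rng.normal(0, 0.3, n_pop)
        acc = np.log(rng.random(n_pop)) < -b1 * (energy(prop) - energy(x))
        x = np.where(acc, prop, x)
    return energy(x).mean(), log_norm

runs = [pa_run() for _ in range(10)]
obs = np.array([r[0] for r in runs])
logw = np.array([r[1] for r in runs])
w = np.exp(logw - logw.max())      # run weights ~ exp(-beta * F_run)
print((w * obs).sum() / w.sum())   # weighted average of the observable
```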
A method for data handling numerical results in parallel OpenFOAM simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Anton, Alin; Muntean, Sebastian
Parallel computational fluid dynamics simulations produce vast amounts of numerical result data. This paper introduces a method for reducing the size of the data by replaying the interprocessor traffic. The results are recovered only in certain regions of interest configured by the user. A known test case is used for several mesh partitioning scenarios using the OpenFOAM® toolkit [1]. The space savings obtained with classic algorithms remain constant for more than 60 GB of floating point data. Our method is most efficient on large simulation meshes and is much better suited for compressing large scale simulation results than the regular algorithms.
Parallel, stochastic measurement of molecular surface area.
Juba, Derek; Varshney, Amitabh
2008-08-01
Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
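The stochastic estimator can be sketched in a few lines: sample points uniformly on each atomic sphere, count the fraction not buried inside any other atom, and scale by the sphere's area. The coordinates and radii below are invented, and this serial version omits the GPU parallelism and point-based rendering of the paper.

```python
# Monte Carlo molecular surface area -- a serial sketch of the stochastic
# idea; each atom's samples are independent, hence the easy parallelism.
import numpy as np

rng = np.random.default_rng(2)

def mc_surface_area(centers, radii, samples_per_atom=2000):
    area = 0.0
    for i, (c, r) in enumerate(zip(centers, radii)):
        v = rng.normal(size=(samples_per_atom, 3))
        pts = c + r * v / np.linalg.norm(v, axis=1, keepdims=True)
        exposed = np.ones(samples_per_atom, dtype=bool)
        for j, (cj, rj) in enumerate(zip(centers, radii)):
            if j != i:
                exposed &= np.linalg.norm(pts - cj, axis=1) >= rj
        # Progressive: this running sum is already a rough estimate.
        area += 4 * np.pi * r**2 * exposed.mean()
    return area

centers = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])  # two toy "atoms"
radii = np.array([1.0, 1.0])
print(mc_surface_area(centers, radii))
```

Increasing `samples_per_atom` refines the estimate over time, which is the progressive behavior the abstract describes.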
Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rajbhandari, Samyam; Rastello, Fabrice; Kowalski, Karol
The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from an atomic basis to a molecular basis. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems. Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation for the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We establish relationships between available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning and consequent identification of effective choices and a characterization of optimality criteria. This work has resulted in the development of a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
TOUGH3: A new efficient version of the TOUGH suite of multiphase flow and transport simulators
NASA Astrophysics Data System (ADS)
Jung, Yoojin; Pau, George Shu Heng; Finsterle, Stefan; Pollyea, Ryan M.
2017-11-01
The TOUGH suite of nonisothermal multiphase flow and transport simulators has been updated by various developers over many years to address a vast range of challenging subsurface problems. The increasing complexity of the simulated processes as well as the growing size of model domains that need to be handled call for an improvement in the simulator's computational robustness and efficiency. Moreover, modifications have been frequently introduced independently, resulting in multiple versions of TOUGH that (1) led to inconsistencies in feature implementation and usage, (2) made code maintenance and development inefficient, and (3) caused confusion to users and developers. TOUGH3, a new base version of TOUGH, addresses these issues. It consolidates both the serial (TOUGH2 V2.1) and parallel (TOUGH2-MP V2.0) implementations, enabling simulations to be performed on desktop computers and supercomputers using a single code. New PETSc parallel linear solvers are added to the existing serial solvers of TOUGH2 and the Aztec solver used in TOUGH2-MP. The PETSc solvers generally perform better than the Aztec solvers in parallel and the internal TOUGH3 linear solver in serial. TOUGH3 also incorporates many new features, addresses bugs, and improves the flexibility of data handling. Due to the improved capabilities and usability, TOUGH3 is more robust and efficient for solving tough and computationally demanding problems in diverse scientific and practical applications related to subsurface flow modeling.
Parallelization of sequential Gaussian, indicator and direct simulation algorithms
NASA Astrophysics Data System (ADS)
Nunes, Ruben; Almeida, José A.
2010-08-01
Improving the performance and robustness of algorithms on new high-performance parallel computing architectures is a key issue in efficiently performing 2D and 3D studies with large amounts of data. In geostatistics, sequential simulation algorithms are good candidates for parallelization. When compared with other computational applications in geosciences (such as fluid flow simulators), sequential simulation software is not extremely computationally intensive, but parallelization can make it more efficient and creates alternatives for its integration in inverse modelling approaches. This paper describes the implementation and benchmarking of a parallel version of the three classic sequential simulation algorithms: direct sequential simulation (DSS), sequential indicator simulation (SIS) and sequential Gaussian simulation (SGS). For this purpose, the source used was GSLIB, but the entire code was extensively modified to take into account the parallelization approach and was also rewritten in the C programming language. The paper also explains in detail the parallelization strategy and the main modifications. Regarding the integration of secondary information, the DSS algorithm is able to perform simple kriging with local means, kriging with an external drift and collocated cokriging with both local and global correlations. SIS includes a local correction of probabilities. Finally, a brief comparison is presented of simulation results using one, two and four processors. All performance tests were carried out on 2D soil data samples. The source code is completely open source and easy to read. It should be noted that the code is only fully compatible with Microsoft Visual C and should be adapted for other systems/compilers.
A Next-Generation Parallel File System Environment for the OLCF
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillow, David A; Fuller, Douglas; Gunasekaran, Raghul
2012-01-01
When deployed in 2008/2009, the Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) was the world's largest-scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Designed to support the parallel I/O requirements of the Jaguar XT5 system and other smaller-scale platforms at the OLCF, the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider's original design by mid 2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth, from over 240 GB/sec today to 1 TB/sec, and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership scale. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.
NASA Technical Reports Server (NTRS)
Deardorff, Glenn; Djomehri, M. Jahed; Freeman, Ken; Gambrel, Dave; Green, Bryan; Henze, Chris; Hinke, Thomas; Hood, Robert; Kiris, Cetin; Moran, Patrick;
2001-01-01
A series of NASA presentations for the Supercomputing 2001 conference are summarized. The topics include: (1) Mars Surveyor Landing Sites "Collaboratory"; (2) Parallel and Distributed CFD for Unsteady Flows with Moving Overset Grids; (3) IP Multicast for Seamless Support of Remote Science; (4) Consolidated Supercomputing Management Office; (5) Growler: A Component-Based Framework for Distributed/Collaborative Scientific Visualization and Computational Steering; (6) Data Mining on the Information Power Grid (IPG); (7) Debugging on the IPG; (8) DeBakey Heart Assist Device; (9) Unsteady Turbopump for Reusable Launch Vehicle; (10) Exploratory Computing Environments Component Framework; (11) OVERSET Computational Fluid Dynamics Tools; (12) Control and Observation in Distributed Environments; (13) Multi-Level Parallelism Scaling on NASA's Origin 1024 CPU System; (14) Computing, Information, & Communications Technology; (15) NAS Grid Benchmarks; (16) IPG: A Large-Scale Distributed Computing and Data Management System; and (17) ILab: Parameter Study Creation and Submission on the IPG.
OpenMP performance for benchmark 2D shallow water equations using LBM
NASA Astrophysics Data System (ADS)
Sabri, Khairul; Rabbani, Hasbi; Gunawan, Putu Harry
2018-03-01
Shallow water equations, commonly referred to as Saint-Venant equations, are used to model fluid phenomena. These equations can be solved numerically using several methods, such as the Lattice Boltzmann method (LBM), SIMPLE-like methods, the finite difference method, Godunov-type methods, and the finite volume method. In this paper, the shallow water equations are approximated using LBM (known as LABSWE) and simulated in parallel using OpenMP. To evaluate the performance of the parallel algorithm with 2 and 4 threads, ten different grid sizes Lx and Ly are examined. The results show that, on the OpenMP platform, the computational time for solving LABSWE can be decreased. For instance, using a grid size of 1000 × 500, the computation times with 2 and 4 threads are observed to be 93.54 s and 333.243 s, respectively.
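A note on how such timings are usually turned into speedup and parallel efficiency figures; the serial baseline below is a made-up placeholder, not a value from the paper.

```python
# Speedup S = T_serial / T_parallel and efficiency E = S / threads,
# computed from the reported thread timings and a hypothetical serial time.
def speedup(t_serial, t_parallel, n_threads):
    s = t_serial / t_parallel
    return s, s / n_threads  # (speedup, efficiency)

t1 = 180.0  # hypothetical serial time in seconds (placeholder)
for threads, t in [(2, 93.54), (4, 333.243)]:
    s, e = speedup(t1, t, threads)
    print(f"{threads} threads: speedup {s:.2f}, efficiency {e:.2f}")
```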
Laszlo, Sarah; Plaut, David C
2012-03-01
The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between explicit, computational models and physiological data collected during the performance of cognitive tasks, we developed a PDP model of visual word recognition which simulates key results from the ERP reading literature, while simultaneously being able to successfully perform lexical decision, a benchmark task for reading models. Simulations reveal that the model's success depends on the implementation of several neurally plausible features in its architecture which are sufficiently domain-general to be relevant to cognitive modeling more generally. Copyright © 2011 Elsevier Inc. All rights reserved.
A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.
Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming
2017-06-16
Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limited computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the utility of the Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
Porting AMG2013 to Heterogeneous CPU+GPU Nodes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Samfass, Philipp
LLNL's future advanced technology system SIERRA will feature heterogeneous compute nodes that consist of IBM POWER9 CPUs and NVIDIA Volta GPUs. Conceptually, the motivation for such an architecture is quite straightforward: While GPUs are optimized for throughput on massively parallel workloads, CPUs strive to minimize latency for rather sequential operations. Yet, making optimal use of heterogeneous architectures raises new challenges for the development of scalable parallel software, e.g., with respect to work distribution. Porting LLNL's parallel numerical libraries to upcoming heterogeneous CPU+GPU architectures is therefore a critical factor for ensuring LLNL's future success in fulfilling its national mission. One of these libraries, called HYPRE, provides parallel solvers and preconditioners for large, sparse linear systems of equations. In the context of this internship project, I consider AMG2013, which is a proxy application for major parts of HYPRE that implements a benchmark for setting up and solving different systems of linear equations. In the following, I describe in detail how I ported multiple parts of AMG2013 to the GPU (Section 2) and present results for different experiments that demonstrate a successful parallel implementation on the heterogeneous machines surface and ray (Section 3). In Section 4, I give guidelines on how my code should be used. Finally, I conclude and give an outlook for future work (Section 5).
Wilson, Justin; Dai, Manhong; Jakupovic, Elvis; Watson, Stanley; Meng, Fan
2007-01-01
Modern video cards and game consoles typically have much better performance to price ratios than that of general purpose CPUs. The parallel processing capabilities of game hardware are well-suited for high throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.
Seedling establishment in a masting desert shrub parallels the pattern for forest trees
Susan E. Meyer; Burton K. Pendleton
2015-01-01
The masting phenomenon along with its accompanying suite of seedling adaptive traits has been well studied in forest trees but has rarely been examined in desert shrubs. Blackbrush (Coleogyne ramosissima) is a regionally dominant North American desert shrub whose seeds are produced in mast events and scatter-hoarded by rodents. We followed the fate of seedlings in...
NASA Astrophysics Data System (ADS)
Wu, Zu-guang; Tian, Zhan-jun; Liu, Hui; Huang, Rui; Zhu, Guo-hua
2009-07-01
As the only listed telecom operator on the A-share market, China Unicom has attracted many institutional investors in recent years under the 3G concept, which itself carries a strong expectation of technical progress. Do institutional investors, or the expectation of technical progress, have a significant effect on improving a firm's operating efficiency? Reviewing the literature on operating efficiency, we find that scholars have studied this problem using regression analysis based on traditional production functions, data envelopment analysis (DEA), financial index analysis, marginal functions, capital-labor ratio coefficients, and so on; all of these methods rely mainly on macro-level data. In this paper we use company micro-data to evaluate operating efficiency. Using factor analysis based on financial indices and comparing factor scores for the three years from 2005 to 2007, we find that China Unicom's operating efficiency is below the average level of the benchmark corporations and did not improve under the 3G concept from 2005 to 2007. In other words, institutional investors and the expectation of technical progress have had little effect on changes in China Unicom's operating efficiency. Selecting benchmark corporations as a reference point for evaluating operating efficiency is a characteristic of this method, which is simple and direct. The method is also suited to evaluating the operating efficiency of listed agriculture companies, since they likewise face technical progress and marketing concepts such as tax exemption.
Running Neuroimaging Applications on Amazon Web Services: How, When, and at What Cost?
Madhyastha, Tara M.; Koh, Natalie; Day, Trevor K. M.; Hernández-Fernández, Moises; Kelley, Austin; Peterson, Daniel J.; Rajan, Sabreena; Woelfer, Karl A.; Wolf, Jonathan; Grabowski, Thomas J.
2017-01-01
The contribution of this paper is to identify and describe current best practices for using Amazon Web Services (AWS) to execute neuroimaging workflows “in the cloud.” Neuroimaging offers a vast set of techniques by which to interrogate the structure and function of the living brain. However, many of the scientists for whom neuroimaging is an extremely important tool have limited training in parallel computation. At the same time, the field is experiencing a surge in computational demands, driven by a combination of data-sharing efforts, improvements in scanner technology that allow acquisition of images with higher image resolution, and by the desire to use statistical techniques that stress processing requirements. Most neuroimaging workflows can be executed as independent parallel jobs and are therefore excellent candidates for running on AWS, but the overhead of learning to do so and determining whether it is worth the cost can be prohibitive. In this paper we describe how to identify neuroimaging workloads that are appropriate for running on AWS, how to benchmark execution time, and how to estimate cost of running on AWS. By benchmarking common neuroimaging applications, we show that cloud computing can be a viable alternative to on-premises hardware. We present guidelines that neuroimaging labs can use to provide a cluster-on-demand type of service that should be familiar to users, and scripts to estimate cost and create such a cluster. PMID:29163119
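The cost estimate reduces to a back-of-envelope model along the following lines; the instance prices, runtimes, and concurrency below are placeholders, not quoted AWS rates.

```python
# Rough cost model for embarrassingly parallel neuroimaging jobs on cloud
# instances: cost = instance-hours * price. All numbers are hypothetical.
def cloud_cost(n_jobs, hours_per_job, jobs_per_instance, price_per_hour):
    instance_hours = (n_jobs / jobs_per_instance) * hours_per_job
    return instance_hours * price_per_hour

# e.g., 100 subjects, 8 h each, 4 concurrent jobs per instance, $1.00/hour
print(f"${cloud_cost(100, 8.0, 4, 1.00):.2f}")  # -> $200.00
```

Benchmarking `hours_per_job` on a small sample, as the paper recommends, is what makes such an estimate trustworthy before committing a full dataset.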
NASA Astrophysics Data System (ADS)
Helm, Anton; Vieira, Jorge; Silva, Luis; Fonseca, Ricardo
2016-10-01
Laser-driven accelerators have gained increased attention over the past decades. Typical modeling techniques for laser wakefield acceleration (LWFA) are based on particle-in-cell (PIC) simulations. PIC simulations, however, are very computationally expensive due to the disparity of the relevant scales, ranging from the laser wavelength, in the micrometer range, to the acceleration length, currently beyond the ten centimeter range. To minimize the gap between these disparate scales, the ponderomotive guiding center (PGC) algorithm is a promising approach. By describing the evolution of the laser pulse envelope separately, only the scales larger than the plasma wavelength need to be resolved in the PGC algorithm, leading to speedups of several orders of magnitude. Previous work was limited to two dimensions. Here we present the implementation of the 3D version of a PGC solver into the massively parallel, fully relativistic PIC code OSIRIS. We extended the solver to include periodic boundary conditions and parallelization in all spatial dimensions. We present benchmarks for distributed and shared memory parallelization. We also discuss the stability of the PGC solver.
Vectorial finite elements for solving the radiative transfer equation
NASA Astrophysics Data System (ADS)
Badri, M. A.; Jolivet, P.; Rousseau, B.; Le Corre, S.; Digonnet, H.; Favennec, Y.
2018-06-01
The discrete ordinate method coupled with the finite element method is often used for the spatio-angular discretization of the radiative transfer equation. In this paper we attempt to improve upon such a discretization technique. Instead of using standard finite elements, we reformulate the radiative transfer equation using vectorial finite elements. In comparison to standard finite elements, this reformulation yields faster timings for the linear system assemblies, as well as for the solution phase when using scattering media. The proposed vectorial finite element discretization for solving the radiative transfer equation is cross-validated against a benchmark problem available in the literature. In addition, we have used the method of manufactured solutions to verify the order of accuracy for our discretization technique within different absorbing, scattering, and emitting media. For solving large problems of radiation on parallel computers, the vectorial finite element method is parallelized using domain decomposition. The proposed domain decomposition method scales to a large number of processes, and its performance is unaffected by changes in the optical thickness of the medium. Our parallel solver is used to solve a large scale radiative transfer problem of the Kelvin-cell radiation.
Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
2016-11-15
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.
Empirical study of parallel LRU simulation algorithms
NASA Technical Reports Server (NTRS)
Carr, Eric; Nicol, David M.
1994-01-01
This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The other two SIMD algorithms are more complex, but have costs that are independent of the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithms implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
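For orientation, the serial computation that these parallel algorithms reproduce can be sketched as follows: each reference's depth in an LRU stack is the smallest cache size at which it would hit, so a single pass yields hit ratios for every cache size. The trace below is a toy example.

```python
# Serial LRU stack-distance computation (Mattson-style one-pass analysis).
from collections import Counter

def lru_stack_distances(trace):
    stack, dists = [], []
    for ref in trace:
        if ref in stack:
            d = stack.index(ref) + 1  # depth from the top (1-indexed)
            stack.remove(ref)
        else:
            d = float('inf')          # cold miss at any cache size
        dists.append(d)
        stack.insert(0, ref)          # move to most-recently-used position
    return dists

trace = ['a', 'b', 'c', 'a', 'b', 'd', 'a']
hist = Counter(lru_stack_distances(trace))
for size in (1, 2, 3, 4):
    hits = sum(n for d, n in hist.items() if d <= size)
    print(f"cache size {size}: hit ratio {hits / len(trace):.2f}")
```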
Integrating the Apache Big Data Stack with HPC for Big Data
NASA Astrophysics Data System (ADS)
Fox, G. C.; Qiu, J.; Jha, S.
2014-12-01
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not so true for data intensive computing, even though commercial clouds devote much more resources to data analytics than supercomputers devote to simulations. We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures. We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks and use these to identify a few key classes of hardware/software architectures. Our analysis builds on combining HPC with ABDS, the Apache big data software stack that is well used in modern cloud computing. Initial results on clouds and HPC systems are encouraging. We propose the development of SPIDAL - Scalable Parallel Interoperable Data Analytics Library - built on system and data abstractions suggested by the HPC-ABDS architecture. We discuss how it can be used in several application areas including Polar Science.
Towards implementation of cellular automata in Microbial Fuel Cells.
Tsompanas, Michail-Antisthenis I; Adamatzky, Andrew; Sirakoulis, Georgios Ch; Greenman, John; Ieropoulos, Ioannis
2017-01-01
The Microbial Fuel Cell (MFC) is a bio-electrochemical transducer converting waste products into electricity using microbial communities. Cellular Automaton (CA) is a uniform array of finite-state machines that update their states in discrete time depending on states of their closest neighbors by the same rule. Arrays of MFCs could, in principle, act as massive-parallel computing devices with local connectivity between elementary processors. We provide a theoretical design of such a parallel processor by implementing CA in MFCs. We have chosen Conway's Game of Life as the 'benchmark' CA because this is the most popular CA which also exhibits an enormously rich spectrum of patterns. Each cell of the Game of Life CA is realized using two MFCs. The MFCs are linked electrically and hydraulically. The model is verified via simulation of an electrical circuit demonstrating equivalent behaviours. The design is a first step towards future implementations of fully autonomous biological computing devices with massive parallelism. The energy independence of such devices counteracts their somewhat slow transitions (compared to silicon circuitry) between the different states during computation.
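For reference, the Game of Life update rule that each MFC pair realizes can be written compactly; the periodic boundaries used in this sketch are a simplifying assumption.

```python
# One Conway's Game of Life step: a cell is alive next step if it has
# exactly 3 live neighbours, or is alive with exactly 2. Wrap-around
# boundaries via np.roll keep the sketch short.
import numpy as np

def life_step(grid):
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(int)

glider = np.zeros((8, 8), dtype=int)
glider[1, 2] = glider[2, 3] = glider[3, 1] = glider[3, 2] = glider[3, 3] = 1
print(life_step(glider))
```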
NASA Technical Reports Server (NTRS)
Rao, M. V. Subba
1988-01-01
Two prominent rock suites constitute the lithology of the Eastern Ghat mobile belt: (1) the khondalite suite - the metapelites, and (2) the charnockite suite. Later intrusives include ultramafic sequences, anorthosites and granitic gneisses. The chief structural element in the rocks of the Eastern Ghats is a planar fabric (gneissosity), defined by the alignment of platy minerals like flattened quartz, garnet, sillimanite, graphite, etc. The parallelism between the foliation and the lithological layering is related to isoclinal folding. The major structural trend (axial plane foliation trend) observed in the belt is NE-SW. Five major tectonic events have been delineated in the belt. A boundary fault along the western margin of the Eastern Ghats, bordering the low grade terrain, has been substantiated by recent gravity and deep seismic sounding studies. Field evidence shows that the pyroxene granulites (basic granulites) post-date the khondalite suite, but are older than the charnockites as well as the granitic gneisses. Polyphase metamorphism, probably correlatable with different periods of deformation, is recorded. The field relations in the Eastern Ghats point to the intense deformation of the terrain, apparently before, during, and after metamorphism.
Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study
Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...
2015-01-01
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
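The replacement summation can be illustrated with a minimal sketch, with plain Python standing in for Fortran coarrays: partial sums held by p images are combined pairwise in ceil(log2 p) stages.

```python
# Binary-tree reduction of per-image partial sums; each stage halves the
# number of active "images", giving logarithmic depth instead of a serial
# O(p) accumulation.
def tree_sum(partial):
    vals = list(partial)
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] += vals[i + stride]   # image i receives from i+stride
        stride *= 2
    return vals[0]

print(tree_sum([1.0, 2.0, 3.0, 4.0, 5.0]))  # -> 15.0
```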
Probabilistic structural mechanics research for parallel processing computers
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Martin, William R.
1991-01-01
Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods was hampered by their computationally intense nature. Solution of PSM problems requires repeated analyses of structures that are often large, and exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make solution of large scale PSM problems practical.
Introduction to a system for implementing neural net connections on SIMD architectures
NASA Technical Reports Server (NTRS)
Tomboulian, Sherryl
1988-01-01
Neural networks have attracted much interest recently, and using parallel architectures to simulate neural networks is a natural and necessary application. The SIMD model of parallel computation is chosen, because systems of this type can be built with large numbers of processing elements. However, such systems are not naturally suited to generalized communication. A method is proposed that allows an implementation of neural network connections on massively parallel SIMD architectures. The key to this system is an algorithm permitting the formation of arbitrary connections between the neurons. A feature is the ability to add new connections quickly. It also has error recovery ability and is robust over a variety of network topologies. Simulations of the general connection system, and its implementation on the Connection Machine, indicate that the time and space requirements are proportional to the product of the average number of connections per neuron and the diameter of the interconnection network.
Flexibility and Performance of Parallel File Systems
NASA Technical Reports Server (NTRS)
Kotz, David; Nieuwejaar, Nils
1996-01-01
As we gain experience with parallel file systems, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk-management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make applications difficult to port. We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (API's). We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek
2016-10-06
We profile and optimize calculations performed with the BerkeleyGW code on the Xeon-Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights-Landing including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.
The EB Factory: Fundamental Stellar Astrophysics with Eclipsing Binary Stars Discovered by Kepler
NASA Astrophysics Data System (ADS)
Stassun, Keivan
Eclipsing binaries (EBs) are key laboratories for determining the fundamental properties of stars. EBs are therefore foundational objects for constraining stellar evolution models, which in turn are central to determinations of stellar mass functions, of exoplanet properties, and many other areas. The primary goal of this proposal is to mine the Kepler mission light curves for: (1) EBs that include a subgiant star, from which precise ages can be derived and which can thus serve as critically needed age benchmarks; and within these, (2) long-period EBs that include low-mass M stars or brown dwarfs, which are increasingly becoming the focus of exoplanet searches, but for which there are the fewest available fundamental mass-radius-age benchmarks. A secondary goal of this proposal is to develop an end-to-end computational pipeline -- the Kepler EB Factory -- that allows automatic processing of Kepler light curves for EBs, from period finding, to object classification, to determination of EB physical properties for the most scientifically interesting EBs, and finally to accurate modeling of these EBs for detailed tests and benchmarking of theoretical stellar evolution models. We will integrate the most successful algorithms into a single, cohesive workflow environment, and apply this 'Kepler EB Factory' to the full public Kepler dataset to find and characterize new "benchmark grade" EBs, and will disseminate both the enhanced data products from this pipeline and the pipeline itself to the broader NASA science community. The proposed work responds directly to two of the defined Research Areas of the NASA Astrophysics Data Analysis Program (ADAP), specifically Research Area #2 (Stellar Astrophysics) and Research Area #9 (Astrophysical Databases). To be clear, our primary goal is the fundamental stellar astrophysics that will be enabled by the discovery and analysis of relatively rare, benchmark-grade EBs in the Kepler dataset. At the same time, enabling this goal will require bringing a suite of extant and new custom algorithms to bear on the Kepler data, and thus our development of the Kepler EB Factory represents a value-added product that will allow the widest scientific impact of the information locked within the vast reservoir of the Kepler light curves.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
Monte Carlo particle transport methods are being considered as a viable option for high-fidelity simulation of nuclear reactors. While Monte Carlo methods offer several potential advantages over deterministic methods, there are a number of algorithmic shortcomings that would prevent their immediate adoption for full-core analyses. In this thesis, algorithms are proposed both to ameliorate the degradation in parallel efficiency typically observed for large numbers of processors and to offer a means of decomposing large tally data that will be needed for reactor analysis. A nearest-neighbor fission bank algorithm was proposed and subsequently implemented in the OpenMC Monte Carlo code. A theoretical analysis of the communication pattern shows that the expected cost is O(√N), whereas traditional fission bank algorithms are O(N) at best. The algorithm was tested on two supercomputers, the Intrepid Blue Gene/P and the Titan Cray XK7, and demonstrated nearly linear parallel scaling up to 163,840 processor cores on a full-core benchmark problem. An algorithm for reducing network communication arising from tally reduction was analyzed and implemented in OpenMC. The proposed algorithm groups only particle histories on a single processor into batches for tally purposes---in doing so it prevents all network communication for tallies until the very end of the simulation. The algorithm was tested, again on a full-core benchmark, and shown to reduce network communication substantially. A model was developed to predict the impact of load imbalances on the performance of domain decomposed simulations. The analysis demonstrated that load imbalances in domain decomposed simulations arise from two distinct phenomena: non-uniform particle densities and non-uniform spatial leakage. The dominant performance penalty for domain decomposition was shown to come from these physical effects rather than insufficient network bandwidth or high latency. The model predictions were verified with measured data from simulations in OpenMC on a full-core benchmark problem. Finally, a novel algorithm for decomposing large tally data was proposed, analyzed, and implemented/tested in OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers. The analysis showed that for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead. Tests were performed on Intrepid and Titan and demonstrated that the algorithm did indeed perform well over a wide range of parameters. (Copies available exclusively from MIT Libraries, libraries.mit.edu/docs - docs@mit.edu)
Approaches in highly parameterized inversion - GENIE, a general model-independent TCP/IP run manager
Muffels, Christopher T.; Schreuder, Willem A.; Doherty, John E.; Karanovic, Marinko; Tonkin, Matthew J.; Hunt, Randall J.; Welter, David E.
2012-01-01
GENIE is a model-independent suite of programs that can be used to generally distribute, manage, and execute multiple model runs via the TCP/IP infrastructure. The suite consists of a file distribution interface, a run manager, a run executer, and a routine that can be compiled as part of a program and used to exchange model runs with the run manager. Because communication is via a standard protocol (TCP/IP), any computer connected to the Internet can serve in any of the capacities offered by this suite. Model independence is consistent with the existing template and instruction file protocols of the widely used PEST parameter estimation program. This report describes (1) the problem addressed; (2) the approach used by GENIE to queue, distribute, and retrieve model runs; and (3) user instructions, classes, and functions developed. It also includes (4) an example to illustrate the linking of GENIE with Parallel PEST using the interface routine.
De Biase, Pablo M; Markosyan, Suren; Noskov, Sergei
2015-02-05
The transport of ions and solutes by biological pores is central for cellular processes and has a variety of applications in modern biotechnology. The time scale involved in the polymer transport across a nanopore is beyond the accessibility of conventional MD simulations. Moreover, experimental studies lack sufficient resolution to provide details on the molecular underpinning of the transport mechanisms. BROMOC, the code presented herein, performs Brownian dynamics simulations, both serial and parallel, up to several milliseconds long. BROMOC can be used to model large biological systems. IMC-MACRO software allows for the development of effective potentials for solute-ion interactions based on radial distribution function from all-atom MD. BROMOC Suite also provides a versatile set of tools to do a wide variety of preprocessing and postsimulation analysis. We illustrate a potential application with ion and ssDNA transport in MspA nanopore. © 2014 Wiley Periodicals, Inc.
The Tera Multithreaded Architecture and Unstructured Meshes
NASA Technical Reports Server (NTRS)
Bokhari, Shahid H.; Mavriplis, Dimitri J.
1998-01-01
The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.
Efficiency of parallel direct optimization
NASA Technical Reports Server (NTRS)
Janies, D. A.; Wheeler, W. C.
2001-01-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.
Multitasking for flows about multiple body configurations using the chimera grid scheme
NASA Technical Reports Server (NTRS)
Dougherty, F. C.; Morgan, R. L.
1987-01-01
The multitasking of a finite-difference scheme using multiple overset meshes is described. In this chimera, or multiple overset mesh approach, a multiple body configuration is mapped using a major grid about the main component of the configuration, with minor overset meshes used to map each additional component. This type of code is well suited to multitasking. Both steady and unsteady two dimensional computations are run on parallel processors on a CRAY-X/MP 48, usually with one mesh per processor. Flow field results are compared with single processor results to demonstrate the feasibility of running multiple mesh codes on parallel processors and to show the increase in efficiency.
An efficient three-dimensional Poisson solver for SIMD high-performance-computing architectures
NASA Technical Reports Server (NTRS)
Cohl, H.
1994-01-01
We present an algorithm that solves the three-dimensional Poisson equation on a cylindrical grid. The technique uses a finite-difference scheme with operator splitting. This splitting maps the banded structure of the operator matrix into a two-dimensional set of tridiagonal matrices, which are then solved in parallel. Our algorithm couples FFT techniques with the well-known ADI (Alternating Direction Implicit) method for solving Elliptic PDE's, and the implementation is extremely well suited for a massively parallel environment like the SIMD architecture of the MasPar MP-1. Due to the highly recursive nature of our problem, we believe that our method is highly efficient, as it avoids excessive interprocessor communication.
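At the core of such splitting schemes is a tridiagonal solve applied independently to each decoupled system; a generic serial Thomas-algorithm kernel is sketched below. This is an illustration under stated assumptions (a 1D Poisson model problem with zero Dirichlet boundaries), not the MasPar implementation.

```python
# Thomas algorithm: direct solve of a tridiagonal system in O(n).
# After FFT decoupling, many such independent systems arise, one per mode,
# which is what the splitting exposes for parallel solution.
import numpy as np

def thomas(a, b, c, d):
    """a = sub-, b = main-, c = super-diagonal; d = right-hand side."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 1D Poisson model problem: u'' = f with the standard three-point stencil.
n, h = 8, 1.0
a = np.full(n, 1.0); b = np.full(n, -2.0); c = np.full(n, 1.0)
f = np.ones(n) * h**2
print(thomas(a, b, c, f))
```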
[QUIPS: quality improvement in postoperative pain management].
Meissner, Winfried
2011-01-01
Despite the availability of high-quality guidelines and advanced pain management techniques acute postoperative pain management is still far from being satisfactory. The QUIPS (Quality Improvement in Postoperative Pain Management) project aims to improve treatment quality by means of standardised data acquisition, analysis of quality and process indicators, and feedback and benchmarking. During a pilot phase funded by the German Ministry of Health (BMG), a total of 12,389 data sets were collected from six participating hospitals. Outcome improved in four of the six hospitals. Process indicators, such as routine pain documentation, were only poorly correlated with outcomes. To date, more than 130 German hospitals use QUIPS as a routine quality management tool. An EC-funded parallel project disseminates the concept internationally. QUIPS demonstrates that patient-reported outcomes in postoperative pain management can be benchmarked in routine clinical practice. Quality improvement initiatives should use outcome instead of structural and process parameters. The concept is transferable to other fields of medicine. Copyright © 2011. Published by Elsevier GmbH.
Hierarchical Artificial Bee Colony Algorithm for RFID Network Planning Optimization
Ma, Lianbo; Chen, Hanning; Hu, Kunyuan; Zhu, Yunlong
2014-01-01
This paper presents a novel optimization algorithm, namely, hierarchical artificial bee colony optimization, called HABC, to tackle the radio frequency identification network planning (RNP) problem. In the proposed multilevel model, the higher-level species can be aggregated by the subpopulations from lower level. In the bottom level, each subpopulation employing the canonical ABC method searches the part-dimensional optimum in parallel, which can be constructed into a complete solution for the upper level. At the same time, the comprehensive learning method with crossover and mutation operators is applied to enhance the global search ability between species. Experiments are conducted on a set of 10 benchmark optimization problems. The results demonstrate that the proposed HABC obtains remarkable performance on most chosen benchmark functions when compared to several successful swarm intelligence and evolutionary algorithms. Then HABC is used for solving the real-world RNP problem on two instances with different scales. Simulation results show that the proposed algorithm is superior for solving RNP, in terms of optimization accuracy and computation robustness. PMID:24592200
Two-fluid dusty shocks: simple benchmarking problems and applications to protoplanetary discs
NASA Astrophysics Data System (ADS)
Lehmann, Andrew; Wardle, Mark
2018-05-01
The key role that dust plays in the interstellar medium has motivated the development of numerical codes designed to study the coupled evolution of dust and gas in systems such as turbulent molecular clouds and protoplanetary discs. Drift between dust and gas has proven to be important as well as numerically challenging. We provide simple benchmarking problems for dusty gas codes by numerically solving the two-fluid dust-gas equations for steady, plane-parallel shock waves. The two distinct shock solutions to these equations allow a numerical code to test different forms of drag between the two fluids, the strength of that drag and the dust-to-gas ratio. We also provide an astrophysical application of J-type dust-gas shocks to studying the structure of accretion shocks on to protoplanetary discs. We find that two-fluid effects are most important for grains larger than 1 μm, and that the peak dust temperature within an accretion shock provides a signature of the dust-to-gas ratio of the infalling material.
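As a rough illustration of such steady two-fluid equations, the sketch below integrates an isothermal gas plus pressureless dust system with a constant drag coefficient K; the closure, normalization, upstream state, and all parameter values are assumptions for illustration, not the paper's exact formulation. With constant mass fluxes Fg = rho_g*v_g and Fd = rho_d*v_d, mass and momentum conservation reduce to two ODEs in the velocities.

import numpy as np
from scipy.integrate import solve_ivp

cs, K = 1.0, 5.0             # sound speed and drag coefficient (assumed units)
Fg, Fd = 2.0, 0.02           # constant gas and dust mass fluxes

def rhs(x, y):
    vg, vd = y
    drag = K * (vg - vd) / (vg * vd)
    dvg = -Fd * drag / (1.0 - cs**2 / vg**2)   # gas momentum, with pressure
    dvd = Fg * drag                            # pressureless dust momentum
    return [dvg, dvd]

# start just downstream of a gas subshock: gas decelerated, dust still fast
sol = solve_ivp(rhs, [0.0, 40.0], [0.7, 2.0], rtol=1e-8)
print(sol.y[:, -1])          # both fluids relax toward a common drift-free velocity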
NASA Astrophysics Data System (ADS)
Rocha, João Vicente; Camerini, Cesar; Pereira, Gabriela
2016-02-01
The 2015 World Federation of NDE Centers (WFNDEC) eddy current benchmark problem involves the inspection of two EDM notches placed at the edge of a conducting plate with a pancake coil that runs parallel to the plate's edge line. Experimental data consist of the impedance variation measured with a precision LCR bridge as an XY scanner moves the coil. The authors are pleased to present the numerical results obtained with commercial FEM packages (OPERA 3-D). Values of the electrical resistance and inductive reactance variation between the base material and the region around the notch are plotted as a function of the coil displacement over the plate. The calculations were made for frequencies of 1 kHz and 10 kHz, and agreement between experimental and numerical results is excellent for all inspection conditions. Explanations are given of how the impedance is calculated, as well as the pros and cons of the presented methods.
Verification of ARES transport code system with TAKEDA benchmarks
NASA Astrophysics Data System (ADS)
Zhang, Liang; Zhang, Bin; Zhang, Penghe; Chen, Mengteng; Zhao, Jingchang; Zhang, Shun; Chen, Yixue
2015-10-01
Neutron transport modeling and simulation are central to many areas of nuclear technology, including reactor core analysis, radiation shielding and radiation detection. In this paper the series of TAKEDA benchmarks is modeled to verify the criticality calculation capability of ARES, a discrete ordinates neutral particle transport code system. The SALOME platform is coupled with ARES to provide geometry modeling and mesh generation functions. The Koch-Baker-Alcouffe parallel sweep algorithm is applied to accelerate the traditional transport calculation process. The results show that the eigenvalues calculated by ARES are in excellent agreement with the reference values presented in NEACRP-L-330, with differences of less than 30 pcm except for the first case of model 3. Additionally, ARES provides accurate flux distributions compared to reference values, with deviations of less than 2% for region-averaged fluxes in all cases. All of these results confirm the feasibility of the ARES-SALOME coupling and demonstrate that ARES performs well in criticality calculations.
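For reference, eigenvalue discrepancies of this kind are usually quoted in pcm (per cent mille, 1e-5); a minimal helper, with one common sign and normalization convention assumed, is:

def k_eff_diff_pcm(k, k_ref):
    # one common convention: difference in k quoted in units of 1e-5
    return (k - k_ref) * 1e5

def flux_deviation_percent(phi, phi_ref):
    # relative deviation of a region-averaged flux, in percent
    return abs(phi - phi_ref) / phi_ref * 100.0

print(k_eff_diff_pcm(0.97730, 0.97760))   # -30.0, i.e. at the reported bound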
Scaling of Multimillion-Atom Biological Molecular Dynamics Simulation on a Petascale Supercomputer.
Schulz, Roland; Lindner, Benjamin; Petridis, Loukas; Smith, Jeremy C
2009-10-13
A strategy is described for a fast all-atom molecular dynamics simulation of multimillion-atom biological systems on massively parallel supercomputers. The strategy is developed using benchmark systems of particular interest to bioenergy research, comprising models of cellulose and lignocellulosic biomass in an aqueous solution. The approach involves using the reaction field (RF) method for the computation of long-range electrostatic interactions, which permits efficient scaling on many thousands of cores. Although the range of applicability of the RF method for biomolecular systems remains to be demonstrated, for the benchmark systems the use of the RF produces molecular dipole moments, Kirkwood G factors, other structural properties, and mean-square fluctuations in excellent agreement with those obtained with the commonly used Particle Mesh Ewald method. With RF, three million- and five million-atom biological systems scale well up to ∼30k cores, producing ∼30 ns/day. Atomistic simulations of very large systems for time scales approaching the microsecond would, therefore, appear now to be within reach.
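The reaction-field method referred to above replaces explicit long-range electrostatics with a pairwise correction inside a cutoff. A common functional form (the Tironi-style RF used by several MD packages) is sketched below; the cutoff, dielectric, charges, and GROMACS-style unit constant are assumed illustration values.

import numpy as np

def rf_pair_energy(qi, qj, r, r_c=1.2, eps_rf=78.0, f=138.935):
    # U(r) = f*qi*qj*(1/r + k_rf*r^2 - c_rf) for r <= r_c, zero beyond the cutoff;
    # f is the Coulomb constant in kJ mol^-1 nm e^-2 (GROMACS-style units, assumed)
    k_rf = (eps_rf - 1.0) / (2.0 * eps_rf + 1.0) / r_c**3
    c_rf = 1.0 / r_c + k_rf * r_c**2
    return np.where(r <= r_c, f * qi * qj * (1.0 / r + k_rf * r**2 - c_rf), 0.0)

print(rf_pair_energy(0.4, -0.4, np.array([0.3, 1.0, 1.5])))

Because each pair interaction depends only on particles within the cutoff, no global FFT is needed, which is the property that lets the method scale to many thousands of cores.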
NASA Astrophysics Data System (ADS)
Shen, Yanfeng; Cesnik, Carlos E. S.
2016-04-01
This paper presents a parallelized modeling technique for the efficient simulation of nonlinear ultrasonics introduced by the wave interaction with fatigue cracks. The elastodynamic wave equations with contact effects are formulated using an explicit Local Interaction Simulation Approach (LISA). The LISA formulation is extended to capture the contact-impact phenomena during the wave-damage interaction based on the penalty method. A Coulomb friction model is integrated into the computation procedure to capture the stick-slip contact shear motion. The LISA procedure is coded using the Compute Unified Device Architecture (CUDA), which enables highly parallelized supercomputing on powerful graphics cards. Both the explicit contact formulation and the parallel implementation give LISA superb computational efficiency over the conventional finite element method (FEM). The theoretical formulation based on the penalty method is introduced and a guideline for the proper choice of the contact stiffness is given. The convergence behavior of the solution under various contact stiffness values is examined. A numerical benchmark problem is used to investigate the new LISA formulation and results are compared with a conventional contact finite element solution. Various nonlinear ultrasonic phenomena are successfully captured using this contact LISA formulation, including the generation of nonlinear higher harmonic responses. Nonlinear mode conversion of guided waves at fatigue cracks is also studied.
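The contact treatment described, a penalty normal force plus Coulomb stick-slip shear, reduces to a few lines per node pair. The sketch below is a generic penalty/friction update with assumed stiffnesses and a shear trial force proportional to relative tangential displacement; it illustrates the technique, not the LISA implementation.

def contact_force(gap, u_tan, k_n=1e12, mu=0.3, k_t=1e10):
    # gap > 0: crack faces open (no force); gap < 0: penetration is penalized
    f_n = max(0.0, -k_n * gap)           # penalty normal force
    f_t = -k_t * u_tan                   # trial (stick) shear force
    f_max = mu * f_n                     # Coulomb limit
    if abs(f_t) > f_max:                 # slip: cap the shear at the friction cone
        f_t = f_max if f_t > 0 else -f_max
    return f_n, f_t

print(contact_force(-1e-9, 2e-7))        # closed and slipping: (1000.0, -300.0)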
StrAuto: automation and parallelization of STRUCTURE analysis.
Chhatre, Vikram E; Emerson, Kevin J
2017-03-24
Population structure inference using the software STRUCTURE has become an integral part of population genetic studies covering a broad spectrum of taxa, including humans. The ever-expanding size of genetic data sets poses computational challenges for this analysis. Although at least one tool currently implements parallel computing to reduce the computational overload of this analysis, it does not fully automate the use of replicate STRUCTURE runs required for downstream inference of the optimal K. There is a pressing need for a tool that can deploy population structure analysis on high performance computing clusters. We present an updated version of the popular Python program StrAuto to streamline population structure analysis using parallel computing. StrAuto implements a pipeline that combines STRUCTURE analysis with the Evanno ΔK analysis and visualization of results using STRUCTURE HARVESTER. Using benchmarking tests, we demonstrate that StrAuto significantly reduces the computational time needed to perform iterative STRUCTURE analysis by distributing runs over two or more processors. StrAuto is the first tool to integrate STRUCTURE analysis with post-processing using a pipeline approach, in addition to implementing parallel computation - a setup ideal for deployment on computing clusters. StrAuto is distributed under the GNU GPL (General Public License) and is available to download from http://strauto.popgen.org.
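The core of such a pipeline, farming replicate STRUCTURE runs for each K over a process pool, can be sketched as below. The command-line flags, file names, and job grid here are illustrative assumptions, not StrAuto's actual invocation.

import itertools, subprocess
from multiprocessing import Pool

def run_structure(args):
    k, rep = args
    out = f"results/K{k}_rep{rep}"        # hypothetical output prefix
    cmd = ["structure", "-K", str(k), "-m", "mainparams",
           "-e", "extraparams", "-o", out]    # flags assumed for illustration
    subprocess.run(cmd, check=True)
    return out

if __name__ == "__main__":
    jobs = list(itertools.product(range(1, 11), range(1, 21)))  # K=1..10, 20 reps
    with Pool(processes=8) as pool:       # distribute independent runs over cores
        for done in pool.imap_unordered(run_structure, jobs):
            print("finished", done)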
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as the heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. The Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on a system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular Java simulator. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
A Fast Visible-Infrared Imaging Radiometer Suite Simulator for Cloudy Atmospheres
NASA Technical Reports Server (NTRS)
Liu, Chao; Yang, Ping; Nasiri, Shaima L.; Platnick, Steven; Meyer, Kerry G.; Wang, Chen Xi; Ding, Shouguo
2015-01-01
A fast instrument simulator is developed to simulate the observations made in cloudy atmospheres by the Visible Infrared Imaging Radiometer Suite (VIIRS). The correlated k-distribution (CKD) technique is used to compute the transmissivity of absorbing atmospheric gases. The bulk scattering properties of ice clouds used in this study are based on the ice model used for the MODIS Collection 6 ice cloud products. Two fast radiative transfer models based on pre-computed ice cloud look-up tables are used for the VIIRS solar and infrared channels. The accuracy and efficiency of the fast simulator are quantified in comparison with a combination of the rigorous line-by-line (LBLRTM) and discrete ordinate radiative transfer (DISORT) models. Relative errors are less than 2% for the simulated TOA reflectances for the solar channels, and the brightness temperature differences for the infrared channels are less than 0.2 K. The simulator is over three orders of magnitude faster than the benchmark LBLRTM+DISORT model. Furthermore, the cloudy atmosphere reflectances and brightness temperatures from the fast VIIRS simulator compare favorably with those from VIIRS observations.
Autonomous onboard optical processor for driving aid
NASA Astrophysics Data System (ADS)
Attia, Mondher; Servel, Alain; Guibert, Laurent
1995-01-01
We take advantage of recent technological advances in the field of ferroelectric liquid crystal silicon-backplane optoelectronic devices. These are well suited to performing massively parallel processing tasks. This choice enables the design of low-cost vision systems and allows the implementation of an on-board system. We focus on transport applications such as road sign recognition. Preliminary in-car experimental results are presented.
PETSc Users Manual Revision 3.7
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, Satish; Abhyankar, S.; Adams, M.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
PETSc Users Manual Revision 3.8
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, S.; Abhyankar, S.; Adams, M.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
Benchmarking GPU and CPU codes for Heisenberg spin glass over-relaxation
NASA Astrophysics Data System (ADS)
Bernaschi, M.; Parisi, G.; Parisi, L.
2011-06-01
We present a set of possible implementations for Graphics Processing Units (GPU) of the Over-relaxation technique applied to the 3D Heisenberg spin glass model. The results show that a carefully tuned code can achieve more than 100 GFlops/s of sustained performance and update a single spin in about 0.6 nanoseconds. A multi-hit technique that exploits the GPU shared memory further reduces this time. Such results are compared with those obtained by means of a highly-tuned vector-parallel code on latest generation multi-core CPUs.
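For context, the over-relaxation move for a classical Heisenberg spin is an energy-preserving reflection of the spin about its local molecular field; a minimal NumPy version for a single spin, with the local field h assumed given, is:

import numpy as np

def overrelax(s, h):
    # reflect spin s about the local field h: s' = 2 (s.h) h / |h|^2 - s
    # leaves the energy term -s.h unchanged, so the move is microcanonical
    return 2.0 * np.dot(s, h) * h / np.dot(h, h) - s

s = np.array([0.0, 0.0, 1.0])
h = np.array([1.0, 0.0, 1.0])
s_new = overrelax(s, h)
print(s_new, np.dot(s, h), np.dot(s_new, h))   # the energy term is preserved

Because the update is deterministic and independent for the two sublattices of a checkerboard decomposition, it maps naturally onto the massively data-parallel GPU execution model the paper benchmarks.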
Performance of a carbon nanotube field emission electron gun
NASA Astrophysics Data System (ADS)
Getty, Stephanie A.; King, Todd T.; Bis, Rachael A.; Jones, Hollis H.; Herrero, Federico; Lynch, Bernard A.; Roman, Patrick; Mahaffy, Paul
2007-04-01
A cold cathode field emission electron gun (e-gun) based on a patterned carbon nanotube (CNT) film has been fabricated for use in a miniaturized reflectron time-of-flight mass spectrometer (RTOF MS), with future applications in other charged particle spectrometers, and performance of the CNT e-gun has been evaluated. A thermionic electron gun has also been fabricated and evaluated in parallel and its performance is used as a benchmark in the evaluation of our CNT e-gun. Implications for future improvements and integration into the RTOF MS are discussed.
The Automated Instrumentation and Monitoring System (AIMS): Design and Architecture. 3.2
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Schmidt, Melisa; Schulbach, Cathy; Bailey, David (Technical Monitor)
1997-01-01
Whether a researcher is designing the 'next parallel programming paradigm', another 'scalable multiprocessor' or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. Careful analysis of such information can help computer and software architects to capture, and therefore exploit, behavioral variations among/within various parallel programs to take advantage of specific hardware characteristics. A software tool-set that facilitates performance evaluation of parallel applications on multiprocessors has been put together at NASA Ames Research Center under the sponsorship of NASA's High Performance Computing and Communications Program over the past five years. The Automated Instrumentation and Monitoring System (AIMS) has three major software components: a source code instrumentor, which automatically inserts active event recorders into program source code before compilation; a run-time performance monitoring library, which collects performance data; and a visualization tool-set, which reconstructs program execution based on the data collected. Besides being used as a prototype for developing new techniques for instrumenting, monitoring and presenting parallel program execution, AIMS is also being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Currently, the execution of FORTRAN and C programs on the Intel Paragon and PALM workstations can be automatically instrumented and monitored. Performance data thus collected can be displayed graphically on various workstations. The process of performance tuning with AIMS is illustrated using various NAS Parallel Benchmarks. This report includes a description of the internal architecture of AIMS and a listing of the source code.
NASA Astrophysics Data System (ADS)
Christ, John A.; Goltz, Mark N.
2004-01-01
Pump-and-treat systems that are installed to contain contaminated groundwater migration typically involve placement of extraction wells perpendicular to the regional groundwater flow direction at the down gradient edge of a contaminant plume. These wells capture contaminated water for above ground treatment and disposal, thereby preventing further migration of contaminated water down gradient. In this work, examining two-, three-, and four-well systems, we compare well configurations that are parallel and perpendicular to the regional groundwater flow direction. We show that orienting extraction wells co-linearly, parallel to regional flow, results in (1) a larger area of aquifer influenced by the wells at a given total well flow rate, (2) a center and ultimate capture zone width equal to those of the perpendicular configuration, and (3) more flexibility with regard to minimizing drawdown. Although not suited for some scenarios, we found that orienting extraction wells parallel to regional flow along a plume centerline, when compared to a perpendicular configuration, reduces drawdown by up to 7% and minimizes the fraction of uncontaminated water captured.
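The classical single-well capture-zone relations behind such comparisons are compact enough to state in code. The sketch below uses the standard uniform-flow results (stagnation distance Q/(2*pi*B*U), ultimate capture width Q/(B*U)); the parameter values are assumed for illustration.

from math import pi

def capture_zone(Q, U, B):
    # Q: well pumping rate [m^3/d], U: regional Darcy flux [m/d], B: aquifer thickness [m]
    x_stag = Q / (2.0 * pi * B * U)   # stagnation point downgradient of the well
    w_ult = Q / (B * U)               # ultimate (far upgradient) capture width
    return x_stag, w_ult

# the ultimate width scales with total pumping, which is why a co-linear
# (parallel-to-flow) line of wells can match the ultimate capture width of a
# perpendicular gallery operating at the same total rate
print(capture_zone(Q=500.0, U=0.1, B=20.0))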
Accelerating atomistic calculations of quantum energy eigenstates on graphic cards
NASA Astrophysics Data System (ADS)
Rodrigues, Walter; Pecchia, A.; Lopez, M.; Auf der Maur, M.; Di Carlo, A.
2014-10-01
Electronic properties of nanoscale materials require the calculation of eigenvalues and eigenvectors of large matrices. This bottleneck can be overcome by parallel computing techniques or the introduction of faster algorithms. In this paper we report a custom implementation of the Lanczos algorithm with simple restart, optimized for graphical processing units (GPUs). The whole algorithm has been developed using CUDA and runs entirely on the GPU, with a specialized implementation that spares memory and minimizes host-to-device data transfers. Furthermore, parallel distribution over several GPUs has been attained using the standard Message Passing Interface (MPI). Benchmark calculations performed on a GaN/AlGaN wurtzite quantum dot with up to 600,000 atoms are presented. The empirical tight-binding (ETB) model with an sp3d5s∗+spin-orbit parametrization has been used to build the system Hamiltonian (H).
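A serial NumPy sketch of the Lanczos recurrence with a simple (explicit) restart conveys the algorithm the paper accelerates on GPUs. Full reorthogonalization and the ETB Hamiltonian itself are omitted, and the fixed restart count and test matrix are assumptions.

import numpy as np

def lanczos_lowest(A, m=60, restarts=5, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0]); v /= np.linalg.norm(v)
    for _ in range(restarts):
        V = [v]; alpha, beta = [], []
        w = A @ v
        for j in range(m):                       # three-term Lanczos recurrence
            a = v @ w; alpha.append(a)
            w = w - a * v - (beta[-1] * V[-2] if beta else 0.0)
            b = np.linalg.norm(w); beta.append(b)
            if b < 1e-12: break
            v = w / b; V.append(v); w = A @ v
        T = (np.diag(alpha) + np.diag(beta[:len(alpha)-1], 1)
             + np.diag(beta[:len(alpha)-1], -1))  # small tridiagonal projection
        evals, evecs = np.linalg.eigh(T)
        v = np.array(V[:len(alpha)]).T @ evecs[:, 0]   # Ritz vector seeds restart
        v /= np.linalg.norm(v)
    return evals[0], v

A = np.diag(np.arange(1.0, 201.0)); A[0, 1] = A[1, 0] = 0.5
print(lanczos_lowest(A)[0])

The dominant cost per step is the matrix-vector product A @ v, which is the operation the paper offloads to the GPU.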
Injector Design Tool Improvements: User's manual for FDNS V.4.5
NASA Technical Reports Server (NTRS)
Chen, Yen-Sen; Shang, Huan-Min; Wei, Hong; Liu, Jiwen
1998-01-01
The major emphasis of the current effort is the development and validation of an efficient parallel-machine computational model, based on the FDNS code, to analyze the fluid dynamics of a wide variety of liquid jet configurations for general liquid rocket engine injection system applications. This model includes physical models for droplet atomization, breakup/coalescence, evaporation, turbulence mixing and gas-phase combustion. Benchmark validation cases for liquid rocket engine chamber combustion conditions will be performed for model validation purposes. Test cases may include shear coaxial, swirl coaxial and impinging injection systems with combinations of LOX/H2 or LOX/RP-1 propellant injector elements used in rocket engine designs. As a final goal of this project, a well-tested parallel CFD methodology, together with a user's operation guide, will be documented in a final technical report at the end of the proposed research effort.
Amplitude analysis of four-body decays using a massively-parallel fitting framework
NASA Astrophysics Data System (ADS)
Hasse, C.; Albrecht, J.; Alves, A. A., Jr.; d'Argent, P.; Evans, T. D.; Rademacker, J.; Sokoloff, M. D.
2017-10-01
The GooFit Framework is designed to perform maximum-likelihood fits for arbitrary functions on various parallel back ends, for example a GPU. We present an extension to GooFit which adds the functionality to perform time-dependent amplitude analyses of pseudoscalar mesons decaying into four pseudoscalar final states. Benchmarks of this functionality show a significant performance increase when utilizing a GPU compared to a CPU. Furthermore, this extension is employed to study the sensitivity on the D0-D̄0 mixing parameters x and y in a time-dependent amplitude analysis of the decay D0 → K+π-π+π-. Studying a sample of 50 000 events and setting the central values to the world average of x = (0.49 ± 0.15)% and y = (0.61 ± 0.08)%, the statistical sensitivities of x and y are determined to be σ(x) = 0.019% and σ(y) = 0.019%.
A Parallel Multigrid Solver for Viscous Flows on Anisotropic Structured Grids
NASA Technical Reports Server (NTRS)
Prieto, Manuel; Montero, Ruben S.; Llorente, Ignacio M.; Bushnell, Dennis M. (Technical Monitor)
2001-01-01
This paper presents an efficient parallel multigrid solver for speeding up the computation of a 3-D model that treats the flow of a viscous fluid over a flat plate. The main interest of this simulation lies in exhibiting some basic difficulties that prevent optimal multigrid efficiencies from being achieved. As the computing platform, we have used Coral, a Beowulf-class system based on Intel Pentium processors and equipped with GigaNet cLAN and switched Fast Ethernet networks. Our study not only examines the scalability of the solver but also includes a performance evaluation of Coral where the investigated solver has been used to compare several of its design choices, namely, the interconnection network (GigaNet versus switched Fast-Ethernet) and the node configuration (dual nodes versus single nodes). As a reference, the performance results have been compared with those obtained with the NAS-MG benchmark.
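The solver's multigrid core can be sketched for a 1-D Poisson problem. This minimal V-cycle (weighted Jacobi smoothing, injection restriction, linear-interpolation prolongation) only illustrates the method; the paper's 3-D anisotropic, parallel implementation is far more elaborate, and all parameters here are assumptions.

import numpy as np

def vcycle(u, f, nu=3, omega=2.0/3.0):
    n = u.size - 1                       # grid with n cells, h = 1/n
    h2 = 1.0 / n**2
    for _ in range(nu):                  # pre-smoothing: weighted Jacobi
        u[1:-1] += omega * 0.5 * (u[:-2] + u[2:] - 2*u[1:-1] + h2*f[1:-1])
    if n <= 2:
        return u
    r = np.zeros_like(u)                 # residual of -u'' = f
    r[1:-1] = f[1:-1] + (u[:-2] - 2*u[1:-1] + u[2:]) / h2
    rc = r[::2].copy()                   # restriction (injection, for brevity)
    ec = vcycle(np.zeros(n//2 + 1), rc)  # coarse-grid error equation
    u += np.interp(np.linspace(0, 1, n+1), np.linspace(0, 1, n//2+1), ec)
    for _ in range(nu):                  # post-smoothing
        u[1:-1] += omega * 0.5 * (u[:-2] + u[2:] - 2*u[1:-1] + h2*f[1:-1])
    return u

n = 64
u, f = np.zeros(n + 1), np.ones(n + 1)
for _ in range(10):
    u = vcycle(u, f)
print(u[n//2])                           # approx 0.125 for -u'' = 1, u(0)=u(1)=0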
Extending substructure based iterative solvers to multiple load and repeated analyses
NASA Technical Reports Server (NTRS)
Farhat, Charbel
1993-01-01
Direct solvers currently dominate commercial finite element structural software, but do not scale well in the fine granularity regime targeted by emerging parallel processors. Substructure based iterative solvers--often called also domain decomposition algorithms--lend themselves better to parallel processing, but must overcome several obstacles before earning their place in general purpose structural analysis programs. One such obstacle is the solution of systems with many or repeated right hand sides. Such systems arise, for example, in multiple load static analyses and in implicit linear dynamics computations. Direct solvers are well-suited for these problems because after the system matrix has been factored, the multiple or repeated solutions can be obtained through relatively inexpensive forward and backward substitutions. On the other hand, iterative solvers in general are ill-suited for these problems because they often must restart from scratch for every different right hand side. In this paper, we present a methodology for extending the range of applications of domain decomposition methods to problems with multiple or repeated right hand sides. Basically, we formulate the overall problem as a series of minimization problems over K-orthogonal and supplementary subspaces, and tailor the preconditioned conjugate gradient algorithm to solve them efficiently. The resulting solution method is scalable, whereas direct factorization schemes and forward and backward substitution algorithms are not. We illustrate the proposed methodology with the solution of static and dynamic structural problems, and highlight its potential to outperform forward and backward substitutions on parallel computers. As an example, we show that for a linear structural dynamics problem with 11640 degrees of freedom, every time-step beyond time-step 15 is solved in a single iteration and consumes 1.0 second on a 32 processor iPSC-860 system; for the same problem and the same parallel processor, a pair of forward/backward substitutions at each step consumes 15.0 seconds.
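The reuse of K-orthogonal search directions across right-hand sides can be shown in a few lines: directions generated by earlier conjugate gradient solves are stored (normalized to be A-orthonormal), each new solve starts from the projection of the new right-hand side onto their span, and CG only has to resolve the remainder. This is a generic sketch of such a seeded CG on a dense test matrix, not the authors' domain-decomposition solver.

import numpy as np

def cg(A, b, x0, tol=1e-10, store=None):
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    while np.linalg.norm(r) > tol * np.linalg.norm(b):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)
        if store is not None:
            store.append(p / np.sqrt(p @ Ap))    # A-orthonormal direction
        x += a * p
        r_new = r - a * Ap
        p = r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return x

rng = np.random.default_rng(1)
M = rng.standard_normal((80, 80)); A = M @ M.T + 80 * np.eye(80)
dirs = []
x1 = cg(A, rng.standard_normal(80), np.zeros(80), store=dirs)
P = np.array(dirs).T                     # columns span the earlier Krylov space
b2 = rng.standard_normal(80)
x0 = P @ (P.T @ b2)                      # seed: A-norm-optimal guess in span(P)
x2 = cg(A, b2, x0)                       # the remaining CG work is much smaller

Because P.T @ A @ P is the identity by construction, the seed is the minimizer of the A-norm error over the stored subspace, which is exactly the supplementary-subspace minimization the abstract describes.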
NASA Astrophysics Data System (ADS)
Cowdery, E.; Dietze, M.
2017-12-01
As atmospheric carbon dioxide levels continue to increase, it is critical that terrestrial ecosystem models can accurately predict ecological responses to the changing environment. Current predictions of net primary productivity (NPP) in response to elevated atmospheric CO2 concentration are highly variable and contain a considerable amount of uncertainty. Benchmarking model predictions against data is necessary to assess their ability to replicate observed patterns, but also to identify and evaluate the assumptions causing inter-model differences. We have implemented a novel benchmarking workflow as part of the Predictive Ecosystem Analyzer (PEcAn) that is automated, repeatable, and generalized to incorporate different sites and ecological models. Building on the recent Free-Air CO2 Enrichment Model Data Synthesis (FACE-MDS) project, we used observational data from the FACE experiments to test this flexible, extensible benchmarking approach aimed at providing repeatable tests of model process representation that can be performed quickly and frequently. Model performance assessments are often limited to traditional residual error analysis; however, this can result in a loss of critical information. Models that fail tests of relative measures of fit may still perform well under measures of absolute fit and mathematical similarity. This implies that models that are discounted as poor predictors of ecological productivity may still be capturing important patterns. Conversely, models that have been found to be good predictors of productivity may be hiding error in their sub-processes that results in the right answers for the wrong reasons. Our suite of tests has not only highlighted process-based sources of uncertainty in model productivity calculations, it has also quantified the patterns and scale of this error. Combining these findings with PEcAn's model sensitivity analysis and variance decomposition strengthens our ability to identify which processes need further study and additional data constraints. This can be used to inform future experimental design and, in turn, can provide an informative starting point for data assimilation.
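The point about relative versus absolute measures of fit can be made concrete with a small scoring helper. The metric set below (RMSE, bias, correlation, normalized RMSE) is an assumed illustration of such a benchmarking suite, not PEcAn's actual scoring code.

import numpy as np

def benchmark_scores(pred, obs):
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    resid = pred - obs
    return {
        "rmse": float(np.sqrt(np.mean(resid**2))),        # absolute fit
        "bias": float(np.mean(resid)),                    # systematic offset
        "corr": float(np.corrcoef(pred, obs)[0, 1]),      # pattern similarity
        "nrmse": float(np.sqrt(np.mean(resid**2)) / np.std(obs)),  # relative fit
    }

obs = np.array([410.0, 530.0, 620.0, 700.0])   # hypothetical NPP observations
pred = obs * 1.15                              # model with a multiplicative bias
print(benchmark_scores(pred, obs))             # poor RMSE/bias, perfect correlation

A model like this one fails the absolute-fit tests while capturing the observed pattern exactly, which is the situation the abstract argues residual analysis alone would mischaracterize.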
NASA Astrophysics Data System (ADS)
Hadade, Ioan; di Mare, Luca
2016-08-01
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
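The Roofline correlation mentioned above amounts to a one-line bound: attainable GFLOP/s = min(peak compute, arithmetic intensity x peak bandwidth). A tiny helper with machine figures assumed roughly for that hardware era:

def roofline(ai, peak_gflops, bw_gbs):
    # ai: arithmetic intensity of the kernel in flops/byte
    return min(peak_gflops, ai * bw_gbs)

# assumed peak numbers, loosely of the Sandy Bridge / Xeon Phi generation
for name, peak, bw in [("multicore CPU", 330.0, 50.0), ("Xeon Phi", 2000.0, 160.0)]:
    for ai in (0.5, 2.0, 8.0):
        print(name, ai, roofline(ai, peak, bw), "GFLOP/s")

Kernels whose measured throughput sits near this bound are bandwidth- or compute-limited as appropriate, which is how the paper asserts the efficiency of its SIMD and threading optimisations.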
Multiprocessor smalltalk: Implementation, performance, and analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pallas, J.I.
1990-01-01
Multiprocessor Smalltalk demonstrates the value of object-oriented programming on a multiprocessor. Its implementation and analysis shed light on three areas: concurrent programming in an object-oriented language without special extensions, implementation techniques for adapting to multiprocessors, and performance factors in the resulting system. Adding parallelism to Smalltalk code is easy, because programs already use control abstractions like iterators. Smalltalk's basic control and concurrency primitives (lambda expressions, processes and semaphores) can be used to build parallel control abstractions, including parallel iterators, parallel objects, atomic objects, and futures. Language extensions for concurrency are not required. This implementation demonstrates that it is possible to build an efficient parallel object-oriented programming system and illustrates techniques for doing so. Three modification tools--serialization, replication, and reorganization--adapted the Berkeley Smalltalk interpreter to the Firefly multiprocessor. Multiprocessor Smalltalk's performance shows that the combination of multiprocessing and object-oriented programming can be effective: speedups (relative to the original serial version) exceed 2.0 for five processors on all the benchmarks; the median efficiency is 48%. Analysis shows both where performance is lost and how to improve and generalize the experimental results. Changes in the interpreter to support concurrency add at most 12% overhead; better access to per-process variables could eliminate much of that. Changes in the user code to express concurrency add as much as 70% overhead; this overhead could be reduced to 54% if blocks (lambda expressions) were reentrant. Performance is also lost when the program cannot keep all five processors busy.
A Parallel Processing Algorithm for Remote Sensing Classification
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
2005-01-01
A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level computers using the Linux operating system. For example, on the Medusa cluster at NASA/GSFC, this provides supercomputing performance, 130 Gflops (Linpack benchmark), at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but which with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular, I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
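The decomposition described, independent classification tasks fanned out across workers, looks roughly like the following, with scikit-learn's SVC standing in for the paper's SVM implementation. The tiling of the hyperspectral cube, the band count, and all sizes are assumptions.

import numpy as np
from multiprocessing import Pool
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 50))          # 50 spectral bands (assumed)
y_train = (X_train[:, 0] > 0).astype(int)         # stand-in for ground truth
clf = SVC(kernel="rbf").fit(X_train, y_train)     # train once on labeled pixels

def classify_tile(tile):
    # each tile of pixels is an independent task: ideal for cluster fan-out
    return clf.predict(tile)

if __name__ == "__main__":
    tiles = [rng.standard_normal((1000, 50)) for _ in range(16)]
    with Pool(4) as pool:
        labels = pool.map(classify_tile, tiles)   # tiles classified in parallel
    print(sum(l.size for l in labels), "pixels classified")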
DOE Office of Scientific and Technical Information (OSTI.GOV)
Alan Black; Arnis Judzis
2003-10-01
This document details the progress to date on the OPTIMIZATION OF DEEP DRILLING PERFORMANCE--DEVELOPMENT AND BENCHMARK TESTING OF ADVANCED DIAMOND PRODUCT DRILL BITS AND HP/HT FLUIDS TO SIGNIFICANTLY IMPROVE RATES OF PENETRATION contract for the year starting October 2002 through September 2003. The industry cost-shared program aims to benchmark drilling rates of penetration in selected simulated deep formations and to significantly improve ROP through a team development of aggressive diamond product drill bit--fluid system technologies. Overall the objectives are as follows: Phase 1--Benchmark ''best in class'' diamond and other product drilling bits and fluids and develop concepts for a next level of deep drilling performance; Phase 2--Develop advanced smart bit--fluid prototypes and test at large scale; and Phase 3--Field trial smart bit--fluid concepts, modify as necessary and commercialize products. Accomplishments to date include the following: 4Q 2002--Project started; Industry Team was assembled; Kick-off meeting was held at DOE Morgantown; 1Q 2003--Engineering meeting was held at Hughes Christensen, The Woodlands, Texas, to prepare preliminary plans for development and testing and review equipment needs; Operators started sending information regarding their needs for deep drilling challenges and priorities for the large-scale testing experimental matrix; Aramco joined the Industry Team as DEA 148 objectives paralleled the DOE project; 2Q 2003--Engineering and planning for high pressure drilling at TerraTek commenced; 3Q 2003--Continuation of engineering and design work for high pressure drilling at TerraTek; Baker Hughes INTEQ Drilling Fluids and Hughes Christensen commenced planning for Phase 1 testing--recommendations for bits and fluids.
Brownian motion properties of optoelectronic random bit generators based on laser chaos.
Li, Pu; Yi, Xiaogang; Liu, Xianglian; Wang, Yuncai; Wang, Yongge
2016-07-11
The nondeterministic properties of an optoelectronic random bit generator (RBG) based on laser chaos are experimentally analyzed from the two aspects of the central limit theorem and the law of the iterated logarithm. The random bits are extracted from an optical-feedback chaotic laser diode using a multi-bit extraction technique in the electrical domain. Our experimental results demonstrate that the generated random bits have no statistical distance from Brownian motion, and moreover they pass the state-of-the-art industry-benchmark statistical test suite (NIST SP800-22). All of this gives mathematically provable evidence that an ultrafast random bit generator based on laser chaos can be used as a nondeterministic random bit source.
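The law-of-the-iterated-logarithm check described can be reproduced in a few lines: map bits to +/-1 steps, form the random walk, and compare its running extremes with the LIL envelope sqrt(2 n ln ln n). A toy version using Python's PRNG in place of the chaotic-laser bit source:

import numpy as np

rng = np.random.default_rng(42)
bits = rng.integers(0, 2, size=1_000_000)
walk = np.cumsum(2 * bits - 1)              # +/-1 steps: a Brownian-motion-like path

n = np.arange(16, bits.size + 1)            # skip small n, where ln ln n is noisy
env = np.sqrt(2.0 * n * np.log(np.log(n)))  # LIL envelope
print(np.max(np.abs(walk[15:]) / env))      # stays near or below 1 for good bits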
Wilson Dslash Kernel From Lattice QCD Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joo, Balint; Smelyanskiy, Mikhail; Kalamkar, Dhiraj D.
2015-07-01
Lattice Quantum Chromodynamics (LQCD) is a numerical technique used for calculations in theoretical nuclear and high energy physics. LQCD is traditionally one of the first applications ported to many new high performance computing architectures, and indeed LQCD practitioners have been known to design and build custom LQCD computers. Lattice QCD kernels are frequently used as benchmarks (e.g. 168.wupwise in the SPEC suite) and are generally well understood, and as such are ideal to illustrate several optimization techniques. In this chapter we detail our work in optimizing the Wilson-Dslash kernels for the Intel Xeon Phi; however, as we will show, the technique gives excellent performance on the regular Xeon architecture as well.
Dynamic Curvature Steering Control for Autonomous Vehicle: Performance Analysis
NASA Astrophysics Data System (ADS)
Aizzat Zakaria, Muhammad; Zamzuri, Hairi; Amri Mazlan, Saiful
2016-02-01
This paper discusses the design of a dynamic curvature steering control for an autonomous vehicle. Both the lateral control and the longitudinal control are discussed. The controller is designed based on a dynamic curvature calculation to estimate the path condition and modify the vehicle speed and steering wheel angle accordingly. Simulation results are presented to show the capability of the controller to track the reference path. The controller is able to predict the path and modify the vehicle speed to suit the path condition. The effectiveness of the controller is shown whereby performance identical to the benchmark is achieved, but with extra curvature adaptation capabilities.
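Curvature-based speed adaptation of this kind amounts to bounding lateral acceleration: given an estimated path curvature kappa, command v = sqrt(a_lat_max/|kappa|), capped at the cruise speed. The sketch below estimates kappa with the Menger three-point formula; the acceleration limit and speeds are assumed values, not the paper's controller gains.

import numpy as np

def menger_curvature(p1, p2, p3):
    # curvature of the circle through three points: 4*area / (|a||b||c|)
    area = 0.5 * abs((p2[0]-p1[0])*(p3[1]-p1[1]) - (p2[1]-p1[1])*(p3[0]-p1[0]))
    a = np.linalg.norm(p2 - p1)
    b = np.linalg.norm(p3 - p2)
    c = np.linalg.norm(p3 - p1)
    return 4.0 * area / (a * b * c)

def speed_for_curvature(kappa, a_lat_max=2.0, v_max=20.0):
    # keep lateral acceleration v^2*|kappa| below a_lat_max (limits assumed)
    return min(v_max, (a_lat_max / max(abs(kappa), 1e-6)) ** 0.5)

k = menger_curvature(np.array([0., 0.]), np.array([5., 0.2]), np.array([10., 0.8]))
print(k, speed_for_curvature(k))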
PETSc Users Manual Revision 3.3
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, S.; Brown, J.; Buschelman, K.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear and nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed within parallel application codes, such as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to employ the level of abstraction that is most appropriate for a particular problem. By using techniques of object-oriented programming, PETSc provides enormous flexibility for users. PETSc is a sophisticated set of software tools; as such, for some users it initially has a much steeper learning curve than a simple subroutine library. In particular, for individuals without some computer science background, experience programming in C, C++ or Fortran, and experience using a debugger such as gdb or dbx, it may require a significant amount of time to take full advantage of the features that enable efficient software use. However, the power of the PETSc design and the algorithms it incorporates may make the efficient implementation of many application codes simpler than "rolling them yourself." For many tasks a package such as MATLAB is often the best tool; PETSc is not intended for the classes of problems for which effective MATLAB code can be written. PETSc also has a MATLAB interface, so portions of your code can be written in MATLAB to "try out" the PETSc solvers; the resulting code will not be scalable, however, because MATLAB is inherently not scalable. PETSc should not be used to attempt to provide a "parallel linear solver" in an otherwise sequential code: certainly not all parts of a previously sequential code need to be parallelized, but the matrix generation portion must be parallelized to expect any kind of reasonable performance. Do not expect to generate your matrix sequentially and then "use PETSc" to solve the linear system in parallel. Since PETSc is under continued development, small changes in usage and calling sequences of routines will occur. PETSc is supported; see the web site http://www.mcs.anl.gov/petsc for information on contacting support. A list of publications and web sites featuring work that involves PETSc may be found at http://www.mcs.anl.gov/petsc/publications. We welcome any reports of corrections for this document.
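For orientation, the building blocks the manual describes look like the following through petsc4py, the Python bindings: create a parallel sparse matrix, fill the locally owned rows, and hand it to a Krylov solver configured from the command line. The problem (a tridiagonal system), the sizes, and the options are arbitrary illustration choices.

import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n], nnz=3)       # sparse matrix, 3 nonzeros/row
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):                  # each MPI rank fills its own rows
    if i > 0:
        A[i, i-1] = -1.0
    A[i, i] = 2.0
    if i < n - 1:
        A[i, i+1] = -1.0
A.assemble()

x, b = A.createVecs()
b.set(1.0)
ksp = PETSc.KSP().create()                     # Krylov solver object
ksp.setOperators(A)
ksp.setFromOptions()                           # e.g. -ksp_type cg -pc_type jacobi
ksp.solve(b, x)
PETSc.Sys.Print("iterations:", ksp.getIterationNumber())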
PETSc Users Manual Revision 3.4
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, S.; Brown, J.; Buschelman, K.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
PETSc Users Manual Revision 3.5
DOE Office of Scientific and Technical Information (OSTI.GOV)
Balay, S.; Abhyankar, S.; Adams, M.
This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication.
Benchmark of the local drift-kinetic models for neoclassical transport simulation in helical plasmas
NASA Astrophysics Data System (ADS)
Huang, B.; Satake, S.; Kanno, R.; Sugama, H.; Matsuoka, S.
2017-02-01
The benchmarks of neoclassical transport codes based on several local drift-kinetic models are reported here. The drift-kinetic models are zero orbit width (ZOW), zero magnetic drift, DKES-like, and global, as classified in Matsuoka et al. [Phys. Plasmas 22, 072511 (2015)]. The magnetic geometries of the Helically Symmetric Experiment, the Large Helical Device (LHD), and Wendelstein 7-X are employed in the benchmarks. It is found that the assumption of E×B incompressibility causes discrepancies in the neoclassical radial flux and parallel flow among the models when E×B is sufficiently large compared to the magnetic drift velocities, for example Mp ≤ 0.4, where Mp is the poloidal Mach number. On the other hand, when E×B and the magnetic drift velocities are comparable, the tangential magnetic drift, which is included in both the global and ZOW models, fills the role of suppressing the unphysical peaking of neoclassical radial fluxes found in the other local models at Er ≃ 0. In low collisionality plasmas, in particular, the tangential drift effect works well to suppress such unphysical behavior of the radial transport in the simulations. It is demonstrated that the ZOW model has the advantage of mitigating the unphysical behavior in the several magnetic geometries, and that it also implements the evaluation of the bootstrap current in LHD at low computational cost compared to the global model.
Field Test of a Hybrid Finite-Difference and Analytic Element Regional Model.
Abrams, D B; Haitjema, H M; Feinstein, D T; Hunt, R J
2016-01-01
Regional finite-difference models often have cell sizes that are too large to sufficiently model well-stream interactions. Here, a steady-state hybrid model is applied whereby the upper layer or layers of a coarse MODFLOW model are replaced by the analytic element model GFLOW, which represents surface waters and wells as line and point sinks. The two models are coupled by transferring cell-by-cell leakage obtained from the original MODFLOW model to the bottom of the GFLOW model. A real-world test of the hybrid model approach is applied on a subdomain of an existing model of the Lake Michigan Basin. The original (coarse) MODFLOW model consists of six layers, the top four of which are aggregated into GFLOW as a single layer, while the bottom two layers remain part of MODFLOW in the hybrid model. The hybrid model and a refined "benchmark" MODFLOW model simulate similar baseflows. The hybrid and benchmark models also simulate similar baseflow reductions due to nearby pumping when the well is located within the layers represented by GFLOW. However, the benchmark model requires refinement of the model grid in the local area of interest, while the hybrid approach uses a gridless top layer and is thus unaffected by grid discretization errors. The hybrid approach is well suited to facilitate cost-effective retrofitting of existing coarse grid MODFLOW models commonly used for regional studies because it leverages the strengths of both finite-difference and analytic element methods for predictions in mildly heterogeneous systems that can be simulated with steady-state conditions. © 2015, National Ground Water Association.
Summary of Research 1997, Department of Computer Science.
1999-01-01
This report contains summaries of research projects in the Department of Computer Science. A list of recent publications is also included, which consists of conference ... parallel programming. Recently, in a joint research project between NPS and the Russian Academy of Sciences Systems Programming Institute in Moscow ...
Micromechanical apparatus for measurement of forces
Tanner, Danelle Mary; Allen, James Joe
2004-05-25
A new class of micromechanical dynamometers has been disclosed which are particularly suited to fabrication in parallel with other microelectromechanical apparatus. Forces in the microNewton regime and below can be measured with such dynamometers which are based on a high-compliance deflection element (e.g. a ring or annulus) suspended above a substrate for deflection by an applied force, and one or more distance scales for optically measuring the deflection.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Muller, U.A.; Baumle, B.; Kohler, P.
1992-10-01
Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.
P43-S Computational Biology Applications Suite for High-Performance Computing (BioHPC.net)
Pillardy, J.
2007-01-01
One of the challenges of high-performance computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use Web interface. Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, and MrBayes. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 8500 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system, which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, are both using our system for storage and data analysis. The system consists of a Web server running the interface (ASP.NET C#), Microsoft SQL server (ADO.NET), compute cluster running Microsoft Windows, ftp server, and file server. Users can interact with their jobs and data via a Web browser, ftp, or e-mail. The interface is accessible at http://cbsuapps.tc.cornell.edu/.
GPU-based ultra-fast dose calculation using a finite size pencil beam model.
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B
2009-10-21
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using a NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
Vigelius, Matthias; Meyer, Bernd
2012-01-01
For many biological applications, a macroscopic (deterministic) treatment of reaction-drift-diffusion systems is insufficient. Instead, one has to properly handle the stochastic nature of the problem and generate true sample paths of the underlying probability distribution. Unfortunately, stochastic algorithms are computationally expensive and, in most cases, the large number of participating particles renders the relevant parameter regimes inaccessible. In an attempt to address this problem we present a genuine stochastic, multi-dimensional algorithm that solves the inhomogeneous, non-linear, drift-diffusion problem on a mesoscopic level. Our method improves on existing implementations in being multi-dimensional and handling inhomogeneous drift and diffusion. The algorithm is well suited for an implementation on data-parallel hardware architectures such as general-purpose graphics processing units (GPUs). We integrate the method into an operator-splitting approach that decouples chemical reactions from the spatial evolution. We demonstrate the validity and applicability of our algorithm with a comprehensive suite of standard test problems that also serve to quantify the numerical accuracy of the method. We provide a freely available, fully functional GPU implementation. Integration into Inchman, a user-friendly web service, that allows researchers to perform parallel simulations of reaction-drift-diffusion systems on GPU clusters is underway. PMID:22506001
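The operator-splitting idea, advancing reactions and spatial transport with separate solvers each step, is easy to show in one dimension. The sketch below uses Strang splitting for a toy first-order decay reaction plus explicit diffusion; the stochastic chemistry of the paper is replaced by a deterministic update for brevity, and all coefficients are assumed.

import numpy as np

def strang_step(u, dt, dx, D=1.0, k=0.5):
    u = u * np.exp(-k * dt / 2.0)                       # half-step reaction
    lap = np.zeros_like(u)
    lap[1:-1] = (u[:-2] - 2*u[1:-1] + u[2:]) / dx**2    # diffusion (explicit)
    u = u + dt * D * lap                                # full transport step
    u = u * np.exp(-k * dt / 2.0)                       # half-step reaction
    return u

x = np.linspace(0, 1, 101)
u = np.exp(-200 * (x - 0.5)**2)
dt, dx = 2e-5, x[1] - x[0]                              # dt < dx^2/(2D) for stability
for _ in range(1000):
    u = strang_step(u, dt, dx)
print(u.max())

In the paper's setting, the reaction half-steps would instead invoke the stochastic chemical solver while the transport step runs in parallel on the GPU.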
Performance Comparison of Big Data Analytics With NEXUS and Giovanni
NASA Astrophysics Data System (ADS)
Jacob, J. C.; Huang, T.; Lynnes, C.
2016-12-01
NEXUS is an emerging data-intensive analysis framework developed with a new approach for handling science data that enables large-scale data analysis; it is available as open source. We compare the performance of NEXUS and Giovanni for three statistics algorithms applied to NASA datasets. Giovanni is a statistics web service at the NASA Distributed Active Archive Centers (DAACs). NEXUS is a cloud-computing environment developed at JPL and built on Apache Solr, Cassandra, and Spark. We compute a global time-averaged map, a correlation map, and an area-averaged time series. The first two algorithms average over time to produce a value for each pixel in a 2-D map. The third algorithm averages spatially to produce a single value for each time step. This talk reports benchmark comparison findings that indicate a 15x speedup with NEXUS over Giovanni in computing an area-averaged time series of daily precipitation rate for the Tropical Rainfall Measuring Mission (TRMM, 0.25 degree spatial resolution) over the Continental United States for 14 years (2000-2014), with 64-way parallelism and 545 tiles per granule. Sixteen-way parallelism with 16 tiles per granule worked best with NEXUS for computing an 18-year (1998-2015) TRMM daily precipitation global time-averaged map (a 2.5x speedup) and an 18-year global map of the correlation between TRMM daily precipitation and TRMM real-time daily precipitation (a 7x speedup). These and other benchmark results will be presented along with key lessons learned in applying the NEXUS tiling approach to big data analytics in the cloud.
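The third algorithm, an area-averaged time series, is the usual cosine-latitude-weighted reduction; a NumPy sketch over an assumed (time, lat, lon) array, with a small random stand-in for the TRMM data, is:

import numpy as np

def area_avg_series(data, lats):
    # data: (time, lat, lon); cos(latitude) weights approximate grid-cell areas
    w = np.cos(np.deg2rad(lats))[None, :, None]
    return np.nansum(data * w, axis=(1, 2)) / np.sum(
        w * np.isfinite(data), axis=(1, 2))

t, lats, lons = 30, np.arange(-89.5, 90.0), np.arange(0.0, 360.0)
precip = np.random.default_rng(0).random((t, lats.size, lons.size)).astype("f4")
print(area_avg_series(precip, lats)[:3])

In NEXUS the same reduction is mapped over tiles stored in Cassandra and combined in Spark, which is where the reported parallel speedups come from.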
Perera, Ajith; Gauss, Jürgen; Verma, Prakash; Morales, Jorge A
2017-04-28
We present a parallel implementation to compute electron spin resonance g-tensors at the coupled-cluster singles and doubles (CCSD) level which employs the ACES III domain-specific software tools for scalable parallel programming, i.e., the super instruction architecture language and processor (SIAL and SIP), respectively. A unique feature of the present implementation is the exact (not approximated) inclusion of the five one- and two-particle contributions to the g-tensor [i.e., the mass correction, one- and two-particle paramagnetic spin-orbit, and one- and two-particle diamagnetic spin-orbit terms]. As in a previous implementation with effective one-electron operators [J. Gauss et al., J. Phys. Chem. A 113, 11541-11549 (2009)], our implementation utilizes analytic CC second derivatives and therefore classifies as a true CC linear-response treatment. As such, it can unambiguously appraise the accuracy of less costly effective one-particle schemes and provide a rationale for their widespread use. We have considered a large selection of radicals used previously for benchmarking purposes, including those studied in earlier work, and conclude that at the CCSD level the effective one-particle scheme satisfactorily captures the two-particle effects at a much lower cost than the rigorous two-particle scheme. With respect to the performance of density functional theory (DFT), we note that results obtained with the B3LYP functional exhibit the best agreement with our CCSD results. However, in general, the CCSD results agree better with the experimental data than the best DFT/B3LYP results, although in most cases within the rather large experimental error bars.
Holographic Associative Memory Employing Phase Conjugation
NASA Astrophysics Data System (ADS)
Soffer, B. H.; Marom, E.; Owechko, Y.; Dunning, G.
1986-12-01
The principle of information retrieval by association has been suggested as a basis for parallel computing and as the process by which human memory functions.1 Various associative processors have been proposed that use electronic or optical means. Optical schemes,2-7 in particular those based on holographic principles,8 are well suited to associative processing because of their high parallelism and information throughput. Previous workers8 demonstrated that holographically stored images can be recalled by using relatively complicated reference images, but did not utilize nonlinear feedback to reduce the large cross talk that results when multiple objects are stored and a partial or distorted input is used for retrieval. These earlier approaches were limited in their ability to reconstruct the output object faithfully from a partial input.
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion
NASA Technical Reports Server (NTRS)
Bailey, David H.; Ferguson, Helaman R. P.
1988-01-01
Techniques are described for computing matrix inverses by algorithms that are highly suited to massively parallel computation. The techniques are based on an algorithm suggested by Strassen (1969). Variations of this scheme use matrix Newton iterations and other methods to improve the numerical stability while at the same time preserving a very high level of parallelism. One-processor Cray-2 implementations of these schemes range from one that is up to 55 percent faster than a conventional library routine to one that is slower than a library routine but achieves excellent numerical stability. The problem of computing the solution to a single set of linear equations is discussed, and it is shown that this problem can also be solved efficiently using these techniques.
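A minimal sketch of the matrix Newton iteration at the heart of such schemes, assuming the classical starting guess X0 = A^T / (||A||_1 ||A||_inf): each step costs two matrix multiplications, so it inherits the parallelism (and, if desired, Strassen multiplication) of matrix-matrix products. The tolerance and iteration limit are illustrative choices, not the paper's.

    import numpy as np

    def newton_inverse(A, tol=1e-12, max_iter=100):
        # Newton (Schulz) iteration X_{k+1} = X_k (2I - A X_k); converges
        # quadratically from the classical starting point below
        X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
        I = np.eye(A.shape[0])
        for _ in range(max_iter):
            R = I - A @ X                        # residual drives the update
            if np.linalg.norm(R, 'fro') < tol:
                break
            X = X + X @ R                        # equals X (2I - A X)
        return X

    # quick check against a library inverse:
    # A = np.random.rand(100, 100) + 100 * np.eye(100)
    # assert np.allclose(newton_inverse(A), np.linalg.inv(A))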
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Sripathi, Vamsi; Mills, Richard T
2013-01-01
Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with a single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level, where a set of designated processes participate in the I/O process by splitting the I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to the rest of the group. This approach resulted in over 25x speedup in HDF I/O read performance and 3x speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library, SCORPIO (SCalable block-ORiented Parallel I/O), that incorporates our optimized two-phase I/O approach. The library provides a simplified higher-level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5), and implements optimized I/O access patterns that can scale to larger numbers of processors. Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before with the added flexibility of being applicable to a wider range of I/O-intensive applications.
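A minimal mpi4py sketch of the two-phase pattern described above, assuming a placeholder group size and file format (SCORPIO itself sits on top of HDF5): the global communicator is split into sub-communicators, each group root gathers its group's data in the communication phase and then performs the disk access for the whole group. A read follows the mirror-image pattern, with the root reading and then scattering to the group.

    from mpi4py import MPI
    import numpy as np

    GROUP_SIZE = 64                                # placeholder group size
    world = MPI.COMM_WORLD
    color = world.rank // GROUP_SIZE               # which I/O group this rank joins
    sub = world.Split(color, key=world.rank)       # one sub-communicator per group

    local = np.full(1000, world.rank, dtype=np.float64)   # this rank's data block
    gathered = sub.gather(local, root=0)           # phase 1: communication

    if sub.rank == 0:                              # phase 2: disk I/O by group root
        block = np.concatenate(gathered)
        np.save(f"out_{color:04d}.npy", block)     # stand-in for the HDF5 write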
Scaling Semantic Graph Databases in Size and Performance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morari, Alessandro; Castellana, Vito G.; Villa, Oreste
In this paper we present SGEM, a full software system for accelerating large-scale semantic graph databases on commodity clusters. Unlike current approaches, SGEM addresses semantic graph databases by employing graph methods at all levels of the stack. On one hand, this allows exploiting the space efficiency of graph data structures and the inherent parallelism of graph algorithms. These features adapt well to the increasing system memory and core counts of modern commodity clusters. On the other hand, these systems are optimized for regular computation and batched data transfers, while graph methods usually are irregular and generate fine-grained data accesses with poor spatial and temporal locality. Our framework comprises a SPARQL to data-parallel C compiler, a library of parallel graph methods and a custom, multithreaded runtime system. We introduce our stack, motivate its advantages with respect to other solutions and show how we solved the challenges posed by irregular behaviors. We present the results of our software stack on the Berlin SPARQL benchmarks with datasets up to 10 billion triples (a triple corresponds to a graph edge), demonstrating scaling in dataset size and in performance as more nodes are added to the cluster.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.
1987-02-15
For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single-processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. The usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32-processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four-index integral transformation.
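The bottleneck step can be written down compactly. The standard factorization below replaces the naive O(N^8) four-index contraction with four successive quarter transformations of O(N^5) cost each; on a distributed-memory machine each quarter step is a large sharded matrix multiplication. Random arrays stand in for the AO integrals and MO coefficients.

    import numpy as np

    n = 20
    ao = np.random.rand(n, n, n, n)     # (pq|rs), atomic-orbital basis
    C = np.random.rand(n, n)            # molecular-orbital coefficients

    t = np.einsum('pqrs,pi->iqrs', ao, C)    # quarter transformation 1
    t = np.einsum('iqrs,qj->ijrs', t, C)     # quarter transformation 2
    t = np.einsum('ijrs,rk->ijks', t, C)     # quarter transformation 3
    mo = np.einsum('ijks,sl->ijkl', t, C)    # (ij|kl), molecular-orbital basis

    # The unfactorized contraction 'pqrs,pi,qj,rk,sl->ijkl' gives the same
    # result but scales as O(N^8) if evaluated as a single nested sum.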
NASA's Black Marble Nighttime Lights Product Suite
NASA Technical Reports Server (NTRS)
Wang, Zhuosen; Sun, Qingsong; Seto, Karen C.; Oda, Tomohiro; Wolfe, Robert E.; Sarkar, Sudipta; Stevens, Joshua; Ramos Gonzalez, Olga M.; Detres, Yasmin; Esch, Thomas;
2018-01-01
NASA's Black Marble nighttime lights product suite (VNP46) is available at 500 meters resolution since January 2012 with data from the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB) onboard the Suomi National Polar-orbiting Platform (SNPP). The retrieval algorithm, developed and implemented for routine global processing at NASA's Land Science Investigator-led Processing System (SIPS), utilizes all high-quality, cloud-free, atmospheric-, terrain-, vegetation-, snow-, lunar-, and stray light-corrected radiances to estimate daily nighttime lights (NTL) and other intrinsic surface optical properties. Key algorithm enhancements include: (1) lunar irradiance modeling to resolve non-linear changes in phase and libration; (2) vector radiative transfer and lunar bidirectional surface anisotropic reflectance modeling to correct for atmospheric and BRDF (Bidirectional Reflectance Distribution Function) effects; (3) geometric-optical and canopy radiative transfer modeling to account for seasonal variations in NTL; and (4) temporal gap-filling to reduce persistent data gaps. Extensive benchmark tests at representative spatial and temporal scales were conducted on the VNP46 time series record to characterize the uncertainties stemming from upstream data sources. Initial validation results are presented together with example case studies illustrating the scientific utility of the products. This includes an evaluation of temporal patterns of NTL dynamics associated with urbanization, socioeconomic variability, cultural characteristics, and displaced populations affected by conflict. Current and planned activities under the Group on Earth Observations (GEO) Human Planet Initiative are aimed at evaluating the products at different geographic locations and time periods representing the full range of retrieval conditions.
NASA Astrophysics Data System (ADS)
Lauritzen, P. H.; Ullrich, P. A.; Jablonowski, C.; Bosler, P. A.; Calhoun, D.; Conley, A. J.; Enomoto, T.; Dong, L.; Dubey, S.; Guba, O.; Hansen, A. B.; Kaas, E.; Kent, J.; Lamarque, J.-F.; Prather, M. J.; Reinert, D.; Shashkin, V. V.; Skamarock, W. C.; Sørensen, B.; Taylor, M. A.; Tolstykh, M. A.
2013-09-01
Recently, a standard test case suite for 2-D linear transport on the sphere was proposed to assess important aspects of accuracy in geophysical fluid dynamics with a "minimal" set of idealized model configurations/runs/diagnostics. Here we present results from 19 state-of-the-art transport scheme formulations based on finite-difference/finite-volume methods as well as emerging (in the context of atmospheric/oceanographic sciences) Galerkin methods. Discretization grids range from traditional regular latitude-longitude grids to more isotropic domain discretizations such as icosahedral and cubed-sphere tessellations of the sphere. The schemes are evaluated using a wide range of diagnostics in idealized flow environments. Accuracy is assessed in single- and two-tracer configurations using conventional error norms as well as novel diagnostics designed for climate and climate-chemistry applications. In addition, algorithmic considerations that may be important for computational efficiency are reported on. The latter is inevitably computing platform dependent. The ensemble of results from a wide variety of schemes presented here helps shed light on the ability of the test case suite diagnostics and flow settings to discriminate between algorithms and provide insights into accuracy in the context of global atmospheric/ocean modeling. A library of benchmark results is provided to facilitate scheme intercomparison and model development. Simple software and data sets are made available to facilitate the process of model evaluation and scheme intercomparison.
NASA Astrophysics Data System (ADS)
Lauritzen, P. H.; Ullrich, P. A.; Jablonowski, C.; Bosler, P. A.; Calhoun, D.; Conley, A. J.; Enomoto, T.; Dong, L.; Dubey, S.; Guba, O.; Hansen, A. B.; Kaas, E.; Kent, J.; Lamarque, J.-F.; Prather, M. J.; Reinert, D.; Shashkin, V. V.; Skamarock, W. C.; Sørensen, B.; Taylor, M. A.; Tolstykh, M. A.
2014-01-01
Recently, a standard test case suite for 2-D linear transport on the sphere was proposed to assess important aspects of accuracy in geophysical fluid dynamics with a "minimal" set of idealized model configurations/runs/diagnostics. Here we present results from 19 state-of-the-art transport scheme formulations based on finite-difference/finite-volume methods as well as emerging (in the context of atmospheric/oceanographic sciences) Galerkin methods. Discretization grids range from traditional regular latitude-longitude grids to more isotropic domain discretizations such as icosahedral and cubed-sphere tessellations of the sphere. The schemes are evaluated using a wide range of diagnostics in idealized flow environments. Accuracy is assessed in single- and two-tracer configurations using conventional error norms as well as novel diagnostics designed for climate and climate-chemistry applications. In addition, algorithmic considerations that may be important for computational efficiency are reported on. The latter is inevitably computing platform dependent. The ensemble of results from a wide variety of schemes presented here helps shed light on the ability of the test case suite diagnostics and flow settings to discriminate between algorithms and provide insights into accuracy in the context of global atmospheric/ocean modeling. A library of benchmark results is provided to facilitate scheme intercomparison and model development. Simple software and data sets are made available to facilitate the process of model evaluation and scheme intercomparison.
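For reference, the conventional error norms referred to above have standard area-weighted definitions; a minimal rendering of the usual normalized forms (with w the cell areas or quadrature weights, q the numerical tracer field, and q_true the reference) is given below. These are the textbook definitions, not quoted verbatim from the paper.

    import numpy as np

    def l2_error(q, q_true, w):
        # normalized, area-weighted l2 error of a tracer field
        return np.sqrt(np.sum(w * (q - q_true) ** 2) / np.sum(w * q_true ** 2))

    def linf_error(q, q_true):
        # normalized maximum-magnitude error
        return np.max(np.abs(q - q_true)) / np.max(np.abs(q_true))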
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
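The data-parallel core of an SPH step is the per-particle summation over neighbours, which maps naturally to one CUDA thread or one OpenMP loop iteration per particle. The sketch below uses a Gaussian kernel and brute-force neighbour search to stay short; production codes use compact-support spline kernels and cell lists, and the memory layout of the position array is exactly the kind of allocation choice the study examines.

    import numpy as np

    def sph_density(pos, mass, h):
        # rho_i = sum_j m_j W(|r_i - r_j|, h) with a 3-D Gaussian kernel;
        # pos: (N, 3) positions, mass: (N,) masses, h: smoothing length
        r2 = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=-1)
        W = np.exp(-r2 / h**2) / (np.pi ** 1.5 * h ** 3)
        return W @ mass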
Parallelization of Lower-Upper Symmetric Gauss-Seidel Method for Chemically Reacting Flow
NASA Technical Reports Server (NTRS)
Yoon, Seokkwan; Jost, Gabriele; Chang, Sherry
2005-01-01
Development of technologies for exploration of the solar system has revived interest in computational simulation of chemically reacting flows, since planetary probe vehicles exhibit non-equilibrium phenomena during atmospheric entry of a planet or a moon as well as during reentry to Earth. Stability in combustion is essential for new propulsion systems. Numerical solution of real-gas flows often increases computational work by an order of magnitude compared to perfect-gas flow, partly because of the increased complexity of the equations to solve. Recently, as part of Project Columbia, NASA has integrated a cluster of interconnected SGI Altix systems to provide a ten-fold increase in current supercomputing capacity that includes an SGI Origin system. Both the new and existing machines are based on a cache-coherent non-uniform memory access architecture. The Lower-Upper Symmetric Gauss-Seidel (LU-SGS) relaxation method has been implemented into both perfect- and real-gas flow codes, including the Real-Gas Aerodynamic Simulator (RGAS). However, the vectorized RGAS code runs inefficiently on cache-based shared-memory machines such as the SGI systems. Parallelization of a Gauss-Seidel method is nontrivial due to its sequential nature. The LU-SGS method has been vectorized on an oblique plane in the INS3D-LU code, which has been one of the base codes for the NAS Parallel Benchmarks. The oblique plane has been called a hyperplane by computer scientists. It is straightforward to parallelize a Gauss-Seidel method by partitioning the hyperplanes once they are formed. Another way to parallelize is to schedule processors like a pipeline in software. Both the hyperplane and pipeline methods have been implemented using OpenMP directives. The present paper reports the performance of the parallelized RGAS code on SGI Origin and Altix systems.
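The hyperplane idea is easy to state in code: in a lower-triangular sweep on a structured 3-D grid, a cell (i, j, k) depends only on its (i-1, j, k), (i, j-1, k), and (i, j, k-1) neighbours, all of which lie on the previous hyperplane i + j + k - 1, so every cell on a given plane can be updated concurrently. The sketch uses a generic 7-point model update rather than the RGAS equations; the inner loop is the one an OpenMP directive would parallelize.

    import numpy as np

    def hyperplane_sweep(phi, rhs):
        ni, nj, nk = phi.shape
        for p in range(ni + nj + nk - 2):            # planes done in sequence
            cells = [(i, j, p - i - j)
                     for i in range(ni) for j in range(nj)
                     if 0 <= p - i - j < nk]
            for i, j, k in cells:                    # independent: parallelize here
                nb = sum(phi[t] for t in ((i - 1, j, k), (i, j - 1, k), (i, j, k - 1))
                         if min(t) >= 0)
                phi[i, j, k] = (rhs[i, j, k] + nb) / 6.0
        return phi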
Echegaray, Sebastian; Bakr, Shaimaa; Rubin, Daniel L; Napel, Sandy
2017-10-06
The aim of this study was to develop an open-source, modular, locally run or server-based system for 3D radiomics feature computation that can be used on any computer system and included in existing workflows for understanding associations and building predictive models between image features and clinical data, such as survival. The Quantitative Image Feature Engine (QIFE) exploits various levels of parallelization for use on multiprocessor systems. It consists of a managing framework and four stages: input, pre-processing, feature computation, and output. Each stage contains one or more swappable components, allowing run-time customization. We benchmarked the engine using various levels of parallelization on a cohort of CT scans presenting 108 lung tumors. Two versions of the QIFE have been released: (1) the open-source MATLAB code posted to Github, and (2) a compiled version loaded in a Docker container, posted to DockerHub, which can be easily deployed on any computer. The QIFE processed 108 objects (tumors) in 2:12 (h:mm) using one core, and 1:04 (h:mm) using four cores with object-level parallelization. The QIFE is an open-source feature-extraction framework that focuses on modularity, standards, parallelism, provenance, and integration. Researchers can easily integrate it with their existing segmentation and imaging workflows by creating input and output components that implement their existing interfaces. Computational efficiency can be improved by parallelizing execution at the cost of memory usage. Different parallelization levels provide different trade-offs, and the optimal setting will depend on the size and composition of the dataset to be processed.
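The swappable-component design reduces to a simple contract: every stage is a callable with a common in/out interface, the managing framework composes them, and object-level parallelism maps the composed pipeline over tumors. A minimal sketch of that architecture (the stage names mirror the four stages above; the component implementations are placeholders, not the QIFE API):

    from multiprocessing import Pool

    def run_pipeline(obj, stages):
        # input -> pre-processing -> feature computation -> output
        for stage in stages:
            obj = stage(obj)
        return obj

    def process_cohort(objects, stages, cores=4):
        # object-level parallelization: one tumor per worker at a time
        with Pool(cores) as pool:
            return pool.starmap(run_pipeline, [(o, stages) for o in objects])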
Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments
Yim, Won Cheol; Cushman, John C.
2017-07-22
Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take prohibitively long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs performs searches very rapidly, it still has the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible and used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. In order to accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from 1 to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically and overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. Thus, this freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.
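The query-distribution approach amounts to splitting the input FASTA into roughly equal chunks and running one independent BLAST process per chunk; DCBLAST generates scheduler job scripts to spread these across nodes, whereas the sketch below simply launches local processes. The file names, chunk count, and blastn options are placeholders.

    import subprocess
    from pathlib import Path

    def split_fasta(path, n_chunks, outdir="chunks"):
        # naive round-robin split of FASTA records across n_chunks files
        records = Path(path).read_text().split(">")[1:]
        Path(outdir).mkdir(exist_ok=True)
        files = []
        for i in range(n_chunks):
            f = Path(outdir) / f"query_{i:03d}.fa"
            f.write_text("".join(">" + r for r in records[i::n_chunks]))
            files.append(f)
        return files

    # one independent BLAST process per chunk; a cluster version would wrap
    # each command in a scheduler job script instead
    for i, f in enumerate(split_fasta("query.fa", 16)):
        subprocess.Popen(["blastn", "-query", str(f), "-db", "nt",
                          "-outfmt", "6", "-out", f"hits_{i:03d}.tsv"])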
NASA Astrophysics Data System (ADS)
Abdelmalak, Mansour M.; Geoffroy, Laurent; Angelier, Jacques; Bonin, Bernard; Callot, Jean-Paul; Gelard, Jean-Pierre; Aubourg, Charles
2015-04-01
We characterize and map the stress fields acting during plate breakup along the West Greenland volcanic margin. Interpolated stress fields are based on an inversion of fault-slip data sets and magma-driven fractures, crosscutting mainly an exposed inner seaward-dipping basaltic wedge (i.e., SDRi) segmented along-strike, with differently oriented segments. We identify two distinct tectonic episodes, P1 and P2, which are both syn-magmatic and purely extensional. P1 probably acted as early as the Late Palaeocene. This stress field was first homogeneous, with the minimum principal stress σ3 trending ~N060E, defining a P1A stage. During development of the SDRi, σ3 locally reoriented to become orthogonal to each margin segment (P1B). P1 is coeval with lithosphere breakup and is associated with extension orthogonal to the Labrador-Baffin axis, which is inherited from the Mesozoic. The P1-related dykes constitute a homogeneous HKTP (High-K-Ti-P) suite. This suite displays alkaline affinities and is rich in both LILE and HFSE. A regional and radical change of σ3 to a ~NS trend took place during P2. The P1-P2 transition occurred at ~56-54 Ma, i.e., during magnetic Chron C24R. P2 is associated with only minor extension, and σ3 runs parallel to the North American (NAM)/Greenland kinematic displacement vector. The dykes associated with P2 are quite different and constitute a less homogeneous LKTP (Low-K-Ti-P) suite. This suite is less rich in LILE, yielding poorly fractionated chondrite-normalized REE patterns and HFSE contents similar to E-MORB, with slight positive U-Th and P anomalies. We therefore establish that the minimum horizontal stress σ3 for P1 and P2 is parallel to the relative displacement of Greenland with respect to NAM, but not to its absolute displacement during the Tertiary. Taking into account these results as well as the variations in magma chemistry from P1 to P2, we suggest that tectonic stresses at a volcanic margin could arise from the local dynamics of the melting mantle.
Parallel digital modem using multirate digital filter banks
NASA Technical Reports Server (NTRS)
Sadr, Ramin; Vaidyanathan, P. P.; Raphaeli, Dan; Hinedi, Sami
1994-01-01
A new class of architectures for an all-digital modem is presented in this report. This architecture, referred to as the parallel receiver (PRX), is based on employing multirate digital filter banks (DFBs) to demodulate, track, and detect the received symbol stream. The resulting architecture is derived, and specifications are outlined for designing the DFB for the PRX. The key feature of this approach is a processing rate lower than either the Nyquist rate or the symbol rate, without any degradation in the symbol error rate. Due to the freedom in choosing the processing rate, the designer is able to select and use digital components arbitrarily, independent of the speed of the integrated-circuit technology. The PRX architecture is particularly suited for high data rate applications, and due to the modular structure of the parallel signal path, expansion to even higher data rates is accommodated with ease. Applications of the PRX include gigabit satellite channels, multiple spacecraft, optical links, interactive cable TV, telemedicine, code division multiple access (CDMA) communications, and others.
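The mechanism that lets a filter bank receiver run below the input rate is the polyphase decomposition of its filters: a decimate-by-M FIR is split into M branch filters that each operate at 1/M of the input sample rate. A minimal sketch of that identity follows (filter taps and M are arbitrary here; this is the textbook decomposition, not the PRX design itself).

    import numpy as np

    def polyphase_decimate(x, h, M):
        # y[n] = sum_j h[j] x[nM - j], built from M branch filters that each
        # run at 1/M of the input rate: branch k filters x[mM - k] with h[mM + k]
        branches = []
        for k in range(M):
            u = np.concatenate([np.zeros(k), x])[::M]   # k-delayed, M-downsampled input
            branches.append(np.convolve(u, h[k::M]))    # low-rate branch filtering
        n = max(len(b) for b in branches)
        return sum(np.pad(b, (0, n - len(b))) for b in branches)

    # identity check against filtering at the full rate, then decimating:
    # x, h = np.random.randn(1024), np.random.randn(48)
    # ref = np.convolve(x, h)[::4]
    # assert np.allclose(polyphase_decimate(x, h, 4)[:len(ref)], ref)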