Characterizing parallel file-access patterns on a large-scale multiprocessor
NASA Technical Reports Server (NTRS)
Purakayastha, A.; Ellis, Carla; Kotz, David; Nieuwejaar, Nils; Best, Michael L.
1995-01-01
High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.
Bent, John M.; Faibish, Sorin; Grider, Gary
2016-04-19
Cloud object storage is enabled for checkpoints of high performance computing applications using a middleware process. A plurality of files, such as checkpoint files, generated by a plurality of processes in a parallel computing system are stored by obtaining said plurality of files from said parallel computing system; converting said plurality of files to objects using a log structured file system middleware process; and providing said objects for storage in a cloud object storage system. The plurality of processes may run, for example, on a plurality of compute nodes. The log structured file system middleware process may be embodied, for example, as a Parallel Log-Structured File System (PLFS). The log structured file system middleware process optionally executes on a burst buffer node.
Integration experiences and performance studies of A COTS parallel archive systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Bary
2010-01-01
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf(COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and lessmore » robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address requirements of future archival storage systems.« less
Integration experiments and performance studies of a COTS parallel archive system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Hsing-bung; Scott, Cody; Grider, Gary
2010-06-16
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching andmore » less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, Is, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petafiop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address requirements of future archival storage systems.« less
Bent, John M.; Faibish, Sorin; Grider, Gary
2015-06-30
Cloud object storage is enabled for archived data, such as checkpoints and results, of high performance computing applications using a middleware process. A plurality of archived files, such as checkpoint files and results, generated by a plurality of processes in a parallel computing system are stored by obtaining the plurality of archived files from the parallel computing system; converting the plurality of archived files to objects using a log structured file system middleware process; and providing the objects for storage in a cloud object storage system. The plurality of processes may run, for example, on a plurality of compute nodes. The log structured file system middleware process may be embodied, for example, as a Parallel Log-Structured File System (PLFS). The log structured file system middleware process optionally executes on a burst buffer node.
Performance of the Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
As the input/output (I/O) needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. Initial experiments, reported in this paper, indicate that Galley is capable of providing high-performance 1/O to applications the applications that rely on them. In Section 3 we describe that access data in patterns that have been observed to be common.
pcircle - A Suite of Scalable Parallel File System Tools
DOE Office of Scientific and Technical Information (OSTI.GOV)
WANG, FEIYI
2015-10-01
Most of the software related to file system are written for conventional local file system, they are serialized and can't take advantage of the benefit of a large scale parallel file system. "pcircle" software builds on top of ubiquitous MPI in cluster computing environment and "work-stealing" pattern to provide a scalable, high-performance suite of file system tools. In particular - it implemented parallel data copy and parallel data checksumming, with advanced features such as async progress report, checkpoint and restart, as well as integrity checking.
File-access characteristics of parallel scientific workloads
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David; Purakayastha, Apratim; Best, Michael; Ellis, Carla Schlatter
1995-01-01
Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of parallel file systems. The design of a high-performance parallel file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of parallel file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide parallel file-system design.
The Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/0 requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.
High-Performance, Multi-Node File Copies and Checksums for Clustered File Systems
NASA Technical Reports Server (NTRS)
Kolano, Paul Z.; Ciotti, Robert B.
2012-01-01
Modern parallel file systems achieve high performance using a variety of techniques, such as striping files across multiple disks to increase aggregate I/O bandwidth and spreading disks across multiple servers to increase aggregate interconnect bandwidth. To achieve peak performance from such systems, it is typically necessary to utilize multiple concurrent readers/writers from multiple systems to overcome various singlesystem limitations, such as number of processors and network bandwidth. The standard cp and md5sum tools of GNU coreutils found on every modern Unix/Linux system, however, utilize a single execution thread on a single CPU core of a single system, and hence cannot take full advantage of the increased performance of clustered file systems. Mcp and msum are drop-in replacements for the standard cp and md5sum programs that utilize multiple types of parallelism and other optimizations to achieve maximum copy and checksum performance on clustered file systems. Multi-threading is used to ensure that nodes are kept as busy as possible. Read/write parallelism allows individual operations of a single copy to be overlapped using asynchronous I/O. Multinode cooperation allows different nodes to take part in the same copy/checksum. Split-file processing allows multiple threads to operate concurrently on the same file. Finally, hash trees allow inherently serial checksums to be performed in parallel. Mcp and msum provide significant performance improvements over standard cp and md5sum using multiple types of parallelism and other optimizations. The total speed-ups from all improvements are significant. Mcp improves cp performance over 27x, msum improves md5sum performance almost 19x, and the combination of mcp and msum improves verified copies via cp and md5sum by almost 22x. These improvements come in the form of drop-in replacements for cp and md5sum, so are easily used and are available for download as open source software at http://mutil.sourceforge.net.
Characterizing parallel file-access patterns on a large-scale multiprocessor
NASA Technical Reports Server (NTRS)
Purakayastha, Apratim; Ellis, Carla Schlatter; Kotz, David; Nieuwejaar, Nils; Best, Michael
1994-01-01
Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors. Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines. The results of our trace analysis lead to recommendations for parallel file system design. First the file system should support efficient concurrent access to many files, and I/O requests from many jobs under varying load conditions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.
Flexibility and Performance of Parallel File Systems
NASA Technical Reports Server (NTRS)
Kotz, David; Nieuwejaar, Nils
1996-01-01
As we gain experience with parallel file systems, it becomes increasingly clear that a single solution does not suit all applications. For example, it appears to be impossible to find a single appropriate interface, caching policy, file structure, or disk-management strategy. Furthermore, the proliferation of file-system interfaces and abstractions make applications difficult to port. We propose that the traditional functionality of parallel file systems be separated into two components: a fixed core that is standard on all platforms, encapsulating only primitive abstractions and interfaces, and a set of high-level libraries to provide a variety of abstractions and application-programmer interfaces (API's). We present our current and next-generation file systems as examples of this structure. Their features, such as a three-dimensional file structure, strided read and write interfaces, and I/O-node programs, are specifically designed with the flexibility and performance necessary to support a wide range of applications.
Adding Data Management Services to Parallel File Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brandt, Scott
2015-03-04
The objective of this project, called DAMASC for “Data Management in Scientific Computing”, is to coalesce data management with parallel file system management to present a declarative interface to scientists for managing, querying, and analyzing extremely large data sets efficiently and predictably. Managing extremely large data sets is a key challenge of exascale computing. The overhead, energy, and cost of moving massive volumes of data demand designs where computation is close to storage. In current architectures, compute/analysis clusters access data in a physically separate parallel file system and largely leave it scientist to reduce data movement. Over the past decadesmore » the high-end computing community has adopted middleware with multiple layers of abstractions and specialized file formats such as NetCDF-4 and HDF5. These abstractions provide a limited set of high-level data processing functions, but have inherent functionality and performance limitations: middleware that provides access to the highly structured contents of scientific data files stored in the (unstructured) file systems can only optimize to the extent that file system interfaces permit; the highly structured formats of these files often impedes native file system performance optimizations. We are developing Damasc, an enhanced high-performance file system with native rich data management services. Damasc will enable efficient queries and updates over files stored in their native byte-stream format while retaining the inherent performance of file system data storage via declarative queries and updates over views of underlying files. Damasc has four key benefits for the development of data-intensive scientific code: (1) applications can use important data-management services, such as declarative queries, views, and provenance tracking, that are currently available only within database systems; (2) the use of these services becomes easier, as they are provided within a familiar file-based ecosystem; (3) common optimizations, e.g., indexing and caching, are readily supported across several file formats, avoiding effort duplication; and (4) performance improves significantly, as data processing is integrated more tightly with data storage. Our key contributions are: SciHadoop which explores changes to MapReduce assumption by taking advantage of semantics of structured data while preserving MapReduce’s failure and resource management; DataMods which extends common abstractions of parallel file systems so they become programmable such that they can be extended to natively support a variety of data models and can be hooked into emerging distributed runtimes such as Stanford’s Legion; and Miso which combines Hadoop and relational data warehousing to minimize time to insight, taking into account the overhead of ingesting data into data warehousing.« less
High performance data transfer
NASA Astrophysics Data System (ADS)
Cottrell, R.; Fang, C.; Hanushevsky, A.; Kreuger, W.; Yang, W.
2017-10-01
The exponentially increasing need for high speed data transfer is driven by big data, and cloud computing together with the needs of data intensive science, High Performance Computing (HPC), defense, the oil and gas industry etc. We report on the Zettar ZX software. This has been developed since 2013 to meet these growing needs by providing high performance data transfer and encryption in a scalable, balanced, easy to deploy and use way while minimizing power and space utilization. In collaboration with several commercial vendors, Proofs of Concept (PoC) consisting of clusters have been put together using off-the- shelf components to test the ZX scalability and ability to balance services using multiple cores, and links. The PoCs are based on SSD flash storage that is managed by a parallel file system. Each cluster occupies 4 rack units. Using the PoCs, between clusters we have achieved almost 200Gbps memory to memory over two 100Gbps links, and 70Gbps parallel file to parallel file with encryption over a 5000 mile 100Gbps link.
File concepts for parallel I/O
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1989-01-01
The subject of input/output (I/O) was often neglected in the design of parallel computer systems, although for many problems I/O rates will limit the speedup attainable. The I/O problem is addressed by considering the role of files in parallel systems. The notion of parallel files is introduced. Parallel files provide for concurrent access by multiple processes, and utilize parallelism in the I/O system to improve performance. Parallel files can also be used conventionally by sequential programs. A set of standard parallel file organizations is proposed, organizations are suggested, using multiple storage devices. Problem areas are also identified and discussed.
Dynamic file-access characteristics of a production parallel scientific workload
NASA Technical Reports Server (NTRS)
Kotz, David; Nieuwejaar, Nils
1994-01-01
Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data in parallel to hundreds or thousands of processors. Most successful systems are based on a solid understanding of the expected workload, but thus far there have been no comprehensive workload characterizations of multiprocessor file systems. This paper presents the results of a three week tracing study in which all file-related activity on a massively parallel computer was recorded. Our instrumentation differs from previous efforts in that it collects information about every I/O request and about the mix of jobs running in a production environment. We also present the results of a trace-driven caching simulation and recommendations for designers of multiprocessor file systems.
Tuning HDF5 subfiling performance on parallel file systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Byna, Suren; Chaarawi, Mohamad; Koziol, Quincey
Subfiling is a technique used on parallel file systems to reduce locking and contention issues when multiple compute nodes interact with the same storage target node. Subfiling provides a compromise between the single shared file approach that instigates the lock contention problems on parallel file systems and having one file per process, which results in generating a massive and unmanageable number of files. In this paper, we evaluate and tune the performance of recently implemented subfiling feature in HDF5. In specific, we explain the implementation strategy of subfiling feature in HDF5, provide examples of using the feature, and evaluate andmore » tune parallel I/O performance of this feature with parallel file systems of the Cray XC40 system at NERSC (Cori) that include a burst buffer storage and a Lustre disk-based storage. We also evaluate I/O performance on the Cray XC30 system, Edison, at NERSC. Our results show performance benefits of 1.2X to 6X performance advantage with subfiling compared to writing a single shared HDF5 file. We present our exploration of configurations, such as the number of subfiles and the number of Lustre storage targets to storing files, as optimization parameters to obtain superior I/O performance. Based on this exploration, we discuss recommendations for achieving good I/O performance as well as limitations with using the subfiling feature.« less
''Towards a High-Performance and Robust Implementation of MPI-IO on Top of GPFS''
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prost, J.P.; Tremann, R.; Blackwore, R.
2000-01-11
MPI-IO/GPFS is a prototype implementation of the I/O chapter of the Message Passing Interface (MPI) 2 standard. It uses the IBM General Parallel File System (GPFS), with prototyped extensions, as the underlying file system. this paper describes the features of this prototype which support its high performance and robustness. The use of hints at the file system level and at the MPI-IO level allows tailoring the use of the file system to the application needs. Error handling in collective operations provides robust error reporting and deadlock prevention in case of returning errors.
A Next-Generation Parallel File System Environment for the OLCF
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillow, David A; Fuller, Douglas; Gunasekaran, Raghul
2012-01-01
When deployed in 2008/2009 the Spider system at the Oak Ridge National Laboratory s Leadership Computing Facility (OLCF) was the world s largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF s diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Designed to support the parallel I/O requirements of the Jaguar XT5 system and other smallerscale platforms at the OLCF, the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider s originalmore » design by mid 2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth from over 240GB=sec today to 1TB=sec and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.« less
KNBD: A Remote Kernel Block Server for Linux
NASA Technical Reports Server (NTRS)
Becker, Jeff
1999-01-01
I am developing a prototype of a Linux remote disk block server whose purpose is to serve as a lower level component of a parallel file system. Parallel file systems are an important component of high performance supercomputers and clusters. Although supercomputer vendors such as SGI and IBM have their own custom solutions, there has been a void and hence a demand for such a system on Beowulf-type PC Clusters. Recently, the Parallel Virtual File System (PVFS) project at Clemson University has begun to address this need (1). Although their system provides much of the functionality of (and indeed was inspired by) the equivalent file systems in the commercial supercomputer market, their system is all in user-space. Migrating their 10 services to the kernel could provide a performance boost, by obviating the need for expensive system calls. Thanks to Pavel Machek, the Linux kernel has provided the network block device (2) with kernels 2.1.101 and later. You can configure this block device to redirect reads and writes to a remote machine's disk. This can be used as a building block for constructing a striped file system across several nodes.
Cooperative storage of shared files in a parallel computing system with dynamic block size
Bent, John M.; Faibish, Sorin; Grider, Gary
2015-11-10
Improved techniques are provided for parallel writing of data to a shared object in a parallel computing system. A method is provided for storing data generated by a plurality of parallel processes to a shared object in a parallel computing system. The method is performed by at least one of the processes and comprises: dynamically determining a block size for storing the data; exchanging a determined amount of the data with at least one additional process to achieve a block of the data having the dynamically determined block size; and writing the block of the data having the dynamically determined block size to a file system. The determined block size comprises, e.g., a total amount of the data to be stored divided by the number of parallel processes. The file system comprises, for example, a log structured virtual parallel file system, such as a Parallel Log-Structured File System (PLFS).
A Lightweight, High-performance I/O Management Package for Data-intensive Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jun Wang
2007-07-17
File storage systems are playing an increasingly important role in high-performance computing as the performance gap between CPU and disk increases. It could take a long time to develop an entire system from scratch. Solutions will have to be built as extensions to existing systems. If new portable, customized software components are plugged into these systems, better sustained high I/O performance and higher scalability will be achieved, and the development cycle of next-generation of parallel file systems will be shortened. The overall research objective of this ECPI development plan aims to develop a lightweight, customized, high-performance I/O management package namedmore » LightI/O to extend and leverage current parallel file systems used by DOE. During this period, We have developed a novel component in LightI/O and prototype them into PVFS2, and evaluate the resultant prototype—extended PVFS2 system on data-intensive applications. The preliminary results indicate the extended PVFS2 delivers better performance and reliability to users. A strong collaborative effort between the PI at the University of Nebraska Lincoln and the DOE collaborators—Drs Rob Ross and Rajeev Thakur at Argonne National Laboratory who are leading the PVFS2 group makes the project more promising.« less
Experimental Analysis of File Transfer Rates over Wide-Area Dedicated Connections
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rao, Nageswara S.; Liu, Qiang; Sen, Satyabrata
2016-12-01
File transfers over dedicated connections, supported by large parallel file systems, have become increasingly important in high-performance computing and big data workflows. It remains a challenge to achieve peak rates for such transfers due to the complexities of file I/O, host, and network transport subsystems, and equally importantly, their interactions. We present extensive measurements of disk-to-disk file transfers using Lustre and XFS file systems mounted on multi-core servers over a suite of 10 Gbps emulated connections with 0-366 ms round trip times. Our results indicate that large buffer sizes and many parallel flows do not always guarantee high transfer rates.more » Furthermore, large variations in the measured rates necessitate repeated measurements to ensure confidence in inferences based on them. We propose a new method to efficiently identify the optimal joint file I/O and network transport parameters using a small number of measurements. We show that for XFS and Lustre with direct I/O, this method identifies configurations achieving 97% of the peak transfer rate while probing only 12% of the parameter space.« less
RAMA: A file system for massively parallel computers
NASA Technical Reports Server (NTRS)
Miller, Ethan L.; Katz, Randy H.
1993-01-01
This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
Tuning HDF5 for Lustre File Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Howison, Mark; Koziol, Quincey; Knaak, David
2010-09-24
HDF5 is a cross-platform parallel I/O library that is used by a wide variety of HPC applications for the flexibility of its hierarchical object-database representation of scientific data. We describe our recent work to optimize the performance of the HDF5 and MPI-IO libraries for the Lustre parallel file system. We selected three different HPC applications to represent the diverse range of I/O requirements, and measured their performance on three different systems to demonstrate the robustness of our optimizations across different file system configurations and to validate our optimization strategy. We demonstrate that the combined optimizations improve HDF5 parallel I/O performancemore » by up to 33 times in some cases running close to the achievable peak performance of the underlying file system and demonstrate scalable performance up to 40,960-way concurrency.« less
Lessons Learned in Deploying the World s Largest Scale Lustre File System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dillow, David A; Fuller, Douglas; Wang, Feiyi
2010-01-01
The Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) is the world's largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF's diverse computational platforms, the aggregate performance and storage capacity of Spider exceed that of our previously deployed systems by a factor of 6x - 240 GB/sec, and 17x - 10 Petabytes, respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing themore » file system, which exceeds our previously deployed systems by nearly 4x. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present in this paper our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenges, such as stressed metadata performance and the need for file system quality of service alongside with our efforts to address them. Finally, operational aspects of managing a system of this scale are discussed along with real-world data and observations.« less
Parallel sort with a ranged, partitioned key-value store in a high perfomance computing environment
Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron; Poole, Stephen W.
2016-01-26
Improved sorting techniques are provided that perform a parallel sort using a ranged, partitioned key-value store in a high performance computing (HPC) environment. A plurality of input data files comprising unsorted key-value data in a partitioned key-value store are sorted. The partitioned key-value store comprises a range server for each of a plurality of ranges. Each input data file has an associated reader thread. Each reader thread reads the unsorted key-value data in the corresponding input data file and performs a local sort of the unsorted key-value data to generate sorted key-value data. A plurality of sorted, ranged subsets of each of the sorted key-value data are generated based on the plurality of ranges. Each sorted, ranged subset corresponds to a given one of the ranges and is provided to one of the range servers corresponding to the range of the sorted, ranged subset. Each range server sorts the received sorted, ranged subsets and provides a sorted range. A plurality of the sorted ranges are concatenated to obtain a globally sorted result.
Sun, Xiaobo; Gao, Jingjing; Jin, Peng; Eng, Celeste; Burchard, Esteban G; Beaty, Terri H; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen; Wang, Fusheng; Qin, Zhaohui S
2018-06-01
Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
Gao, Jingjing; Jin, Peng; Eng, Celeste; Burchard, Esteban G; Beaty, Terri H; Ruczinski, Ingo; Mathias, Rasika A; Barnes, Kathleen; Wang, Fusheng
2018-01-01
Abstract Background Sorted merging of genomic data is a common data operation necessary in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine based methods become increasingly inefficient when processing large numbers of files due to the excessive computation time and Input/Output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. Findings In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrating examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, which are benchmarked with the traditional single/parallel multiway-merge methods, message passing interface (MPI)–based high-performance computing (HPC) implementation, and the popular VCFTools. Conclusions Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalabilities over traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems. PMID:29762754
DOVIS 2.0: an efficient and easy to use parallel virtual screening tool based on AutoDock 4.0.
Jiang, Xiaohui; Kumar, Kamal; Hu, Xin; Wallqvist, Anders; Reifman, Jaques
2008-09-08
Small-molecule docking is an important tool in studying receptor-ligand interactions and in identifying potential drug candidates. Previously, we developed a software tool (DOVIS) to perform large-scale virtual screening of small molecules in parallel on Linux clusters, using AutoDock 3.05 as the docking engine. DOVIS enables the seamless screening of millions of compounds on high-performance computing platforms. In this paper, we report significant advances in the software implementation of DOVIS 2.0, including enhanced screening capability, improved file system efficiency, and extended usability. To keep DOVIS up-to-date, we upgraded the software's docking engine to the more accurate AutoDock 4.0 code. We developed a new parallelization scheme to improve runtime efficiency and modified the AutoDock code to reduce excessive file operations during large-scale virtual screening jobs. We also implemented an algorithm to output docked ligands in an industry standard format, sd-file format, which can be easily interfaced with other modeling programs. Finally, we constructed a wrapper-script interface to enable automatic rescoring of docked ligands by arbitrarily selected third-party scoring programs. The significance of the new DOVIS 2.0 software compared with the previous version lies in its improved performance and usability. The new version makes the computation highly efficient by automating load balancing, significantly reducing excessive file operations by more than 95%, providing outputs that conform to industry standard sd-file format, and providing a general wrapper-script interface for rescoring of docked ligands. The new DOVIS 2.0 package is freely available to the public under the GNU General Public License.
Network issues for large mass storage requirements
NASA Technical Reports Server (NTRS)
Perdue, James
1992-01-01
File Servers and Supercomputing environments need high performance networks to balance the I/O requirements seen in today's demanding computing scenarios. UltraNet is one solution which permits both high aggregate transfer rates and high task-to-task transfer rates as demonstrated in actual tests. UltraNet provides this capability as both a Server-to-Server and Server-to-Client access network giving the supercomputing center the following advantages highest performance Transport Level connections (to 40 MBytes/sec effective rates); matches the throughput of the emerging high performance disk technologies, such as RAID, parallel head transfer devices and software striping; supports standard network and file system applications using SOCKET's based application program interface such as FTP, rcp, rdump, etc.; supports access to the Network File System (NFS) and LARGE aggregate bandwidth for large NFS usage; provides access to a distributed, hierarchical data server capability using DISCOS UniTree product; supports file server solutions available from multiple vendors, including Cray, Convex, Alliant, FPS, IBM, and others.
NASA Astrophysics Data System (ADS)
Yamamoto, K.; Murata, K.; Kimura, E.; Honda, R.
2006-12-01
In the Solar-Terrestrial Physics (STP) field, the amount of satellite observation data has been increasing every year. It is necessary to solve the following three problems to achieve large-scale statistical analyses of plenty of data. (i) More CPU power and larger memory and disk size are required. However, total powers of personal computers are not enough to analyze such amount of data. Super-computers provide a high performance CPU and rich memory area, but they are usually separated from the Internet or connected only for the purpose of programming or data file transfer. (ii) Most of the observation data files are managed at distributed data sites over the Internet. Users have to know where the data files are located. (iii) Since no common data format in the STP field is available now, users have to prepare reading program for each data by themselves. To overcome the problems (i) and (ii), we constructed a parallel and distributed data analysis environment based on the Gfarm reference implementation of the Grid Datafarm architecture. The Gfarm shares both computational resources and perform parallel distributed processings. In addition, the Gfarm provides the Gfarm filesystem which can be as virtual directory tree among nodes. The Gfarm environment is composed of three parts; a metadata server to manage distributed files information, filesystem nodes to provide computational resources and a client to throw a job into metadata server and manages data processing schedulings. In the present study, both data files and data processes are parallelized on the Gfarm with 6 file system nodes: CPU clock frequency of each node is Pentium V 1GHz, 256MB memory and40GB disk. To evaluate performances of the present Gfarm system, we scanned plenty of data files, the size of which is about 300MB for each, in three processing methods: sequential processing in one node, sequential processing by each node and parallel processing by each node. As a result, in comparison between the number of files and the elapsed time, parallel and distributed processing shorten the elapsed time to 1/5 than sequential processing. On the other hand, sequential processing times were shortened in another experiment, whose file size is smaller than 100KB. In this case, the elapsed time to scan one file is within one second. It implies that disk swap took place in case of parallel processing by each node. We note that the operation became unstable when the number of the files exceeded 1000. To overcome the problem (iii), we developed an original data class. This class supports our reading of data files with various data formats since it converts them into an original data format since it defines schemata for every type of data and encapsulates the structure of data files. In addition, since this class provides a function of time re-sampling, users can easily convert multiple data (array) with different time resolution into the same time resolution array. Finally, using the Gfarm, we achieved a high performance environment for large-scale statistical data analyses. It should be noted that the present method is effective only when one data file size is large enough. At present, we are restructuring the new Gfarm environment with 8 nodes: CPU is Athlon 64 x2 Dual Core 2GHz, 2GB memory and 1.2TB disk (using RAID0) for each node. Our original class is to be implemented on the new Gfarm environment. In the present talk, we show the latest results with applying the present system for data analyses with huge number of satellite observation data files.
Rethinking key–value store for parallel I/O optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kougkas, Anthony; Eslami, Hassan; Sun, Xian-He
2015-01-26
Key-value stores are being widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architecture differences and performance characteristics of parallel file systems and key-value stores. We propose using key-value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as the cases with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application.more » We modeled the performance of these two systems using collected data from our experiments, and we provide a predictive method to identify which system offers better I/O performance given a specific workload. The results show that we can optimize the I/O performance in HPC systems by utilizing key-value stores.« less
Data Partitioning and Load Balancing in Parallel Disk Systems
NASA Technical Reports Server (NTRS)
Scheuermann, Peter; Weikum, Gerhard; Zabback, Peter
1997-01-01
Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible waves, namely via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent, self-reliant file system that aims to optimize striping by taking into account the requirements of the applications and performs load balancing by judicious file allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but effective heuristics that incur only little overhead. We present performance experiments based on synthetic workloads and real-life traces.
Characterizing output bottlenecks in a supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xie, Bing; Chase, Jeffrey; Dillow, David A
2012-01-01
Supercomputer I/O loads are often dominated by writes. HPC (High Performance Computing) file systems are designed to absorb these bursty outputs at high bandwidth through massive parallelism. However, the delivered write bandwidth often falls well below the peak. This paper characterizes the data absorption behavior of a center-wide shared Lustre parallel file system on the Jaguar supercomputer. We use a statistical methodology to address the challenges of accurately measuring a shared machine under production load and to obtain the distribution of bandwidth across samples of compute nodes, storage targets, and time intervals. We observe and quantify limitations from competing traffic,more » contention on storage servers and I/O routers, concurrency limitations in the client compute node operating systems, and the impact of variance (stragglers) on coupled output such as striping. We then examine the implications of our results for application performance and the design of I/O middleware systems on shared supercomputers.« less
Log-less metadata management on metadata server for parallel file systems.
Liao, Jianwei; Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally.
Log-Less Metadata Management on Metadata Server for Parallel File Systems
Xiao, Guoqiang; Peng, Xiaoning
2014-01-01
This paper presents a novel metadata management mechanism on the metadata server (MDS) for parallel and distributed file systems. In this technique, the client file system backs up the sent metadata requests, which have been handled by the metadata server, so that the MDS does not need to log metadata changes to nonvolatile storage for achieving highly available metadata service, as well as better performance improvement in metadata processing. As the client file system backs up certain sent metadata requests in its memory, the overhead for handling these backup requests is much smaller than that brought by the metadata server, while it adopts logging or journaling to yield highly available metadata service. The experimental results show that this newly proposed mechanism can significantly improve the speed of metadata processing and render a better I/O data throughput, in contrast to conventional metadata management schemes, that is, logging or journaling on MDS. Besides, a complete metadata recovery can be achieved by replaying the backup logs cached by all involved clients, when the metadata server has crashed or gone into nonoperational state exceptionally. PMID:24892093
Prefetching in file systems for MIMD multiprocessors
NASA Technical Reports Server (NTRS)
Kotz, David F.; Ellis, Carla Schlatter
1990-01-01
The question of whether prefetching blocks on the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions, is considered. Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that (1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, (2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O (input/output) operation, and (3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). The authors explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in the environment.
Purple L1 Milestone Review Panel GPFS Functionality and Performance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Loewe, W E
2006-12-01
The GPFS deliverable for the Purple system requires the functionality and performance necessary for ASC I/O needs. The functionality includes POSIX and MPIIO compatibility, and multi-TB file capability across the entire machine. The bandwidth performance required is 122.15 GB/s, as necessary for productive and defensive I/O requirements, and the metadata performance requirement is 5,000 file stats per second. To determine success for this deliverable, several tools are employed. For functionality testing of POSIX, 10TB-files, and high-node-count capability, the parallel file system bandwidth performance test IOR is used. IOR is an MPI-coordinated application that can write and then read to amore » single shared file or to an individual file per process and check the data integrity of the file(s). The MPIIO functionality is tested with the MPIIO test suite from the MPICH library. Bandwidth performance is tested using IOR for the required 122.15 GB/s sustained write. All IOR tests are performanced with data checking enabled. Metadata performance is tested after ''aging'' the file system with 80% data block usage and 20% inode usage. The fdtree metadata test is expected to create/remove a large directory/file structure in under 20 minutes time, akin to interactive metadata usage. Multiple (10) instances of ''ls -lR'', each performing over 100K stats, are run concurrently in different large directories to demonstrate 5,000 stats/sec.« less
Experiments and Analyses of Data Transfers Over Wide-Area Dedicated Connections
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rao, Nageswara S.; Liu, Qiang; Sen, Satyabrata
Dedicated wide-area network connections are increasingly employed in high-performance computing and big data scenarios. One might expect the performance and dynamics of data transfers over such connections to be easy to analyze due to the lack of competing traffic. However, non-linear transport dynamics and end-system complexities (e.g., multi-core hosts and distributed filesystems) can in fact make analysis surprisingly challenging. We present extensive measurements of memory-to-memory and disk-to-disk file transfers over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory-to-memory transfers, profiles of both TCP and UDT throughput as a function of RTT show concavemore » and convex regions; large buffer sizes and more parallel flows lead to wider concave regions, which are highly desirable. TCP and UDT both also display complex throughput dynamics, as indicated by their Poincare maps and Lyapunov exponents. For disk-to-disk transfers, we determine that high throughput can be achieved via a combination of parallel I/O threads, parallel network threads, and direct I/O mode. Our measurements also show that Lustre filesystems can be mounted over long-haul connections using LNet routers, although challenges remain in jointly optimizing file I/O and transport method parameters to achieve peak throughput.« less
Parallel File System I/O Performance Testing On LANL Clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wiens, Isaac Christian; Green, Jennifer Kathleen
2016-08-18
These are slides from a presentation on parallel file system I/O performance testing on LANL clusters. I/O is a known bottleneck for HPC applications. Performance optimization of I/O is often required. This summer project entailed integrating IOR under Pavilion and automating the results analysis. The slides cover the following topics: scope of the work, tools utilized, IOR-Pavilion test workflow, build script, IOR parameters, how parameters are passed to IOR, *run_ior: functionality, Python IOR-Output Parser, Splunk data format, Splunk dashboard and features, and future work.
Optimizing Input/Output Using Adaptive File System Policies
NASA Technical Reports Server (NTRS)
Madhyastha, Tara M.; Elford, Christopher L.; Reed, Daniel A.
1996-01-01
Parallel input/output characterization studies and experiments with flexible resource management algorithms indicate that adaptivity is crucial to file system performance. In this paper we propose an automatic technique for selecting and refining file system policies based on application access patterns and execution environment. An automatic classification framework allows the file system to select appropriate caching and pre-fetching policies, while performance sensors provide feedback used to tune policy parameters for specific system environments. To illustrate the potential performance improvements possible using adaptive file system policies, we present results from experiments involving classification-based and performance-based steering.
An Approach Using Parallel Architecture to Storage DICOM Images in Distributed File System
NASA Astrophysics Data System (ADS)
Soares, Tiago S.; Prado, Thiago C.; Dantas, M. A. R.; de Macedo, Douglas D. J.; Bauer, Michael A.
2012-02-01
Telemedicine is a very important area in medical field that is expanding daily motivated by many researchers interested in improving medical applications. In Brazil was started in 2005, in the State of Santa Catarina has a developed server called the CyclopsDCMServer, which the purpose to embrace the HDF for the manipulation of medical images (DICOM) using a distributed file system. Since then, many researches were initiated in order to seek better performance. Our approach for this server represents an additional parallel implementation in I/O operations since HDF version 5 has an essential feature for our work which supports parallel I/O, based upon the MPI paradigm. Early experiments using four parallel nodes, provide good performance when compare to the serial HDF implemented in the CyclopsDCMServer.
On Data Transfers Over Wide-Area Dedicated Connections
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rao, Nageswara S.; Liu, Qiang
Dedicated wide-area network connections are employed in big data and high-performance computing scenarios, since the absence of cross-traffic promises to make it easier to analyze and optimize data transfers over them. However, nonlinear transport dynamics and end-system complexity due to multi-core hosts and distributed file systems make these tasks surprisingly challenging. We present an overview of methods to analyze memory and disk file transfers using extensive measurements over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory transfers, we derive performance profiles of TCP and UDT throughput as a function of RTT, which showmore » concave regions in contrast to entirely convex regions predicted by previous models. These highly desirable concave regions can be expanded by utilizing large buffers and more parallel flows. We also present Poincar´e maps and Lyapunov exponents of TCP and UDT throughputtraces that indicate complex throughput dynamics. For disk file transfers, we show that throughput can be optimized using a combination of parallel I/O and network threads under direct I/O mode. Our initial throughput measurements of Lustre filesystems mounted over long-haul connections using LNet routers show convex profiles indicative of I/O limits.« less
The Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. The interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.
An approach to enhance pnetCDF performance in environmental modeling applications
Data intensive simulations are often limited by their I/O (input/output) performance, and "novel" techniques need to be developed in order to overcome this limitation. The software package pnetCDF (parallel network Common Data Form), which works with parallel file syste...
Understanding and Improving High-Performance I/O Subsystems
NASA Technical Reports Server (NTRS)
El-Ghazawi, Tarek A.; Frieder, Gideon; Clark, A. James
1996-01-01
This research program has been conducted in the framework of the NASA Earth and Space Science (ESS) evaluations led by Dr. Thomas Sterling. In addition to the many important research findings for NASA and the prestigious publications, the program has helped orienting the doctoral research program of two students towards parallel input/output in high-performance computing. Further, the experimental results in the case of the MasPar were very useful and helpful to MasPar with which the P.I. has had many interactions with the technical management. The contributions of this program are drawn from three experimental studies conducted on different high-performance computing testbeds/platforms, and therefore presented in 3 different segments as follows: 1. Evaluating the parallel input/output subsystem of a NASA high-performance computing testbeds, namely the MasPar MP- 1 and MP-2; 2. Characterizing the physical input/output request patterns for NASA ESS applications, which used the Beowulf platform; and 3. Dynamic scheduling techniques for hiding I/O latency in parallel applications such as sparse matrix computations. This study also has been conducted on the Intel Paragon and has also provided an experimental evaluation for the Parallel File System (PFS) and parallel input/output on the Paragon. This report is organized as follows. The summary of findings discusses the results of each of the aforementioned 3 studies. Three appendices, each containing a key scholarly research paper that details the work in one of the studies are included.
A model for optimizing file access patterns using spatio-temporal parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boonthanome, Nouanesengsy; Patchett, John; Geveci, Berk
2013-01-01
For many years now, I/O read time has been recognized as the primary bottleneck for parallel visualization and analysis of large-scale data. In this paper, we introduce a model that can estimate the read time for a file stored in a parallel filesystem when given the file access pattern. Read times ultimately depend on how the file is stored and the access pattern used to read the file. The file access pattern will be dictated by the type of parallel decomposition used. We employ spatio-temporal parallelism, which combines both spatial and temporal parallelism, to provide greater flexibility to possible filemore » access patterns. Using our model, we were able to configure the spatio-temporal parallelism to design optimized read access patterns that resulted in a speedup factor of approximately 400 over traditional file access patterns.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
SmartImport.py is a Python source-code file that implements a replacement for the standard Python module importer. The code is derived from knee.py, a file in the standard Python diestribution , and adds functionality to improve the performance of Python module imports in massively parallel contexts.
An Optimizing Compiler for Petascale I/O on Leadership Class Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhary, Alok; Kandemir, Mahmut
In high-performance computing systems, parallel I/O architectures usually have very complex hierarchies with multiple layers that collectively constitute an I/O stack, including high-level I/O libraries such as PnetCDF and HDF5, I/O middleware such as MPI-IO, and parallel file systems such as PVFS and Lustre. Our project explored automated instrumentation and compiler support for I/O intensive applications. Our project made significant progress towards understanding the complex I/O hierarchies of high-performance storage systems (including storage caches, HDDs, and SSDs), and designing and implementing state-of-the-art compiler/runtime system technology that targets I/O intensive HPC applications that target leadership class machine. This final report summarizesmore » the major achievements of the project and also points out promising future directions.« less
Samadian, Soroush; Bruce, Jeff P; Pugh, Trevor J
2018-03-01
Somatic copy number variations (CNVs) play a crucial role in development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms to computationally infer CNV profiles from a variety of data types including exome and targeted sequence data; currently the most prevalent types of cancer genomics data. However, systemic evaluation and comparison of these tools remains challenging due to a lack of ground truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python to introduce user-defined haplotype-phased allele-specific copy number events into an existing Binary Alignment Mapping (BAM) file, with a focus on targeted and exome sequencing experiments. As input, this tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates for introduction of gains and losses (bed file), and an optional file defining known haplotypes (vcf format). To improve runtime performance, Bamgineer introduces the desired CNVs in parallel using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof-of-principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20-100%, 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for development and systematic benchmarking of CNV calling algorithms by users using locally-generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.
Template based parallel checkpointing in a massively parallel computer system
Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN
2009-01-13
A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.
Small file aggregation in a parallel computing system
Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Zhang, Jingwang
2014-09-02
Techniques are provided for small file aggregation in a parallel computing system. An exemplary method for storing a plurality of files generated by a plurality of processes in a parallel computing system comprises aggregating the plurality of files into a single aggregated file; and generating metadata for the single aggregated file. The metadata comprises an offset and a length of each of the plurality of files in the single aggregated file. The metadata can be used to unpack one or more of the files from the single aggregated file.
Faibish, Sorin; Bent, John M.; Tzelnic, Percy; Grider, Gary; Torres, Aaron
2015-10-20
Techniques are provided for storing files in a parallel computing system using different resolutions. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a sub-file. The method comprises the steps of obtaining semantic information related to the file; generating a plurality of replicas of the file with different resolutions based on the semantic information; and storing the file and the plurality of replicas of the file in one or more storage nodes of the parallel computing system. The different resolutions comprise, for example, a variable number of bits and/or a different sub-set of data elements from the file. A plurality of the sub-files can be merged to reproduce the file.
Measurements of file transfer rates over dedicated long-haul connections
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rao, Nageswara S; Settlemyer, Bradley W; Imam, Neena
2016-01-01
Wide-area file transfers are an integral part of several High-Performance Computing (HPC) scenarios. Dedicated network connections with high capacity, low loss rate and low competing traffic, are increasingly being provisioned over current HPC infrastructures to support such transfers. To gain insights into these file transfers, we collected transfer rate measurements for Lustre and xfs file systems between dedicated multi-core servers over emulated 10 Gbps connections with round trip times (rtt) in 0-366 ms range. Memory transfer throughput over these connections is measured using iperf, and file IO throughput on host systems is measured using xddprof. We consider two file systemmore » configurations: Lustre over IB network and xfs over SSD connected to PCI bus. Files are transferred using xdd across these connections, and the transfer rates are measured, which indicate the need to jointly optimize the connection and host file IO parameters to achieve peak transfer rates. In particular, these measurements indicate that (i) peak file transfer rate is lower than peak connection and host IO throughput, in some cases by as much as 50% or lower, (ii) xdd request sizes that achieve peak throughput for host file IO do not necessarily lead to peak file transfer rates, and (iii) parallelism in host IO and TCP transport does not always improve the file transfer rates.« less
Storing files in a parallel computing system based on user-specified parser function
Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Manzanares, Adam; Torres, Aaron
2014-10-21
Techniques are provided for storing files in a parallel computing system based on a user-specified parser function. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a parser from the distributed application for processing the plurality of files prior to storage; and storing one or more of the plurality of files in one or more storage nodes of the parallel computing system based on the processing by the parser. The plurality of files comprise one or more of a plurality of complete files and a plurality of sub-files. The parser can optionally store only those files that satisfy one or more semantic requirements of the parser. The parser can also extract metadata from one or more of the files and the extracted metadata can be stored with one or more of the plurality of files and used for searching for files.
Faibish, Sorin; Bent, John M; Tzelnic, Percy; Grider, Gary; Torres, Aaron
2015-02-03
Techniques are provided for storing files in a parallel computing system using sub-files with semantically meaningful boundaries. A method is provided for storing at least one file generated by a distributed application in a parallel computing system. The file comprises one or more of a complete file and a plurality of sub-files. The method comprises the steps of obtaining a user specification of semantic information related to the file; providing the semantic information as a data structure description to a data formatting library write function; and storing the semantic information related to the file with one or more of the sub-files in one or more storage nodes of the parallel computing system. The semantic information provides a description of data in the file. The sub-files can be replicated based on semantically meaningful boundaries.
cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing.
Takeuchi, Toshiki; Yamada, Atsuo; Aoki, Takashi; Nishimura, Kunihiro
2016-01-01
Next-generation sequencing can determine DNA bases and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format and the compressed binary version (BAM) of it. SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and can execute fast. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. For the accumulation of next-generation sequencing data, a simple parallelization program, which can support cloud and PC cluster environments, is required. We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
ATLAS software configuration and build tool optimisation
NASA Astrophysics Data System (ADS)
Rybkin, Grigory; Atlas Collaboration
2014-06-01
ATLAS software code base is over 6 million lines organised in about 2000 packages. It makes use of some 100 external software packages, is developed by more than 400 developers and used by more than 2500 physicists from over 200 universities and laboratories in 6 continents. To meet the challenge of configuration and building of this software, the Configuration Management Tool (CMT) is used. CMT expects each package to describe its build targets, build and environment setup parameters, dependencies on other packages in a text file called requirements, and each project (group of packages) to describe its policies and dependencies on other projects in a text project file. Based on the effective set of configuration parameters read from the requirements files of dependent packages and project files, CMT commands build the packages, generate the environment for their use, or query the packages. The main focus was on build time performance that was optimised within several approaches: reduction of the number of reads of requirements files that are now read once per package by a CMT build command that generates cached requirements files for subsequent CMT build commands; introduction of more fine-grained build parallelism at package task level, i.e., dependent applications and libraries are compiled in parallel; code optimisation of CMT commands used for build; introduction of package level build parallelism, i. e., parallelise the build of independent packages. By default, CMT launches NUMBER-OF-PROCESSORS build commands in parallel. The other focus was on CMT commands optimisation in general that made them approximately 2 times faster. CMT can generate a cached requirements file for the environment setup command, which is especially useful for deployment on distributed file systems like AFS or CERN VMFS. The use of parallelism, caching and code optimisation significantly-by several times-reduced software build time, environment setup time, increased the efficiency of multi-core computing resources utilisation, and considerably improved software developer and user experience.
An Optimizing Compiler for Petascale I/O on Leadership-Class Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kandemir, Mahmut Taylan; Choudary, Alok; Thakur, Rajeev
In high-performance computing (HPC), parallel I/O architectures usually have very complex hierarchies with multiple layers that collectively constitute an I/O stack, including high-level I/O libraries such as PnetCDF and HDF5, I/O middleware such as MPI-IO, and parallel file systems such as PVFS and Lustre. Our DOE project explored automated instrumentation and compiler support for I/O intensive applications. Our project made significant progress towards understanding the complex I/O hierarchies of high-performance storage systems (including storage caches, HDDs, and SSDs), and designing and implementing state-of-the-art compiler/runtime system technology that targets I/O intensive HPC applications that target leadership class machine. This final reportmore » summarizes the major achievements of the project and also points out promising future directions Two new sections in this report compared to the previous report are IOGenie and SSD/NVM-specific optimizations.« less
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware
Zheng, Da; Burns, Randal; Szalay, Alexander S.
2013-01-01
We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads. PMID:24402052
Toward Millions of File System IOPS on Low-Cost, Commodity Hardware.
Zheng, Da; Burns, Randal; Szalay, Alexander S
2013-01-01
We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads.
Processing large remote sensing image data sets on Beowulf clusters
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Schmidt, Gail
2003-01-01
High-performance computing is often concerned with the speed at which floating- point calculations can be performed. The architectures of many parallel computers and/or their network topologies are based on these investigations. Often, benchmarks resulting from these investigations are compiled with little regard to how a large dataset would move about in these systems. This part of the Beowulf study addresses that concern by looking at specific applications software and system-level modifications. Applications include an implementation of a smoothing filter for time-series data, a parallel implementation of the decision tree algorithm used in the Landcover Characterization project, a parallel Kriging algorithm used to fit point data collected in the field on invasive species to a regular grid, and modifications to the Beowulf project's resampling algorithm to handle larger, higher resolution datasets at a national scale. Systems-level investigations include a feasibility study on Flat Neighborhood Networks and modifications of that concept with Parallel File Systems.
Livermore Big Artificial Neural Network Toolkit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Essen, Brian Van; Jacobs, Sam; Kim, Hyojin
2016-07-01
LBANN is a toolkit that is designed to train artificial neural networks efficiently on high performance computing architectures. It is optimized to take advantages of key High Performance Computing features to accelerate neural network training. Specifically it is optimized for low-latency, high bandwidth interconnects, node-local NVRAM, node-local GPU accelerators, and high bandwidth parallel file systems. It is built on top of the open source Elemental distributed-memory dense and spars-direct linear algebra and optimization library that is released under the BSD license. The algorithms contained within LBANN are drawn from the academic literature and implemented to work within a distributed-memory framework.
A Massively Parallel Computational Method of Reading Index Files for SOAPsnv.
Zhu, Xiaoqian; Peng, Shaoliang; Liu, Shaojie; Cui, Yingbo; Gu, Xiang; Gao, Ming; Fang, Lin; Fang, Xiaodong
2015-12-01
SOAPsnv is the software used for identifying the single nucleotide variation in cancer genes. However, its performance is yet to match the massive amount of data to be processed. Experiments reveal that the main performance bottleneck of SOAPsnv software is the pileup algorithm. The original pileup algorithm's I/O process is time-consuming and inefficient to read input files. Moreover, the scalability of the pileup algorithm is also poor. Therefore, we designed a new algorithm, named BamPileup, aiming to improve the performance of sequential read, and the new pileup algorithm implemented a parallel read mode based on index. Using this method, each thread can directly read the data start from a specific position. The results of experiments on the Tianhe-2 supercomputer show that, when reading data in a multi-threaded parallel I/O way, the processing time of algorithm is reduced to 3.9 s and the application program can achieve a speedup up to 100×. Moreover, the scalability of the new algorithm is also satisfying.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URL's (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
Storing files in a parallel computing system using list-based index to identify replica files
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faibish, Sorin; Bent, John M.; Tzelnic, Percy
Improved techniques are provided for storing files in a parallel computing system using a list-based index to identify file replicas. A file and at least one replica of the file are stored in one or more storage nodes of the parallel computing system. An index for the file comprises at least one list comprising a pointer to a storage location of the file and a storage location of the at least one replica of the file. The file comprises one or more of a complete file and one or more sub-files. The index may also comprise a checksum value formore » one or more of the file and the replica(s) of the file. The checksum value can be evaluated to validate the file and/or the file replica(s). A query can be processed using the list.« less
High Performance Input/Output for Parallel Computer Systems
NASA Technical Reports Server (NTRS)
Ligon, W. B.
1996-01-01
The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, The Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX- like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications fiom levels 1,2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.
Distributed Storage Algorithm for Geospatial Image Data Based on Data Access Patterns.
Pan, Shaoming; Li, Yongkai; Xu, Zhengquan; Chong, Yanwen
2015-01-01
Declustering techniques are widely used in distributed environments to reduce query response time through parallel I/O by splitting large files into several small blocks and then distributing those blocks among multiple storage nodes. Unfortunately, however, many small geospatial image data files cannot be further split for distributed storage. In this paper, we propose a complete theoretical system for the distributed storage of small geospatial image data files based on mining the access patterns of geospatial image data using their historical access log information. First, an algorithm is developed to construct an access correlation matrix based on the analysis of the log information, which reveals the patterns of access to the geospatial image data. Then, a practical heuristic algorithm is developed to determine a reasonable solution based on the access correlation matrix. Finally, a number of comparative experiments are presented, demonstrating that our algorithm displays a higher total parallel access probability than those of other algorithms by approximately 10-15% and that the performance can be further improved by more than 20% by simultaneously applying a copy storage strategy. These experiments show that the algorithm can be applied in distributed environments to help realize parallel I/O and thereby improve system performance.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kodavasal, Janardhan; Harms, Kevin; Srivastava, Priyesh
A closed-cycle gasoline compression ignition engine simulation near top dead center (TDC) was used to profile the performance of a parallel commercial engine computational fluid dynamics code, as it was scaled on up to 4096 cores of an IBM Blue Gene/Q supercomputer. The test case has 9 million cells near TDC, with a fixed mesh size of 0.15 mm, and was run on configurations ranging from 128 to 4096 cores. Profiling was done for a small duration of 0.11 crank angle degrees near TDC during ignition. Optimization of input/output performance resulted in a significant speedup in reading restart files, andmore » in an over 100-times speedup in writing restart files and files for post-processing. Improvements to communication resulted in a 1400-times speedup in the mesh load balancing operation during initialization, on 4096 cores. An improved, “stiffness-based” algorithm for load balancing chemical kinetics calculations was developed, which results in an over 3-times faster run-time near ignition on 4096 cores relative to the original load balancing scheme. With this improvement to load balancing, the code achieves over 78% scaling efficiency on 2048 cores, and over 65% scaling efficiency on 4096 cores, relative to 256 cores.« less
Active Storage with Analytics Capabilities and I/O Runtime System for Petascale Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhary, Alok
Computational scientists must understand results from experimental, observational and computational simulation generated data to gain insights and perform knowledge discovery. As systems approach the petascale range, problems that were unimaginable a few years ago are within reach. With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive I/O, storage, acceleration of data manipulation, analysis, and mining tools. Scientists require techniques, tools and infrastructure to facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis, statistical analysis and knowledgemore » discovery. The goal of this work is to enable more effective analysis of scientific datasets through the integration of enhancements in the I/O stack, from active storage support at the file system layer to MPI-IO and high-level I/O library layers. We propose to provide software components to accelerate data analytics, mining, I/O, and knowledge discovery for large-scale scientific applications, thereby increasing productivity of both scientists and the systems. Our approaches include 1) design the interfaces in high-level I/O libraries, such as parallel netCDF, for applications to activate data mining operations at the lower I/O layers; 2) Enhance MPI-IO runtime systems to incorporate the functionality developed as a part of the runtime system design; 3) Develop parallel data mining programs as part of runtime library for server-side file system in PVFS file system; and 4) Prototype an active storage cluster, which will utilize multicore CPUs, GPUs, and FPGAs to carry out the data mining workload.« less
The Scalable Checkpoint/Restart Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moody, A.
The Scalable Checkpoint/Restart (SCR) library provides an interface that codes may use to worite our and read in application-level checkpoints in a scalable fashion. In the current implementation, checkpoint files are cached in local storage (hard disk or RAM disk) on the compute nodes. This technique provides scalable aggregate bandwidth and uses storage resources that are fully dedicated to the job. This approach addresses the two common drawbacks of checkpointing a large-scale application to a shared parallel file system, namely, limited bandwidth and file system contention. In fact, on current platforms, SCR scales linearly with the number of compute nodes.more » It has been benchmarked as high as 720GB/s on 1094 nodes of Atlas, which is nearly two orders of magnitude faster thanthe parallel file system.« less
CloudMC: a cloud computing application for Monte Carlo simulation.
Miras, H; Jiménez, R; Miras, C; Gomà, C
2013-04-21
This work presents CloudMC, a cloud computing application-developed in Windows Azure®, the platform of the Microsoft® cloud-for the parallelization of Monte Carlo simulations in a dynamic virtual cluster. CloudMC is a web application designed to be independent of the Monte Carlo code in which the simulations are based-the simulations just need to be of the form: input files → executable → output files. To study the performance of CloudMC in Windows Azure®, Monte Carlo simulations with penelope were performed on different instance (virtual machine) sizes, and for different number of instances. The instance size was found to have no effect on the simulation runtime. It was also found that the decrease in time with the number of instances followed Amdahl's law, with a slight deviation due to the increase in the fraction of non-parallelizable time with increasing number of instances. A simulation that would have required 30 h of CPU on a single instance was completed in 48.6 min when executed on 64 instances in parallel (speedup of 37 ×). Furthermore, the use of cloud computing for parallel computing offers some advantages over conventional clusters: high accessibility, scalability and pay per usage. Therefore, it is strongly believed that cloud computing will play an important role in making Monte Carlo dose calculation a reality in future clinical practice.
Siretskiy, Alexey; Sundqvist, Tore; Voznesenskiy, Mikhail; Spjuth, Ola
2015-01-01
New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology. In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in numbers. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories. From our experiments we can conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources, we also conclude that Hadoop is an economically viable option for the common data sizes that are currently used in massively parallel sequencing. Given that datasets are expected to increase over time, Hadoop is a framework that we envision will have an increasingly important role in future biological data analysis.
PFLOTRAN: Reactive Flow & Transport Code for Use on Laptops to Leadership-Class Supercomputers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hammond, Glenn E.; Lichtner, Peter C.; Lu, Chuan
PFLOTRAN, a next-generation reactive flow and transport code for modeling subsurface processes, has been designed from the ground up to run efficiently on machines ranging from leadership-class supercomputers to laptops. Based on an object-oriented design, the code is easily extensible to incorporate additional processes. It can interface seamlessly with Fortran 9X, C and C++ codes. Domain decomposition parallelism is employed, with the PETSc parallel framework used to manage parallel solvers, data structures and communication. Features of the code include a modular input file, implementation of high-performance I/O using parallel HDF5, ability to perform multiple realization simulations with multiple processors permore » realization in a seamless manner, and multiple modes for multiphase flow and multicomponent geochemical transport. Chemical reactions currently implemented in the code include homogeneous aqueous complexing reactions and heterogeneous mineral precipitation/dissolution, ion exchange, surface complexation and a multirate kinetic sorption model. PFLOTRAN has demonstrated petascale performance using 2{sup 17} processor cores with over 2 billion degrees of freedom. Accomplishments achieved to date include applications to the Hanford 300 Area and modeling CO{sub 2} sequestration in deep geologic formations.« less
Checkpoint-Restart in User Space
DOE Office of Scientific and Technical Information (OSTI.GOV)
CRUISE implements a user-space file system that stores data in main memory and transparently spills over to other storage, like local flash memory or the parallel file system, as needed. CRUISE also exposes file contents fo remote direct memory access, allowing external tools to copy files to the parallel file system in the background with reduced CPU interruption.
An Approach to Integrate a Space-Time GIS Data Model with High Performance Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Dali; Zhao, Ziliang; Shaw, Shih-Lung
2011-01-01
In this paper, we describe an approach to integrate a Space-Time GIS data model on a high performance computing platform. The Space-Time GIS data model has been developed on a desktop computing environment. We use the Space-Time GIS data model to generate GIS module, which organizes a series of remote sensing data. We are in the process of porting the GIS module into an HPC environment, in which the GIS modules handle large dataset directly via parallel file system. Although it is an ongoing project, authors hope this effort can inspire further discussions on the integration of GIS on highmore » performance computing platforms.« less
ArrayBridge: Interweaving declarative array processing with high-performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros
Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files. This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats, that aimsmore » to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation in NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine.« less
Storing files in a parallel computing system based on user or application specification
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faibish, Sorin; Bent, John M.; Nick, Jeffrey M.
2016-03-29
Techniques are provided for storing files in a parallel computing system based on a user-specification. A plurality of files generated by a distributed application in a parallel computing system are stored by obtaining a specification from the distributed application indicating how the plurality of files should be stored; and storing one or more of the plurality of files in one or more storage nodes of a multi-tier storage system based on the specification. The plurality of files comprise a plurality of complete files and/or a plurality of sub-files. The specification can optionally be processed by a daemon executing on onemore » or more nodes in a multi-tier storage system. The specification indicates how the plurality of files should be stored, for example, identifying one or more storage nodes where the plurality of files should be stored.« less
The Automated Instrumentation and Monitoring System (AIMS) reference manual
NASA Technical Reports Server (NTRS)
Yan, Jerry; Hontalas, Philip; Listgarten, Sherry
1993-01-01
Whether a researcher is designing the 'next parallel programming paradigm,' another 'scalable multiprocessor' or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. Careful analysis of execution traces can help computer designers and software architects to uncover system behavior and to take advantage of specific application characteristics and hardware features. A software tool kit that facilitates performance evaluation of parallel applications on multiprocessors is described. The Automated Instrumentation and Monitoring System (AIMS) has four major software components: a source code instrumentor which automatically inserts active event recorders into the program's source code before compilation; a run time performance-monitoring library, which collects performance data; a trace file animation and analysis tool kit which reconstructs program execution from the trace file; and a trace post-processor which compensate for data collection overhead. Besides being used as prototype for developing new techniques for instrumenting, monitoring, and visualizing parallel program execution, AIMS is also being incorporated into the run-time environments of various hardware test beds to evaluate their impact on user productivity. Currently, AIMS instrumentors accept FORTRAN and C parallel programs written for Intel's NX operating system on the iPSC family of multi computers. A run-time performance-monitoring library for the iPSC/860 is included in this release. We plan to release monitors for other platforms (such as PVM and TMC's CM-5) in the near future. Performance data collected can be graphically displayed on workstations (e.g. Sun Sparc and SGI) supporting X-Windows (in particular, Xl IR5, Motif 1.1.3).
MarFS, a Near-POSIX Interface to Cloud Objects
DOE Office of Scientific and Technical Information (OSTI.GOV)
Inman, Jeffrey Thornton; Vining, William Flynn; Ransom, Garrett Wilson
The engineering forces driving development of “cloud” storage have produced resilient, cost-effective storage systems that can scale to 100s of petabytes, with good parallel access and bandwidth. These features would make a good match for the vast storage needs of High-Performance Computing datacenters, but cloud storage gains some of its capability from its use of HTTP-style Representational State Transfer (REST) semantics, whereas most large datacenters have legacy applications that rely on POSIX file-system semantics. MarFS is an open-source project at Los Alamos National Laboratory that allows us to present cloud-style object-storage as a scalable near-POSIX file system. We have alsomore » developed a new storage architecture to improve bandwidth and scalability beyond what’s available in commodity object stores, while retaining their resilience and economy. Additionally, we present a scheme for scaling the POSIX interface to allow billions of files in a single directory and trillions of files in total.« less
MarFS, a Near-POSIX Interface to Cloud Objects
Inman, Jeffrey Thornton; Vining, William Flynn; Ransom, Garrett Wilson; ...
2017-01-01
The engineering forces driving development of “cloud” storage have produced resilient, cost-effective storage systems that can scale to 100s of petabytes, with good parallel access and bandwidth. These features would make a good match for the vast storage needs of High-Performance Computing datacenters, but cloud storage gains some of its capability from its use of HTTP-style Representational State Transfer (REST) semantics, whereas most large datacenters have legacy applications that rely on POSIX file-system semantics. MarFS is an open-source project at Los Alamos National Laboratory that allows us to present cloud-style object-storage as a scalable near-POSIX file system. We have alsomore » developed a new storage architecture to improve bandwidth and scalability beyond what’s available in commodity object stores, while retaining their resilience and economy. Additionally, we present a scheme for scaling the POSIX interface to allow billions of files in a single directory and trillions of files in total.« less
Collignon, Barbara; Schulz, Roland; Smith, Jeremy C; Baudry, Jerome
2011-04-30
A message passing interface (MPI)-based implementation (Autodock4.lga.MPI) of the grid-based docking program Autodock4 has been developed to allow simultaneous and independent docking of multiple compounds on up to thousands of central processing units (CPUs) using the Lamarkian genetic algorithm. The MPI version reads a single binary file containing precalculated grids that represent the protein-ligand interactions, i.e., van der Waals, electrostatic, and desolvation potentials, and needs only two input parameter files for the entire docking run. In comparison, the serial version of Autodock4 reads ASCII grid files and requires one parameter file per compound. The modifications performed result in significantly reduced input/output activity compared with the serial version. Autodock4.lga.MPI scales up to 8192 CPUs with a maximal overhead of 16.3%, of which two thirds is due to input/output operations and one third originates from MPI operations. The optimal docking strategy, which minimizes docking CPU time without lowering the quality of the database enrichments, comprises the docking of ligands preordered from the most to the least flexible and the assignment of the number of energy evaluations as a function of the number of rotatable bounds. In 24 h, on 8192 high-performance computing CPUs, the present MPI version would allow docking to a rigid protein of about 300K small flexible compounds or 11 million rigid compounds.
Parallel file system with metadata distributed across partitioned key-value store c
Bent, John M.; Faibish, Sorin; Grider, Gary; Torres, Aaron
2017-09-19
Improved techniques are provided for storing metadata associated with a plurality of sub-files associated with a single shared file in a parallel file system. The shared file is generated by a plurality of applications executing on a plurality of compute nodes. A compute node implements a Parallel Log Structured File System (PLFS) library to store at least one portion of the shared file generated by an application executing on the compute node and metadata for the at least one portion of the shared file on one or more object storage servers. The compute node is also configured to implement a partitioned data store for storing a partition of the metadata for the shared file, wherein the partitioned data store communicates with partitioned data stores on other compute nodes using a message passing interface. The partitioned data store can be implemented, for example, using Multidimensional Data Hashing Indexing Middleware (MDHIM).
High-performance mass storage system for workstations
NASA Technical Reports Server (NTRS)
Chiang, T.; Tang, Y.; Gupta, L.; Cooperman, S.
1993-01-01
Reduced Instruction Set Computer (RISC) workstations and Personnel Computers (PC) are very popular tools for office automation, command and control, scientific analysis, database management, and many other applications. However, when using Input/Output (I/O) intensive applications, the RISC workstations and PC's are often overburdened with the tasks of collecting, staging, storing, and distributing data. Also, by using standard high-performance peripherals and storage devices, the I/O function can still be a common bottleneck process. Therefore, the high-performance mass storage system, developed by Loral AeroSys' Independent Research and Development (IR&D) engineers, can offload a RISC workstation of I/O related functions and provide high-performance I/O functions and external interfaces. The high-performance mass storage system has the capabilities to ingest high-speed real-time data, perform signal or image processing, and stage, archive, and distribute the data. This mass storage system uses a hierarchical storage structure, thus reducing the total data storage cost, while maintaining high-I/O performance. The high-performance mass storage system is a network of low-cost parallel processors and storage devices. The nodes in the network have special I/O functions such as: SCSI controller, Ethernet controller, gateway controller, RS232 controller, IEEE488 controller, and digital/analog converter. The nodes are interconnected through high-speed direct memory access links to form a network. The topology of the network is easily reconfigurable to maximize system throughput for various applications. This high-performance mass storage system takes advantage of a 'busless' architecture for maximum expandability. The mass storage system consists of magnetic disks, a WORM optical disk jukebox, and an 8mm helical scan tape to form a hierarchical storage structure. Commonly used files are kept in the magnetic disk for fast retrieval. The optical disks are used as archive media, and the tapes are used as backup media. The storage system is managed by the IEEE mass storage reference model-based UniTree software package. UniTree software will keep track of all files in the system, will automatically migrate the lesser used files to archive media, and will stage the files when needed by the system. The user can access the files without knowledge of their physical location. The high-performance mass storage system developed by Loral AeroSys will significantly boost the system I/O performance and reduce the overall data storage cost. This storage system provides a highly flexible and cost-effective architecture for a variety of applications (e.g., realtime data acquisition with a signal and image processing requirement, long-term data archiving and distribution, and image analysis and enhancement).
A parallel calibration utility for WRF-Hydro on high performance computers
NASA Astrophysics Data System (ADS)
Wang, J.; Wang, C.; Kotamarthi, V. R.
2017-12-01
A successful modeling of complex hydrological processes comprises establishing an integrated hydrological model which simulates the hydrological processes in each water regime, calibrates and validates the model performance based on observation data, and estimates the uncertainties from different sources especially those associated with parameters. Such a model system requires large computing resources and often have to be run on High Performance Computers (HPC). The recently developed WRF-Hydro modeling system provides a significant advancement in the capability to simulate regional water cycles more completely. The WRF-Hydro model has a large range of parameters such as those in the input table files — GENPARM.TBL, SOILPARM.TBL and CHANPARM.TBL — and several distributed scaling factors such as OVROUGHRTFAC. These parameters affect the behavior and outputs of the model and thus may need to be calibrated against the observations in order to obtain a good modeling performance. Having a parameter calibration tool specifically for automate calibration and uncertainty estimates of WRF-Hydro model can provide significant convenience for the modeling community. In this study, we developed a customized tool using the parallel version of the model-independent parameter estimation and uncertainty analysis tool, PEST, to enabled it to run on HPC with PBS and SLURM workload manager and job scheduler. We also developed a series of PEST input file templates that are specifically for WRF-Hydro model calibration and uncertainty analysis. Here we will present a flood case study occurred in April 2013 over Midwest. The sensitivity and uncertainties are analyzed using the customized PEST tool we developed.
ParCAT: A Parallel Climate Analysis Toolkit
NASA Astrophysics Data System (ADS)
Haugen, B.; Smith, B.; Steed, C.; Ricciuto, D. M.; Thornton, P. E.; Shipman, G.
2012-12-01
Climate science has employed increasingly complex models and simulations to analyze the past and predict the future of our climate. The size and dimensionality of climate simulation data has been growing with the complexity of the models. This growth in data is creating a widening gap between the data being produced and the tools necessary to analyze large, high dimensional data sets. With single run data sets increasing into 10's, 100's and even 1000's of gigabytes, parallel computing tools are becoming a necessity in order to analyze and compare climate simulation data. The Parallel Climate Analysis Toolkit (ParCAT) provides basic tools that efficiently use parallel computing techniques to narrow the gap between data set size and analysis tools. ParCAT was created as a collaborative effort between climate scientists and computer scientists in order to provide efficient parallel implementations of the computing tools that are of use to climate scientists. Some of the basic functionalities included in the toolkit are the ability to compute spatio-temporal means and variances, differences between two runs and histograms of the values in a data set. ParCAT is designed to facilitate the "heavy lifting" that is required for large, multidimensional data sets. The toolkit does not focus on performing the final visualizations and presentation of results but rather, reducing large data sets to smaller, more manageable summaries. The output from ParCAT is provided in commonly used file formats (NetCDF, CSV, ASCII) to allow for simple integration with other tools. The toolkit is currently implemented as a command line utility, but will likely also provide a C library for developers interested in tighter software integration. Elements of the toolkit are already being incorporated into projects such as UV-CDAT and CMDX. There is also an effort underway to implement portions of the CCSM Land Model Diagnostics package using ParCAT in conjunction with Python and gnuplot. ParCAT is implemented in C to provide efficient file IO. The file IO operations in the toolkit use the parallel-netcdf library; this enables the code to use the parallel IO capabilities of modern HPC systems. Analysis that currently requires an estimated 12+ hours with the traditional CCSM Land Model Diagnostics Package can now be performed in as little as 30 minutes on a single desktop workstation and a few minutes for relatively small jobs completed on modern HPC systems such as ORNL's Jaguar.
Zhang, Yan; Xie, Mengying; Roscow, James; Bao, Yinxiang; Zhou, Kechao
2017-01-01
This paper demonstrates the significant benefits of exploiting highly aligned porosity in piezoelectric and pyroelectric materials for improved energy harvesting performance. Porous lead zirconate (PZT) ceramics with aligned pore channels and varying fractions of porosity were manufactured in a water-based suspension using freeze-casting. The aligned porous PZT ceramics were characterized in detail for both piezoelectric and pyroelectric properties and their energy harvesting performance figures of merit were assessed parallel and perpendicular to the freezing direction. As a result of the introduction of porosity into the ceramic microstructure, high piezoelectric and pyroelectric harvesting figures of merits were achieved for porous freeze-cast PZT compared to dense PZT due to the reduced permittivity and volume specific heat capacity. Experimental results were compared to parallel and series analytical models with good agreement and the PZT with porosity aligned parallel to the freezing direction exhibited the highest piezoelectric and pyroelectric harvesting response; this was a result of the enhanced interconnectivity of the ferroelectric material along the poling direction and reduced fraction of unpoled material that leads to a higher polarization. A complete thermal energy harvesting system, composed of a parallel-aligned PZT harvester element and an AC/DC converter, was successfully demonstrated by charging a storage capacitor. The maximum energy density generated by the 60 vol% porous parallel-connected PZT when subjected to thermal oscillations was 1653 μJ cm–3, which was 374% higher than that of the dense PZT with an energy density of 446 μJ cm–3. The results are beneficial for the design and manufacture of high performance porous pyroelectric and piezoelectric materials in devices for energy harvesting and sensor applications. PMID:28580142
Cloud parallel processing of tandem mass spectrometry based proteomics data.
Mohammed, Yassene; Mostovenko, Ekaterina; Henneman, Alex A; Marissen, Rob J; Deelder, André M; Palmblad, Magnus
2012-10-05
Data analysis in mass spectrometry based proteomics struggles to keep pace with the advances in instrumentation and the increasing rate of data acquisition. Analyzing this data involves multiple steps requiring diverse software, using different algorithms and data formats. Speed and performance of the mass spectral search engines are continuously improving, although not necessarily as needed to face the challenges of acquired big data. Improving and parallelizing the search algorithms is one possibility; data decomposition presents another, simpler strategy for introducing parallelism. We describe a general method for parallelizing identification of tandem mass spectra using data decomposition that keeps the search engine intact and wraps the parallelization around it. We introduce two algorithms for decomposing mzXML files and recomposing resulting pepXML files. This makes the approach applicable to different search engines, including those relying on sequence databases and those searching spectral libraries. We use cloud computing to deliver the computational power and scientific workflow engines to interface and automate the different processing steps. We show how to leverage these technologies to achieve faster data analysis in proteomics and present three scientific workflows for parallel database as well as spectral library search using our data decomposition programs, X!Tandem and SpectraST.
NASA Astrophysics Data System (ADS)
Jang, W.; Engda, T. A.; Neff, J. C.; Herrick, J.
2017-12-01
Many crop models are increasingly used to evaluate crop yields at regional and global scales. However, implementation of these models across large areas using fine-scale grids is limited by computational time requirements. In order to facilitate global gridded crop modeling with various scenarios (i.e., different crop, management schedule, fertilizer, and irrigation) using the Environmental Policy Integrated Climate (EPIC) model, we developed a distributed parallel computing framework in Python. Our local desktop with 14 cores (28 threads) was used to test the distributed parallel computing framework in Iringa, Tanzania which has 406,839 grid cells. High-resolution soil data, SoilGrids (250 x 250 m), and climate data, AgMERRA (0.25 x 0.25 deg) were also used as input data for the gridded EPIC model. The framework includes a master file for parallel computing, input database, input data formatters, EPIC model execution, and output analyzers. Through the master file for parallel computing, the user-defined number of threads of CPU divides the EPIC simulation into jobs. Then, Using EPIC input data formatters, the raw database is formatted for EPIC input data and the formatted data moves into EPIC simulation jobs. Then, 28 EPIC jobs run simultaneously and only interesting results files are parsed and moved into output analyzers. We applied various scenarios with seven different slopes and twenty-four fertilizer ranges. Parallelized input generators create different scenarios as a list for distributed parallel computing. After all simulations are completed, parallelized output analyzers are used to analyze all outputs according to the different scenarios. This saves significant computing time and resources, making it possible to conduct gridded modeling at regional to global scales with high-resolution data. For example, serial processing for the Iringa test case would require 113 hours, while using the framework developed in this study requires only approximately 6 hours, a nearly 95% reduction in computing time.
NASA Astrophysics Data System (ADS)
Little, Duncan A.; Tennyson, Jonathan; Plummer, Martin; Noble, Clifford J.; Sunderland, Andrew G.
2017-06-01
TIMEDELN implements the time-delay method of determining resonance parameters from the characteristic Lorentzian form displayed by the largest eigenvalues of the time-delay matrix. TIMEDELN constructs the time-delay matrix from input K-matrices and analyses its eigenvalues. This new version implements multi-resonance fitting and may be run serially or as a high performance parallel code with three levels of parallelism. TIMEDELN takes K-matrices from a scattering calculation, either read from a file or calculated on a dynamically adjusted grid, and calculates the time-delay matrix. This is then diagonalized, with the largest eigenvalue representing the longest time-delay experienced by the scattering particle. A resonance shows up as a characteristic Lorentzian form in the time-delay: the programme searches the time-delay eigenvalues for maxima and traces resonances when they pass through different eigenvalues, separating overlapping resonances. It also performs the fitting of the calculated data to the Lorentzian form and outputs resonance positions and widths. Any remaining overlapping resonances can be fitted jointly. The branching ratios of decay into the open channels can also be found. The programme may be run serially or in parallel with three levels of parallelism. The parallel code modules are abstracted from the main physics code and can be used independently.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gibson, Garth
Petascale computing infrastructures for scientific discovery make petascale demands on information storage capacity, performance, concurrency, reliability, availability, and manageability. The Petascale Data Storage Institute focuses on the data storage problems found in petascale scientific computing environments, with special attention to community issues such as interoperability, community buy-in, and shared tools. The Petascale Data Storage Institute is a collaboration between researchers at Carnegie Mellon University, National Energy Research Scientific Computing Center, Pacific Northwest National Laboratory, Oak Ridge National Laboratory, Sandia National Laboratory, Los Alamos National Laboratory, University of Michigan, and the University of California at Santa Cruz. Because the Institute focusesmore » on low level files systems and storage systems, its role in improving SciDAC systems was one of supporting application middleware such as data management and system-level performance tuning. In retrospect, the Petascale Data Storage Institute’s most innovative and impactful contribution is the Parallel Log-structured File System (PLFS). Published in SC09, PLFS is middleware that operates in MPI-IO or embedded in FUSE for non-MPI applications. Its function is to decouple concurrently written files into a per-process log file, whose impact (the contents of the single file that the parallel application was concurrently writing) is determined on later reading, rather than during its writing. PLFS is transparent to the parallel application, offering a POSIX or MPI-IO interface, and it shows an order of magnitude speedup to the Chombo benchmark and two orders of magnitude to the FLASH benchmark. Moreover, LANL production applications see speedups of 5X to 28X, so PLFS has been put into production at LANL. Originally conceived and prototyped in a PDSI collaboration between LANL and CMU, it has grown to engage many other PDSI institutes, international partners like AWE, and has a large team at EMC supporting and enhancing it. PLFS is open sourced with a BSD license on sourceforge. Post PDSI funding comes from NNSA and industry sources. Moreover, PLFS has spin out half a dozen or more papers, partnered on research with multiple schools and vendors, and has projects to transparently 1) dis- tribute metadata over independent metadata servers, 2) exploit drastically non-POSIX Hadoop storage for HPC POSIX applications, 3) compress checkpoints on the fly, 4) batch delayed writes for write speed, 5) compress read-back indexes and parallelize their redistribution, 6) double-buffer writes in NAND Flash storage to decouple host blocking during checkpoint from disk write time in the storage system, 7) pack small files into a smaller number of bigger containers. There are two large scale open source Linux software projects that PDSI significantly incubated, though neither were initated in PDSI. These are 1) Ceph, a UCSC parallel object storage research project that has continued to be a vehicle for research, and has become a released part of Linux, and 2) Parallel NFS (pNFS) a portion of the IETF’s NFSv4.1 that brings the core data parallelism found in Lustre, PanFS, PVFS, and Ceph to the industry standard NFS, with released code in Linux 3.0, and its vendor offerings, with products from NetApp, EMC, BlueArc and RedHat. Both are fundamentally supported and advanced by vendor companies now, but were critcally transferred from research demonstration to viable product with funding from PDSI, in part. At this point Lustre remains the primary path to scalable IO in Exascale systems, but both Ceph and pNFS are viable alternatives with different fundamental advantages. Finally, research community building was a big success for PDSI. Through the HECFSIO workshops and HECURA project with NSF PDSI stimulated and helped to steer leveraged funding of over $25M. Through the Petascale (now Parallel) Data Storage Workshop series, www.pdsw.org, colocated with SCxy each year, PDSI created and incubated five offerings of this high-attendance workshop. The workshop has gone on without PDSI support with two more highly successfully workshops, rewriting its organizational structure to be community managed. More than 70 peer reviewed papers have been presented at PDSW workshops.« less
High-Performance I/O: HDF5 for Lattice QCD
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kurth, Thorsten; Pochinsky, Andrew; Sarje, Abhinav
2015-01-01
Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design andmore » implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.« less
High-Performance I/O: HDF5 for Lattice QCD
Kurth, Thorsten; Pochinsky, Andrew; Sarje, Abhinav; ...
2017-05-09
Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design andmore » implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.« less
SciDB versus Spark: A Preliminary Comparison Based on an Earth Science Use Case
NASA Astrophysics Data System (ADS)
Clune, T.; Kuo, K. S.; Doan, K.; Oloso, A.
2015-12-01
We compare two Big Data technologies, SciDB and Spark, for performance, usability, and extensibility, when applied to a representative Earth science use case. SciDB is a new-generation parallel distributed database management system (DBMS) based on the array data model that is capable of handling multidimensional arrays efficiently but requires lengthy data ingest prior to analysis, whereas Spark is a fast and general engine for large scale data processing that can immediately process raw data files and thereby avoid the ingest process. Once data have been ingested, SciDB is very efficient in database operations such as subsetting. Spark, on the other hand, provides greater flexibility by supporting a wide variety of high-level tools including DBMS's. For the performance aspect of this preliminary comparison, we configure Spark to operate directly on text or binary data files and thereby limit the need for additional tools. Arguably, a more appropriate comparison would involve exploring other configurations of Spark which exploit supported high-level tools, but that is beyond our current resources. To make the comparison as "fair" as possible, we export the arrays produced by SciDB into text files (or converting them to binary files) for the intake by Spark and thereby avoid any additional file processing penalties. The Earth science use case selected for this comparison is the identification and tracking of snowstorms in the NASA Modern Era Retrospective-analysis for Research and Applications (MERRA) reanalysis data. The identification portion of the use case is to flag all grid cells of the MERRA high-resolution hourly data that satisfies our criteria for snowstorm, whereas the tracking portion connects flagged cells adjacent in time and space to form a snowstorm episode. We will report the results of our comparisons at this presentation.
File-System Workload on a Scientific Multiprocessor
NASA Technical Reports Server (NTRS)
Kotz, David; Nieuwejaar, Nils
1995-01-01
Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors a their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize 1/0 in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests-in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center.
Nemesis I: Parallel Enhancements to ExodusII
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hennigan, Gary L.; John, Matthew S.; Shadid, John N.
2006-03-28
NEMESIS I is an enhancement to the EXODUS II finite element database model used to store and retrieve data for unstructured parallel finite element analyses. NEMESIS I adds data structures which facilitate the partitioning of a scalar (standard serial) EXODUS II file onto parallel disk systems found on many parallel computers. Since the NEMESIS I application programming interface (APl)can be used to append information to an existing EXODUS II files can be used on files which contain NEMESIS I information. The NEMESIS I information is written and read via C or C++ callable functions which compromise the NEMESIS I API.
Scalable Molecular Dynamics with NAMD
Phillips, James C.; Braun, Rosemary; Wang, Wei; Gumbart, James; Tajkhorshid, Emad; Villa, Elizabeth; Chipot, Christophe; Skeel, Robert D.; Kalé, Laxmikant; Schulten, Klaus
2008-01-01
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD scales to hundreds of processors on high-end parallel platforms, as well as tens of processors on low-cost commodity clusters, and also runs on individual desktop and laptop computers. NAMD works with AMBER and CHARMM potential functions, parameters, and file formats. This paper, directed to novices as well as experts, first introduces concepts and methods used in the NAMD program, describing the classical molecular dynamics force field, equations of motion, and integration methods along with the efficient electrostatics evaluation algorithms employed and temperature and pressure controls used. Features for steering the simulation across barriers and for calculating both alchemical and conformational free energy differences are presented. The motivations for and a roadmap to the internal design of NAMD, implemented in C++ and based on Charm++ parallel objects, are outlined. The factors affecting the serial and parallel performance of a simulation are discussed. Next, typical NAMD use is illustrated with representative applications to a small, a medium, and a large biomolecular system, highlighting particular features of NAMD, e.g., the Tcl scripting language. Finally, the paper provides a list of the key features of NAMD and discusses the benefits of combining NAMD with the molecular graphics/sequence analysis software VMD and the grid computing/collaboratory software BioCoRE. NAMD is distributed free of charge with source code at www.ks.uiuc.edu. PMID:16222654
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ilsche, Thomas; Schuchart, Joseph; Cope, Joseph
Event tracing is an important tool for understanding the performance of parallel applications. As concurrency increases in leadership-class computing systems, the quantity of performance log data can overload the parallel file system, perturbing the application being observed. In this work we present a solution for event tracing at leadership scales. We enhance the I/O forwarding system software to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system for this type of traffic. Furthermore, we augment the I/O forwarding system with a write buffering capability to limit the impactmore » of artificial perturbations from log data accesses on traced applications. To validate the approach, we modify the Vampir tracing tool to take advantage of this new capability and show that the approach increases the maximum traced application size by a factor of 5x to more than 200,000 processors.« less
NASA Astrophysics Data System (ADS)
Zhu, F.; Yu, H.; Rilee, M. L.; Kuo, K. S.; Yu, L.; Pan, Y.; Jiang, H.
2017-12-01
Since the establishment of data archive centers and the standardization of file formats, scientists are required to search metadata catalogs for data needed and download the data files to their local machines to carry out data analysis. This approach has facilitated data discovery and access for decades, but it inevitably leads to data transfer from data archive centers to scientists' computers through low-bandwidth Internet connections. Data transfer becomes a major performance bottleneck in such an approach. Combined with generally constrained local compute/storage resources, they limit the extent of scientists' studies and deprive them of timely outcomes. Thus, this conventional approach is not scalable with respect to both the volume and variety of geoscience data. A much more viable solution is to couple analysis and storage systems to minimize data transfer. In our study, we compare loosely coupled approaches (exemplified by Spark and Hadoop) and tightly coupled approaches (exemplified by parallel distributed database management systems, e.g., SciDB). In particular, we investigate the optimization of data placement and movement to effectively tackle the variety challenge, and boost the popularization of parallelization to address the volume challenge. Our goal is to enable high-performance interactive analysis for a good portion of geoscience data analysis exercise. We show that tightly coupled approaches can concentrate data traffic between local storage systems and compute units, and thereby optimizing bandwidth utilization to achieve a better throughput. Based on our observations, we develop a geoscience data analysis system that tightly couples analysis engines with storages, which has direct access to the detailed map of data partition locations. Through an innovation data partitioning and distribution scheme, our system has demonstrated scalable and interactive performance in real-world geoscience data analysis applications.
Automated quality control in a file-based broadcasting workflow
NASA Astrophysics Data System (ADS)
Zhang, Lina
2014-04-01
Benefit from the development of information and internet technologies, television broadcasting is transforming from inefficient tape-based production and distribution to integrated file-based workflows. However, no matter how many changes have took place, successful broadcasting still depends on the ability to deliver a consistent high quality signal to the audiences. After the transition from tape to file, traditional methods of manual quality control (QC) become inadequate, subjective, and inefficient. Based on China Central Television's full file-based workflow in the new site, this paper introduces an automated quality control test system for accurate detection of hidden troubles in media contents. It discusses the system framework and workflow control when the automated QC is added. It puts forward a QC criterion and brings forth a QC software followed this criterion. It also does some experiments on QC speed by adopting parallel processing and distributed computing. The performance of the test system shows that the adoption of automated QC can make the production effective and efficient, and help the station to achieve a competitive advantage in the media market.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kurth, Thorsten; Pochinsky, Andrew; Sarje, Abhinav
Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design andmore » implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.« less
SDS: A Framework for Scientific Data Services
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dong, Bin; Byna, Surendra; Wu, Kesheng
2013-10-31
Large-scale scientific applications typically write their data to parallel file systems with organizations designed to achieve fast write speeds. Analysis tasks frequently read the data in a pattern that is different from the write pattern, and therefore experience poor I/O performance. In this paper, we introduce a prototype framework for bridging the performance gap between write and read stages of data access from parallel file systems. We call this framework Scientific Data Services, or SDS for short. This initial implementation of SDS focuses on reorganizing previously written files into data layouts that benefit read patterns, and transparently directs read callsmore » to the reorganized data. SDS follows a client-server architecture. The SDS Server manages partial or full replicas of reorganized datasets and serves SDS Clients' requests for data. The current version of the SDS client library supports HDF5 programming interface for reading data. The client library intercepts HDF5 calls and transparently redirects them to the reorganized data. The SDS client library also provides a querying interface for reading part of the data based on user-specified selective criteria. We describe the design and implementation of the SDS client-server architecture, and evaluate the response time of the SDS Server and the performance benefits of SDS.« less
An Efficient Format for Nearly Constant-Time Access to Arbitrary Time Intervals in Large Trace Files
Chan, Anthony; Gropp, William; Lusk, Ewing
2008-01-01
A powerful method to aid in understanding the performance of parallel applications uses log or trace files containing time-stamped events and states (pairs of events). These trace files can be very large, often hundreds or even thousands of megabytes. Because of the cost of accessing and displaying such files, other methods are often used that reduce the size of the tracefiles at the cost of sacrificing detail or other information. This paper describes a hierarchical trace file format that provides for display of an arbitrary time window in a time independent of the total size of the file and roughlymore » proportional to the number of events within the time window. This format eliminates the need to sacrifice data to achieve a smaller trace file size (since storage is inexpensive, it is necessary only to make efficient use of bandwidth to that storage). The format can be used to organize a trace file or to create a separate file of annotations that may be used with conventional trace files. We present an analysis of the time to access all of the events relevant to an interval of time and we describe experiments demonstrating the performance of this file format.« less
A General-purpose Framework for Parallel Processing of Large-scale LiDAR Data
NASA Astrophysics Data System (ADS)
Li, Z.; Hodgson, M.; Li, W.
2016-12-01
Light detection and ranging (LiDAR) technologies have proven efficiency to quickly obtain very detailed Earth surface data for a large spatial extent. Such data is important for scientific discoveries such as Earth and ecological sciences and natural disasters and environmental applications. However, handling LiDAR data poses grand geoprocessing challenges due to data intensity and computational intensity. Previous studies received notable success on parallel processing of LiDAR data to these challenges. However, these studies either relied on high performance computers and specialized hardware (GPUs) or focused mostly on finding customized solutions for some specific algorithms. We developed a general-purpose scalable framework coupled with sophisticated data decomposition and parallelization strategy to efficiently handle big LiDAR data. Specifically, 1) a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerable Hadoop distributed file system, 2) two spatial decomposition techniques are developed to enable efficient parallelization of different types of LiDAR processing tasks, and 3) by coupling existing LiDAR processing tools with Hadoop, this framework is able to conduct a variety of LiDAR data processing tasks in parallel in a highly scalable distributed computing environment. The performance and scalability of the framework is evaluated with a series of experiments conducted on a real LiDAR dataset using a proof-of-concept prototype system. The results show that the proposed framework 1) is able to handle massive LiDAR data more efficiently than standalone tools; and 2) provides almost linear scalability in terms of either increased workload (data volume) or increased computing nodes with both spatial decomposition strategies. We believe that the proposed framework provides valuable references on developing a collaborative cyberinfrastructure for processing big earth science data in a highly scalable environment.
Database Reorganization in Parallel Disk Arrays with I/O Service Stealing
NASA Technical Reports Server (NTRS)
Zabback, Peter; Onyuksel, Ibrahim; Scheuermann, Peter; Weikum, Gerhard
1996-01-01
We present a model for data reorganization in parallel disk systems that is geared towards load balancing in an environment with periodic access patterns. Data reorganization is performed by disk cooling, i.e. migrating files or extents from the hottest disks to the coldest ones. We develop an approximate queueing model for determining the effective arrival rates of cooling requests and discuss its use in assessing the costs versus benefits of cooling.
cellGPU: Massively parallel simulations of dynamic vertex models
NASA Astrophysics Data System (ADS)
Sussman, Daniel M.
2017-10-01
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
Data General Corporation Advanced Operating System/Virtual Storage (AOS/ VS). Revision 7.60
1989-02-22
control list for each directory and data file. An access control list includes the users who can and cannot access files as well as the access...and any required data, it can -5- February 22, 1989 Final Evaluation Report Data General AOS/VS SYSTEM OVERVIEW operate asynchronously and in parallel...memory. The IOC can perform the data transfer without further interventiin from the CPU. The I/O channels interface with the processor or system
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-07-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
Toward an automated parallel computing environment for geosciences
NASA Astrophysics Data System (ADS)
Zhang, Huai; Liu, Mian; Shi, Yaolin; Yuen, David A.; Yan, Zhenzhen; Liang, Guoping
2007-08-01
Software for geodynamic modeling has not kept up with the fast growing computing hardware and network resources. In the past decade supercomputing power has become available to most researchers in the form of affordable Beowulf clusters and other parallel computer platforms. However, to take full advantage of such computing power requires developing parallel algorithms and associated software, a task that is often too daunting for geoscience modelers whose main expertise is in geosciences. We introduce here an automated parallel computing environment built on open-source algorithms and libraries. Users interact with this computing environment by specifying the partial differential equations, solvers, and model-specific properties using an English-like modeling language in the input files. The system then automatically generates the finite element codes that can be run on distributed or shared memory parallel machines. This system is dynamic and flexible, allowing users to address different problems in geosciences. It is capable of providing web-based services, enabling users to generate source codes online. This unique feature will facilitate high-performance computing to be integrated with distributed data grids in the emerging cyber-infrastructures for geosciences. In this paper we discuss the principles of this automated modeling environment and provide examples to demonstrate its versatility.
Framework for Parallel Preprocessing of Microarray Data Using Hadoop
2018-01-01
Nowadays, microarray technology has become one of the popular ways to study gene expression and diagnosis of disease. National Center for Biology Information (NCBI) hosts public databases containing large volumes of biological data required to be preprocessed, since they carry high levels of noise and bias. Robust Multiarray Average (RMA) is one of the standard and popular methods that is utilized to preprocess the data and remove the noises. Most of the preprocessing algorithms are time-consuming and not able to handle a large number of datasets with thousands of experiments. Parallel processing can be used to address the above-mentioned issues. Hadoop is a well-known and ideal distributed file system framework that provides a parallel environment to run the experiment. In this research, for the first time, the capability of Hadoop and statistical power of R have been leveraged to parallelize the available preprocessing algorithm called RMA to efficiently process microarray data. The experiment has been run on cluster containing 5 nodes, while each node has 16 cores and 16 GB memory. It compares efficiency and the performance of parallelized RMA using Hadoop with parallelized RMA using affyPara package as well as sequential RMA. The result shows the speed-up rate of the proposed approach outperforms the sequential approach and affyPara approach. PMID:29796018
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mitchell, Christopher J; Ahrens, James P; Wang, Jun
2010-10-15
Petascale simulations compute at resolutions ranging into billions of cells and write terabytes of data for visualization and analysis. Interactive visuaUzation of this time series is a desired step before starting a new run. The I/O subsystem and associated network often are a significant impediment to interactive visualization of time-varying data; as they are not configured or provisioned to provide necessary I/O read rates. In this paper, we propose a new I/O library for visualization applications: VisIO. Visualization applications commonly use N-to-N reads within their parallel enabled readers which provides an incentive for a shared-nothing approach to I/O, similar tomore » other data-intensive approaches such as Hadoop. However, unlike other data-intensive applications, visualization requires: (1) interactive performance for large data volumes, (2) compatibility with MPI and POSIX file system semantics for compatibility with existing infrastructure, and (3) use of existing file formats and their stipulated data partitioning rules. VisIO, provides a mechanism for using a non-POSIX distributed file system to provide linear scaling of 110 bandwidth. In addition, we introduce a novel scheduling algorithm that helps to co-locate visualization processes on nodes with the requested data. Testing using VisIO integrated into Para View was conducted using the Hadoop Distributed File System (HDFS) on TACC's Longhorn cluster. A representative dataset, VPIC, across 128 nodes showed a 64.4% read performance improvement compared to the provided Lustre installation. Also tested, was a dataset representing a global ocean salinity simulation that showed a 51.4% improvement in read performance over Lustre when using our VisIO system. VisIO, provides powerful high-performance I/O services to visualization applications, allowing for interactive performance with ultra-scale, time-series data.« less
Recent developments in user-job management with Ganga
NASA Astrophysics Data System (ADS)
Currie, R.; Elmsheuser, J.; Fay, R.; Owen, P. H.; Richards, A.; Slater, M.; Sutcliffe, W.; Williams, M.
2015-12-01
The Ganga project was originally developed for use by LHC experiments and has been used extensively throughout Run1 in both LHCb and ATLAS. This document describes some the most recent developments within the Ganga project. There have been improvements in the handling of large scale computational tasks in the form of a new GangaTasks infrastructure. Improvements in file handling through using a new IGangaFile interface makes handling files largely transparent to the end user. In addition to this the performance and usability of Ganga have both been addressed through the development of a new queues system allows for parallel processing of job related tasks.
Multiple Independent File Parallel I/O with HDF5
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, M. C.
2016-07-13
The HDF5 library has supported the I/O requirements of HPC codes at Lawrence Livermore National Labs (LLNL) since the late 90’s. In particular, HDF5 used in the Multiple Independent File (MIF) parallel I/O paradigm has supported LLNL code’s scalable I/O requirements and has recently been gainfully used at scales as large as O(10 6) parallel tasks.
Storage of sparse files using parallel log-structured file system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bent, John M.; Faibish, Sorin; Grider, Gary
A sparse file is stored without holes by storing a data portion of the sparse file using a parallel log-structured file system; and generating an index entry for the data portion, the index entry comprising a logical offset, physical offset and length of the data portion. The holes can be restored to the sparse file upon a reading of the sparse file. The data portion can be stored at a logical end of the sparse file. Additional storage efficiency can optionally be achieved by (i) detecting a write pattern for a plurality of the data portions and generating a singlemore » patterned index entry for the plurality of the patterned data portions; and/or (ii) storing the patterned index entries for a plurality of the sparse files in a single directory, wherein each entry in the single directory comprises an identifier of a corresponding sparse file.« less
NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
NASA Astrophysics Data System (ADS)
Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A.
2010-09-01
The latest release of NWChem delivers an open-source computational chemistry package with extensive capabilities for large scale simulations of chemical and biological systems. Utilizing a common computational framework, diverse theoretical descriptions can be used to provide the best solution for a given scientific problem. Scalable parallel implementations and modular software design enable efficient utilization of current computational architectures. This paper provides an overview of NWChem focusing primarily on the core theoretical modules provided by the code and their parallel performance. Program summaryProgram title: NWChem Catalogue identifier: AEGI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGI_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Open Source Educational Community License No. of lines in distributed program, including test data, etc.: 11 709 543 No. of bytes in distributed program, including test data, etc.: 680 696 106 Distribution format: tar.gz Programming language: Fortran 77, C Computer: all Linux based workstations and parallel supercomputers, Windows and Apple machines Operating system: Linux, OS X, Windows Has the code been vectorised or parallelized?: Code is parallelized Classification: 2.1, 2.2, 3, 7.3, 7.7, 16.1, 16.2, 16.3, 16.10, 16.13 Nature of problem: Large-scale atomistic simulations of chemical and biological systems require efficient and reliable methods for ground and excited solutions of many-electron Hamiltonian, analysis of the potential energy surface, and dynamics. Solution method: Ground and excited solutions of many-electron Hamiltonian are obtained utilizing density-functional theory, many-body perturbation approach, and coupled cluster expansion. These solutions or a combination thereof with classical descriptions are then used to analyze potential energy surface and perform dynamical simulations. Additional comments: Full documentation is provided in the distribution file. This includes an INSTALL file giving details of how to build the package. A set of test runs is provided in the examples directory. The distribution file for this program is over 90 Mbytes and therefore is not delivered directly when download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. Running time: Running time depends on the size of the chemical system, complexity of the method, number of cpu's and the computational task. It ranges from several seconds for serial DFT energy calculations on a few atoms to several hours for parallel coupled cluster energy calculations on tens of atoms or ab-initio molecular dynamics simulation on hundreds of atoms.
Distributing an executable job load file to compute nodes in a parallel computer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gooding, Thomas M.
Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications linkmore » over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.« less
The ATLAS EventIndex: architecture, design choices, deployment and first operation experience
NASA Astrophysics Data System (ADS)
Barberis, D.; Cárdenas Zárate, S. E.; Cranshaw, J.; Favareto, A.; Fernández Casaní, Á.; Gallas, E. J.; Glasman, C.; González de la Hoz, S.; Hřivnáč, J.; Malon, D.; Prokoshin, F.; Salt Cairols, J.; Sánchez, J.; Többicke, R.; Yuan, R.
2015-12-01
The EventIndex is the complete catalogue of all ATLAS events, keeping the references to all files that contain a given event in any processing stage. It replaces the TAG database, which had been in use during LHC Run 1. For each event it contains its identifiers, the trigger pattern and the GUIDs of the files containing it. Major use cases are event picking, feeding the Event Service used on some production sites, and technical checks of the completion and consistency of processing campaigns. The system design is highly modular so that its components (data collection system, storage system based on Hadoop, query web service and interfaces to other ATLAS systems) could be developed separately and in parallel during LSI. The EventIndex is in operation for the start of LHC Run 2. This paper describes the high-level system architecture, the technical design choices and the deployment process and issues. The performance of the data collection and storage systems, as well as the query services, are also reported.
LVFS: A Scalable Petabye/Exabyte Data Storage System
NASA Astrophysics Data System (ADS)
Golpayegani, N.; Halem, M.; Masuoka, E. J.; Ye, G.; Devine, N. K.
2013-12-01
Managing petabytes of data with hundreds of millions of files is the first step necessary towards an effective big data computing and collaboration environment in a distributed system. We describe here the MODAPS LAADS Virtual File System (LVFS), a new storage architecture which replaces the previous MODAPS operational Level 1 Land Atmosphere Archive Distribution System (LAADS) NFS based approach to storing and distributing datasets from several instruments, such as MODIS, MERIS, and VIIRS. LAADS is responsible for the distribution of over 4 petabytes of data and over 300 million files across more than 500 disks. We present here the first LVFS big data comparative performance results and new capabilities not previously possible with the LAADS system. We consider two aspects in addressing inefficiencies of massive scales of data. First, is dealing in a reliable and resilient manner with the volume and quantity of files in such a dataset, and, second, minimizing the discovery and lookup times for accessing files in such large datasets. There are several popular file systems that successfully deal with the first aspect of the problem. Their solution, in general, is through distribution, replication, and parallelism of the storage architecture. The Hadoop Distributed File System (HDFS), Parallel Virtual File System (PVFS), and Lustre are examples of such file systems that deal with petabyte data volumes. The second aspect deals with data discovery among billions of files, the largest bottleneck in reducing access time. The metadata of a file, generally represented in a directory layout, is stored in ways that are not readily scalable. This is true for HDFS, PVFS, and Lustre as well. Recent experimental file systems, such as Spyglass or Pantheon, have attempted to address this problem through redesign of the metadata directory architecture. LVFS takes a radically different architectural approach by eliminating the need for a separate directory within the file system. The LVFS system replaces the NFS disk mounting approach of LAADS and utilizes the already existing highly optimized metadata database server, which is applicable to most scientific big data intensive compute systems. Thus, LVFS ties the existing storage system with the existing metadata infrastructure system which we believe leads to a scalable exabyte virtual file system. The uniqueness of the implemented design is not limited to LAADS but can be employed with most scientific data processing systems. By utilizing the Filesystem In Userspace (FUSE), a kernel module available in many operating systems, LVFS was able to replace the NFS system while staying POSIX compliant. As a result, the LVFS system becomes scalable to exabyte sizes owing to the use of highly scalable database servers optimized for metadata storage. The flexibility of the LVFS design allows it to organize data on the fly in different ways, such as by region, date, instrument or product without the need for duplication, symbolic links, or any other replication methods. We proposed here a strategic reference architecture that addresses the inefficiencies of scientific petabyte/exabyte file system access through the dynamic integration of the observing system's large metadata file.
Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji
2015-01-01
GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310–323. doi: 10.1002/wcms.1220 PMID:26753008
Dynamic Non-Hierarchical File Systems for Exascale Storage
DOE Office of Scientific and Technical Information (OSTI.GOV)
Long, Darrell E.; Miller, Ethan L
This constitutes the final report for “Dynamic Non-Hierarchical File Systems for Exascale Storage”. The ultimate goal of this project was to improve data management in scientific computing and high-end computing (HEC) applications, and to achieve this goal we proposed: to develop the first, HEC-targeted, file system featuring rich metadata and provenance collection, extreme scalability, and future storage hardware integration as core design goals, and to evaluate and develop a flexible non-hierarchical file system interface suitable for providing more powerful and intuitive data management interfaces to HEC and scientific computing users. Data management is swiftly becoming a serious problem in themore » scientific community – while copious amounts of data are good for obtaining results, finding the right data is often daunting and sometimes impossible. Scientists participating in a Department of Energy workshop noted that most of their time was spent “...finding, processing, organizing, and moving data and it’s going to get much worse”. Scientists should not be forced to become data mining experts in order to retrieve the data they want, nor should they be expected to remember the naming convention they used several years ago for a set of experiments they now wish to revisit. Ideally, locating the data you need would be as easy as browsing the web. Unfortunately, existing data management approaches are usually based on hierarchical naming, a 40 year-old technology designed to manage thousands of files, not exabytes of data. Today’s systems do not take advantage of the rich array of metadata that current high-end computing (HEC) file systems can gather, including content-based metadata and provenance1 information. As a result, current metadata search approaches are typically ad hoc and often work by providing a parallel management system to the “main” file system, as is done in Linux (the locate utility), personal computers, and enterprise search appliances. These search applications are often optimized for a single file system, making it difficult to move files and their metadata between file systems. Users have tried to solve this problem in several ways, including the use of separate databases to index file properties, the encoding of file properties into file names, and separately gathering and managing provenance data, but none of these approaches has worked well, either due to limited usefulness or scalability, or both. Our research addressed several key issues: High-performance, real-time metadata harvesting: extracting important attributes from files dynamically and immediately updating indexes used to improve search; Transparent, automatic, and secure provenance capture: recording the data inputs and processing steps used in the production of each file in the system; Scalable indexing: indexes that are optimized for integration with the file system; Dynamic file system structure: our approach provides dynamic directories similar to those in semantic file systems, but these are the native organization rather than a feature grafted onto a conventional system. In addition to these goals, our research effort will include evaluating the impact of new storage technologies on the file system design and performance. In particular, the indexing and metadata harvesting functions can potentially benefit from the performance improvements promised by new storage class memories.« less
Field of genes: using Apache Kafka as a bioinformatic data repository.
Lawlor, Brendan; Lynch, Richard; Mac Aogáin, Micheál; Walsh, Paul
2018-04-01
Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI's) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI's RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haghighat, A.; Sjoden, G.E.; Wagner, J.C.
In the past 10 yr, the Penn State Transport Theory Group (PSTTG) has concentrated its efforts on developing accurate and efficient particle transport codes to address increasing needs for efficient and accurate simulation of nuclear systems. The PSTTG's efforts have primarily focused on shielding applications that are generally treated using multigroup, multidimensional, discrete ordinates (S{sub n}) deterministic and/or statistical Monte Carlo methods. The difficulty with the existing public codes is that they require significant (impractical) computation time for simulation of complex three-dimensional (3-D) problems. For the S{sub n} codes, the large memory requirements are handled through the use of scratchmore » files (i.e., read-from and write-to-disk) that significantly increases the necessary execution time. Further, the lack of flexible features and/or utilities for preparing input and processing output makes these codes difficult to use. The Monte Carlo method becomes impractical because variance reduction (VR) methods have to be used, and normally determination of the necessary parameters for the VR methods is very difficult and time consuming for a complex 3-D problem. For the deterministic method, the authors have developed the 3-D parallel PENTRAN (Parallel Environment Neutral-particle TRANsport) code system that, in addition to a parallel 3-D S{sub n} solver, includes pre- and postprocessing utilities. PENTRAN provides for full phase-space decomposition, memory partitioning, and parallel input/output to provide the capability of solving large problems in a relatively short time. Besides having a modular parallel structure, PENTRAN has several unique new formulations and features that are necessary for achieving high parallel performance. For the Monte Carlo method, the major difficulty currently facing most users is the selection of an effective VR method and its associated parameters. For complex problems, generally, this process is very time consuming and may be complicated due to the possibility of biasing the results. In an attempt to eliminate this problem, the authors have developed the A{sup 3}MCNP (automated adjoint accelerated MCNP) code that automatically prepares parameters for source and transport biasing within a weight-window VR approach based on the S{sub n} adjoint function. A{sup 3}MCNP prepares the necessary input files for performing multigroup, 3-D adjoint S{sub n} calculations using TORT.« less
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Loikith, P.; Lee, H.; McGibbney, L. J.; Whitehall, K. D.
2014-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. The goals of SciSpark are to: (1) Decrease the time to compute comparison statistics and plots from minutes to seconds; (2) Allow for interactive exploration of time-series properties over seasons and years; (3) Decrease the time for satellite data ingestion into RCMES to hours; (4) Allow for Level-2 comparisons with higher-order statistics or PDF's in minutes to hours; and (5) Move RCMES into a near real time decision-making platform. We will report on: the architecture and design of SciSpark, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
IOPA: I/O-aware parallelism adaption for parallel programs
Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei
2017-01-01
With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads. PMID:28278236
IOPA: I/O-aware parallelism adaption for parallel programs.
Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei
2017-01-01
With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads.
Reconstructing evolutionary trees in parallel for massive sequences.
Zou, Quan; Wan, Shixiang; Zeng, Xiangxiang; Ma, Zhanshan Sam
2017-12-14
Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark are developed recently, which bring spring light for the classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel. HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on the >1GB files, and gets better performance than other evolutionary reconstruction tools. Users could use HPTree for reonstructing evolutioanry trees on the computer clusters or cloud platform (eg. Amazon Cloud). HPTree could help on population evolution research and metagenomics analysis. In this paper, we employ the Hadoop and Spark platform and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building. We opened our software together with source codes via http://lab.malab.cn/soft/HPtree/ .
Exploring the use of I/O nodes for computation in a MIMD multiprocessor
NASA Technical Reports Server (NTRS)
Kotz, David; Cai, Ting
1995-01-01
As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF
NASA Astrophysics Data System (ADS)
Li, Z.; Yang, C. P.; Schnase, J. L.; Duffy, D.; Lee, T. J.
2014-12-01
Climate observations and model simulations are collecting and generating vast amounts of climate data, and these data are ever-increasing and being accumulated in a rapid speed. Effectively managing and analyzing these data are essential for climate change studies. Hadoop, a distributed storage and processing framework for large data sets, has attracted increasing attentions in dealing with the Big Data challenge. The maturity of Infrastructure as a Service (IaaS) of cloud computing further accelerates the adoption of Hadoop in solving Big Data problems. However, Hadoop is designed to process unstructured data such as texts, documents and web pages, and cannot effectively handle the scientific data format such as array-based NetCDF files and other binary data format. In this paper, we propose to build a Hadoop-based middleware for transparently handling big NetCDF data by 1) designing a distributed climate data storage mechanism based on POSIX-enabled parallel file system to enable parallel big data processing with MapReduce, as well as support data access by other systems; 2) modifying the Hadoop framework to transparently processing NetCDF data in parallel without sequencing or converting the data into other file formats, or loading them to HDFS; and 3) seamlessly integrating Hadoop, cloud computing and climate data in a highly scalable and fault-tolerance framework.
Parallel checksumming of data chunks of a shared data object using a log-structured file system
Bent, John M.; Faibish, Sorin; Grider, Gary
2016-09-06
Checksum values are generated and used to verify the data integrity. A client executing in a parallel computing system stores a data chunk to a shared data object on a storage node in the parallel computing system. The client determines a checksum value for the data chunk; and provides the checksum value with the data chunk to the storage node that stores the shared object. The data chunk can be stored on the storage node with the corresponding checksum value as part of the shared object. The storage node may be part of a Parallel Log-Structured File System (PLFS), and the client may comprise, for example, a Log-Structured File System client on a compute node or burst buffer. The checksum value can be evaluated when the data chunk is read from the storage node to verify the integrity of the data that is read.
Compiler-Directed File Layout Optimization for Hierarchical Storage Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ding, Wei; Zhang, Yuanrui; Kandemir, Mahmut
File layout of array data is a critical factor that effects the behavior of storage caches, and has so far taken not much attention in the context of hierarchical storage systems. The main contribution of this paper is a compiler-driven file layout optimization scheme for hierarchical storage caches. This approach, fully automated within an optimizing compiler, analyzes a multi-threaded application code and determines a file layout for each disk-resident array referenced by the code, such that the performance of the target storage cache hierarchy is maximized. We tested our approach using 16 I/O intensive application programs and compared its performancemore » against two previously proposed approaches under different cache space management schemes. Our experimental results show that the proposed approach improves the execution time of these parallel applications by 23.7% on average.« less
Compiler-Directed File Layout Optimization for Hierarchical Storage Systems
Ding, Wei; Zhang, Yuanrui; Kandemir, Mahmut; ...
2013-01-01
File layout of array data is a critical factor that effects the behavior of storage caches, and has so far taken not much attention in the context of hierarchical storage systems. The main contribution of this paper is a compiler-driven file layout optimization scheme for hierarchical storage caches. This approach, fully automated within an optimizing compiler, analyzes a multi-threaded application code and determines a file layout for each disk-resident array referenced by the code, such that the performance of the target storage cache hierarchy is maximized. We tested our approach using 16 I/O intensive application programs and compared its performancemore » against two previously proposed approaches under different cache space management schemes. Our experimental results show that the proposed approach improves the execution time of these parallel applications by 23.7% on average.« less
Striped tertiary storage arrays
NASA Technical Reports Server (NTRS)
Drapeau, Ann L.
1993-01-01
Data stripping is a technique for increasing the throughput and reducing the response time of large access to a storage system. In striped magnetic or optical disk arrays, a single file is striped or interleaved across several disks; in a striped tape system, files are interleaved across tape cartridges. Because a striped file can be accessed by several disk drives or tape recorders in parallel, the sustained bandwidth to the file is greater than in non-striped systems, where access to the file are restricted to a single device. It is argued that applying striping to tertiary storage systems will provide needed performance and reliability benefits. The performance benefits of striping for applications using large tertiary storage systems is discussed. It will introduce commonly available tape drives and libraries, and discuss their performance limitations, especially focusing on the long latency of tape accesses. This section will also describe an event-driven tertiary storage array simulator that is being used to understand the best ways of configuring these storage arrays. The reliability problems of magnetic tape devices are discussed, and plans for modeling the overall reliability of striped tertiary storage arrays to identify the amount of error correction required are described. Finally, work being done by other members of the Sequoia group to address latency of accesses, optimizing tertiary storage arrays that perform mostly writes, and compression is discussed.
Life and dynamic capacity modeling for aircraft transmissions
NASA Technical Reports Server (NTRS)
Savage, Michael
1991-01-01
A computer program to simulate the dynamic capacity and life of parallel shaft aircraft transmissions is presented. Five basic configurations can be analyzed: single mesh, compound, parallel, reverted, and single plane reductions. In execution, the program prompts the user for the data file prefix name, takes input from a ASCII file, and writes its output to a second ASCII file with the same prefix name. The input data file includes the transmission configuration, the input shaft torque and speed, and descriptions of the transmission geometry and the component gears and bearings. The program output file describes the transmission, its components, their capabilities, locations, and loads. It also lists the dynamic capability, ninety percent reliability, and mean life of each component and the transmission as a system. Here, the program, its input and output files, and the theory behind the operation of the program are described.
Field of genes: using Apache Kafka as a bioinformatic data repository
Lynch, Richard; Walsh, Paul
2018-01-01
Abstract Background Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI’s) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI’s RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. Results The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Conclusions Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data. PMID:29635394
Striped Data Server for Scalable Parallel Data Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chang, Jin; Gutsche, Oliver; Mandrichenko, Igor
A columnar data representation is known to be an efficient way for data storage, specifically in cases when the analysis is often done based only on a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a columnar representation, which splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the [LDRD Project], we have developed a striped data representation, which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approachmore » allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed no-SQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple so that it allows for a common data model and APIs for wide range of underlying storage mechanisms such as distributed no-SQL databases and local file systems. Striped storage adopts Numpy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service, which allows to hide the server implementation details from the end user, easily exposes data to WAN users, and allows to utilize well known and developed data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We will present the striped format, Striped Data Server architecture and performance test results.« less
Data Storage and Transfer | High-Performance Computing | NREL
High-Performance Computing (HPC) systems. Photo of computer server wiring and lights, blurred to show data. WinSCP for Windows File Transfers Use to transfer files from a local computer to a remote computer. Robinhood for File Management Use this tool to manage your data files on Peregrine. Best
[PVFS 2000: An operational parallel file system for Beowulf
NASA Technical Reports Server (NTRS)
Ligon, Walt
2004-01-01
The approach has been to develop Parallel Virtual File System version 2 (PVFS2) , retaining the basic philosophy of the original file system but completely rewriting the code. It shows the architecture of the server and client components. BMI - BMI is the network abstraction layer. It is designed with a common driver and modules for each protocol supported. The interface is non-blocking, and provides mechanisms for optimizations including pinning user buffers. Currently TCP/IP and GM(Myrinet) modules have been implemented. Trove -Trove is the storage abstraction layer. It provides for storing both data spaces and name/value pairs. Trove can also be implemented using different underlying storage mechanisms including native files, raw disk partitions, SQL and other databases. The current implementation uses native files for data spaces and Berkeley db for name/value pairs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Jun
Our group has been working with ANL collaborators on the topic bridging the gap between parallel file system and local file system during the course of this project period. We visited Argonne National Lab -- Dr. Robert Ross's group for one week in the past summer 2007. We looked over our current project progress and planned the activities for the incoming years 2008-09. The PI met Dr. Robert Ross several times such as HEC FSIO workshop 08, SC08 and SC10. We explored the opportunities to develop a production system by leveraging our current prototype to (SOGP+PVFS) a new PVFS version.more » We delivered SOGP+PVFS codes to ANL PVFS2 group in 2008.We also talked about exploring a potential project on developing new parallel programming models and runtime systems for data-intensive scalable computing (DISC). The methodology is to evolve MPI towards DISC by incorporating some functions of Google MapReduce parallel programming model. More recently, we are together exploring how to leverage existing works to perform (1) coordination/aggregation of local I/O operations prior to movement over the WAN, (2) efficient bulk data movement over the WAN, (3) latency hiding techniques for latency-intensive operations. Since 2009, we start applying Hadoop/MapReduce to some HEC applications with LANL scientists John Bent and Salman Habib. Another on-going work is to improve checkpoint performance at I/O forwarding Layer for the Road Runner super computer with James Nuetz and Gary Gridder at LANL. Two senior undergraduates from our research group did summer internships about high-performance file and storage system projects in LANL since 2008 for consecutive three years. Both of them are now pursuing Ph.D. degree in our group and will be 4th year in the PhD program in Fall 2011 and go to LANL to advance two above-mentioned works during this winter break. Since 2009, we have been collaborating with several computer scientists (Gary Grider, John bent, Parks Fields, James Nunez, Hsing-Bung Chen, etc) from HPC5 and James Ahrens from Advanced Computing Laboratory in Los Alamos National Laboratory. We hold a weekly conference and/or video meeting on advancing works at two fronts: the hardware/software infrastructure of building large-scale data intensive cluster and research publications. Our group members assist in constructing several onsite LANL data intensive clusters. Two parties have been developing software codes and research papers together using both sides resources.« less
Brennan; Biddison; Frauendorf; Schwarcz; Keen; Ecker; Davis; Tinder; Swayze
1998-01-01
An automated, 96-well parallel array synthesizer for solid-phase organic synthesis has been designed and constructed. The instrument employs a unique reagent array delivery format, in which each reagent utilized has a dedicated plumbing system. An inert atmosphere is maintained during all phases of a synthesis, and temperature can be controlled via a thermal transfer plate which holds the injection molded reaction block. The reaction plate assembly slides in the X-axis direction, while eight nozzle blocks holding the reagent lines slide in the Y-axis direction, allowing for the extremely rapid delivery of any of 64 reagents to 96 wells. In addition, there are six banks of fixed nozzle blocks, which deliver the same reagent or solvent to eight wells at once, for a total of 72 possible reagents. The instrument is controlled by software which allows the straightforward programming of the synthesis of a larger number of compounds. This is accomplished by supplying a general synthetic procedure in the form of a command file, which calls upon certain reagents to be added to specific wells via lookup in a sequence file. The bottle position, flow rate, and concentration of each reagent is stored in a separate reagent table file. To demonstrate the utility of the parallel array synthesizer, a small combinatorial library of hydroxamic acids was prepared in high throughput mode for biological screening. Approximately 1300 compounds were prepared on a 10 μmole scale (3-5 mg) in a few weeks. The resulting crude compounds were generally >80% pure, and were utilized directly for high throughput screening in antibacterial assays. Several active wells were found, and the activity was verified by solution-phase synthesis of analytically pure material, indicating that the system described herein is an efficient means for the parallel synthesis of compounds for lead discovery. Copyright 1998 John Wiley & Sons, Inc.
P43-S Computational Biology Applications Suite for High-Performance Computing (BioHPC.net)
Pillardy, J.
2007-01-01
One of the challenges of high-performance computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use Web interface. Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, and MrBayes. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 8500 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system, which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, are both using our system for storage and data analysis. The system consists of a Web server running the interface (ASP.NET C#), Microsoft SQL server (ADO.NET), compute cluster running Microsoft Windows, ftp server, and file server. Users can interact with their jobs and data via a Web browser, ftp, or e-mail. The interface is accessible at http://cbsuapps.tc.cornell.edu/.
Fast parallel algorithm for slicing STL based on pipeline
NASA Astrophysics Data System (ADS)
Ma, Xulong; Lin, Feng; Yao, Bo
2016-05-01
In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.
A performance analysis of advanced I/O architectures for PC-based network file servers
NASA Astrophysics Data System (ADS)
Huynh, K. D.; Khoshgoftaar, T. M.
1994-12-01
In the personal computing and workstation environments, more and more I/O adapters are becoming complete functional subsystems that are intelligent enough to handle I/O operations on their own without much intervention from the host processor. The IBM Subsystem Control Block (SCB) architecture has been defined to enhance the potential of these intelligent adapters by defining services and conventions that deliver command information and data to and from the adapters. In recent years, a new storage architecture, the Redundant Array of Independent Disks (RAID), has been quickly gaining acceptance in the world of computing. In this paper, we would like to discuss critical system design issues that are important to the performance of a network file server. We then present a performance analysis of the SCB architecture and disk array technology in typical network file server environments based on personal computers (PCs). One of the key issues investigated in this paper is whether a disk array can outperform a group of disks (of same type, same data capacity, and same cost) operating independently, not in parallel as in a disk array.
High-performance computing — an overview
NASA Astrophysics Data System (ADS)
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
SeqMule: automated pipeline for analysis of human exome/genome sequencing data.
Guo, Yunfei; Ding, Xiaolei; Shen, Yufeng; Lyon, Gholson J; Wang, Kai
2015-09-18
Next-generation sequencing (NGS) technology has greatly helped us identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and no access to high-performance computing facility. Discrepancies exist among aligners and variant callers. We developed a computational pipeline, SeqMule, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate consensus set with high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts various combinations all by one-line command, therefore allowing highly flexible yet fully automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48 GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. SeqMule supports Sun Grid Engine for parallel processing, offers turn-key solution for deployment on Amazon Web Services, allows quality check, Mendelian error check, consistency evaluation, HTML-based reports. SeqMule is available at http://seqmule.openbioinformatics.org.
Performance Metrics for Monitoring Parallel Program Executions
NASA Technical Reports Server (NTRS)
Sarukkai, Sekkar R.; Gotwais, Jacob K.; Yan, Jerry; Lum, Henry, Jr. (Technical Monitor)
1994-01-01
Existing tools for debugging performance of parallel programs either provide graphical representations of program execution or profiles of program executions. However, for performance debugging tools to be useful, such information has to be augmented with information that highlights the cause of poor program performance. Identifying the cause of poor performance necessitates the need for not only determining the significance of various performance problems on the execution time of the program, but also needs to consider the effect of interprocessor communications of individual source level data structures. In this paper, we present a suite of normalized indices which provide a convenient mechanism for focusing on a region of code with poor performance and highlights the cause of the problem in terms of processors, procedures and data structure interactions. All the indices are generated from trace files augmented with data structure information.. Further, we show with the help of examples from the NAS benchmark suite that the indices help in detecting potential cause of poor performance, based on augmented execution traces obtained by monitoring the program.
An approach to enhance pnetCDF performance in ...
Data intensive simulations are often limited by their I/O (input/output) performance, and "novel" techniques need to be developed in order to overcome this limitation. The software package pnetCDF (parallel network Common Data Form), which works with parallel file systems, was developed to address this issue by providing parallel I/O capability. This study examines the performance of an application-level data aggregation approach which performs data aggregation along either row or column dimension of MPI (Message Passing Interface) processes on a spatially decomposed domain, and then applies the pnetCDF parallel I/O paradigm. The test was done with three different domain sizes which represent small, moderately large, and large data domains, using a small-scale Community Multiscale Air Quality model (CMAQ) mock-up code. The examination includes comparing I/O performance with traditional serial I/O technique, straight application of pnetCDF, and the data aggregation along row and column dimension before applying pnetCDF. After the comparison, "optimal" I/O configurations of this application-level data aggregation approach were quantified. Data aggregation along the row dimension (pnetCDFcr) works better than along the column dimension (pnetCDFcc) although it may perform slightly worse than the straight pnetCDF method with a small number of processors. When the number of processors becomes larger, pnetCDFcr outperforms pnetCDF significantly. If the number of proces
High performance x-ray anti-scatter grid
Logan, Clinton M.
1995-01-01
An x-ray anti-scatter grid for x-ray imaging, particularly for screening mammography, and method for fabricating same, x-rays incident along a direct path pass through a grid composed of a plurality of parallel or crossed openings, microchannels, grooves, or slots etched in a substrate, such as silicon, having the walls of the microchannels or slots coated with a high opacity material, such as gold, while x-rays incident at angels with respect to the slots of the grid, arising from scatter, are blocked. The thickness of the substrate is dependent on the specific application of the grid, whereby a substrate of the grid for mammography would be thinner than one for chest radiology. Instead of coating the walls of the slots, such could be filed with an appropriate liquid, such as mercury.
Benchmark Comparison of Cloud Analytics Methods Applied to Earth Observations
NASA Technical Reports Server (NTRS)
Lynnes, Chris; Little, Mike; Huang, Thomas; Jacob, Joseph; Yang, Phil; Kuo, Kwo-Sen
2016-01-01
Cloud computing has the potential to bring high performance computing capabilities to the average science researcher. However, in order to take full advantage of cloud capabilities, the science data used in the analysis must often be reorganized. This typically involves sharding the data across multiple nodes to enable relatively fine-grained parallelism. This can be either via cloud-based file systems or cloud-enabled databases such as Cassandra, Rasdaman or SciDB. Since storing an extra copy of data leads to increased cost and data management complexity, NASA is interested in determining the benefits and costs of various cloud analytics methods for real Earth Observation cases. Accordingly, NASA's Earth Science Technology Office and Earth Science Data and Information Systems project have teamed with cloud analytics practitioners to run a benchmark comparison on cloud analytics methods using the same input data and analysis algorithms. We have particularly looked at analysis algorithms that work over long time series, because these are particularly intractable for many Earth Observation datasets which typically store data with one or just a few time steps per file. This post will present side-by-side cost and performance results for several common Earth observation analysis operations.
Processing EOS MLS Level-2 Data
NASA Technical Reports Server (NTRS)
Snyder, W. Van; Wu, Dong; Read, William; Jiang, Jonathan; Wagner, Paul; Livesey, Nathaniel; Schwartz, Michael; Filipiak, Mark; Pumphrey, Hugh; Shippony, Zvi
2006-01-01
A computer program performs level-2 processing of thermal-microwave-radiance data from observations of the limb of the Earth by the Earth Observing System (EOS) Microwave Limb Sounder (MLS). The purpose of the processing is to estimate the composition and temperature of the atmosphere versus altitude from .8 to .90 km. "Level-2" as used here is a specialists f term signifying both vertical profiles of geophysical parameters along the measurement track of the instrument and processing performed by this or other software to generate such profiles. Designed to be flexible, the program is controlled via a configuration file that defines all aspects of processing, including contents of state and measurement vectors, configurations of forward models, measurement and calibration data to be read, and the manner of inverting the models to obtain the desired estimates. The program can operate in a parallel form in which one instance of the program acts a master, coordinating the work of multiple slave instances on a cluster of computers, each slave operating on a portion of the data. Optionally, the configuration file can be made to instruct the software to produce files of simulated radiances based on state vectors formed from sets of geophysical data-product files taken as input.
Parallel community climate model: Description and user`s guide
DOE Office of Scientific and Technical Information (OSTI.GOV)
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain intomore » geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.« less
GWM-VI: groundwater management with parallel processing for multiple MODFLOW versions
Banta, Edward R.; Ahlfeld, David P.
2013-01-01
Groundwater Management–Version Independent (GWM–VI) is a new version of the Groundwater Management Process of MODFLOW. The Groundwater Management Process couples groundwater-flow simulation with a capability to optimize stresses on the simulated aquifer based on an objective function and constraints imposed on stresses and aquifer state. GWM–VI extends prior versions of Groundwater Management in two significant ways—(1) it can be used with any version of MODFLOW that meets certain requirements on input and output, and (2) it is structured to allow parallel processing of the repeated runs of the MODFLOW model that are required to solve the optimization problem. GWM–VI uses the same input structure for files that describe the management problem as that used by prior versions of Groundwater Management. GWM–VI requires only minor changes to the input files used by the MODFLOW model. GWM–VI uses the Joint Universal Parameter IdenTification and Evaluation of Reliability Application Programming Interface (JUPITER-API) to implement both version independence and parallel processing. GWM–VI communicates with the MODFLOW model by manipulating certain input files and interpreting results from the MODFLOW listing file and binary output files. Nearly all capabilities of prior versions of Groundwater Management are available in GWM–VI. GWM–VI has been tested with MODFLOW-2005, MODFLOW-NWT (a Newton formulation for MODFLOW-2005), MF2005-FMP2 (the Farm Process for MODFLOW-2005), SEAWAT, and CFP (Conduit Flow Process for MODFLOW-2005). This report provides sample problems that demonstrate a range of applications of GWM–VI and the directory structure and input information required to use the parallel-processing capability.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, Weikuan; Vetter, Jeffrey S
Parallel NFS (pNFS) is touted as an emergent standard protocol for parallel I/O access in various storage environments. Several pNFS prototypes have been implemented for initial validation and protocol examination. Previous efforts have focused on realizing the pNFS protocol to expose the best bandwidth potential from underlying file and storage systems. In this presentation, we provide an initial characterization of two pNFS prototype implementations, lpNFS (a Lustre-based parallel NFS implementation) and spNFS (another reference implementation from Network Appliance, Inc.). We show that both lpNFS and spNFS can faithfully achieve the primary goal of pNFS, i.e., aggregating I/O bandwidth from manymore » storage servers. However, they both face the challenge of scalable metadata management. Particularly, the throughput of sp-NFS metadata operations degrades significanlty with an increasing number of data servers. Even for the better-performing lpNFS, we discuss its architecture and propose a direct I/O request flow protocol to improve its performance.« less
17 CFR 12.24 - Parallel proceedings.
Code of Federal Regulations, 2010 CFR
2010-04-01
...) Definition. For purposes of this section, a parallel proceeding shall include: (1) An arbitration proceeding... the receivership includes the resolution of claims made by customers; or (3) A petition filed under... any of the foregoing with knowledge of a parallel proceeding shall promptly notify the Commission, by...
MPI-IO: A Parallel File I/O Interface for MPI Version 0.3
NASA Technical Reports Server (NTRS)
Corbett, Peter; Feitelson, Dror; Hsu, Yarsun; Prost, Jean-Pierre; Snir, Marc; Fineberg, Sam; Nitzberg, Bill; Traversat, Bernard; Wong, Parkson
1995-01-01
Thanks to MPI [9], writing portable message passing parallel programs is almost a reality. One of the remaining problems is file I/0. Although parallel file systems support similar interfaces, the lack of a standard makes developing a truly portable program impossible. Further, the closest thing to a standard, the UNIX file interface, is ill-suited to parallel computing. Working together, IBM Research and NASA Ames have drafted MPI-I0, a proposal to address the portable parallel I/0 problem. In a nutshell, this proposal is based on the idea that I/0 can be modeled as message passing: writing to a file is like sending a message, and reading from a file is like receiving a message. MPI-IO intends to leverage the relatively wide acceptance of the MPI interface in order to create a similar I/0 interface. The above approach can be materialized in different ways. The current proposal represents the result of extensive discussions (and arguments), but is by no means finished. Many changes can be expected as additional participants join the effort to define an interface for portable I/0. This document is organized as follows. The remainder of this section includes a discussion of some issues that have shaped the style of the interface. Section 2 presents an overview of MPI-IO as it is currently defined. It specifies what the interface currently supports and states what would need to be added to the current proposal to make the interface more complete and robust. The next seven sections contain the interface definition itself. Section 3 presents definitions and conventions. Section 4 contains functions for file control, most notably open. Section 5 includes functions for independent I/O, both blocking and nonblocking. Section 6 includes functions for collective I/O, both blocking and nonblocking. Section 7 presents functions to support system-maintained file pointers, and shared file pointers. Section 8 presents constructors that can be used to define useful filetypes (the role of filetypes is explained in Section 2 below). Section 9 presents how the error handling mechanism of MPI is supported by the MPI-IO interface. All this is followed by a set of appendices, which contain information about issues that have not been totally resolved yet, and about design considerations. The reader can find there the motivation behind some of our design choices. More information on this would definitely be welcome and will be included in a further release of this document. The first appendix contains a description of MPI-I0's 'hints' structure which is used when opening a file. Appendix B is a discussion of various issues in the support for file pointers. Appendix C explains what we mean in talking about atomic access. Appendix D provides detailed examples of filetype constructors, and Appendix E contains a collection of arguments for and against various design decisions.
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1994-01-01
In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
. PDF file High pT muons in Cosmic-Ray Air Showers with IceCube. PDF file IceCube Performance with Artificial Light Sources: the road to a Cascade Analyses + Energy scale calibration for EHE. PDF file , 2006. PDF file Thorsten Stetzelberger "IceCube DAQ Design & Performance" Nov 2005 PPT
OPAL: An Open-Source MPI-IO Library over Cray XT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, Weikuan; Vetter, Jeffrey S; Canon, Richard Shane
Parallel IO over Cray XT is supported by a vendor-supplied MPI-IO package. This package contains a proprietary ADIO implementation built on top of the sysio library. While it is reasonable to maintain a stable code base for application scientists' convenience, it is also very important to the system developers and researchers to analyze and assess the effectiveness of parallel IO software, and accordingly, tune and optimize the MPI-IO implementation. A proprietary parallel IO code base relinquishes such flexibilities. On the other hand, a generic UFS-based MPI-IO implementation is typically used on many Linux-based platforms. We have developed an open-source MPI-IOmore » package over Lustre, referred to as OPAL (OPportunistic and Adaptive MPI-IO Library over Lustre). OPAL provides a single source-code base for MPI-IO over Lustre on Cray XT and Linux platforms. Compared to Cray implementation, OPAL provides a number of good features, including arbitrary specification of striping patterns and Lustre-stripe aligned file domain partitioning. This paper presents the performance comparisons between OPAL and Cray's proprietary implementation. Our evaluation demonstrates that OPAL achieves the performance comparable to the Cray implementation. We also exemplify the benefits of an open source package in revealing the underpinning of the parallel IO performance.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
HENNIGAN, GARY; SHADID, JOHN; SJAARDEMA, GREGORY
2009-06-08
Nem_spread reads it's input command file (default name nem_spread.inp), takes the named ExodusII geometry definition and spreads out the geometry (and optionally results) contained in that file out to a parallel disk system. The decomposition is taken from a scalar Nemesis load balance file generated by the companion utility nem_slice.
High performance x-ray anti-scatter grid
Logan, C.M.
1995-05-23
Disclosed are an x-ray anti-scatter grid for x-ray imaging, particularly for screening mammography, and method for fabricating same, x-rays incident along a direct path pass through a grid composed of a plurality of parallel or crossed openings, microchannels, grooves, or slots etched in a substrate, such as silicon, having the walls of the microchannels or slots coated with a high opacity material, such as gold, while x-rays incident at angels with respect to the slots of the grid, arising from scatter, are blocked. The thickness of the substrate is dependent on the specific application of the grid, whereby a substrate of the grid for mammography would be thinner than one for chest radiology. Instead of coating the walls of the slots, such could be filed with an appropriate liquid, such as mercury. 4 Figs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Curtis, Darren S.; Peterson, Elena S.; Oehmen, Chris S.
2008-05-04
This work presents the ScalaBLAST Web Application (SWA), a web based application implemented using the PHP script language, MySQL DBMS, and Apache web server under a GNU/Linux platform. SWA is an application built as part of the Data Intensive Computer for Complex Biological Systems (DICCBS) project at the Pacific Northwest National Laboratory (PNNL). SWA delivers accelerated throughput of bioinformatics analysis via high-performance computing through a convenient, easy-to-use web interface. This approach greatly enhances emerging fields of study in biology such as ontology-based homology, and multiple whole genome comparisons which, in the absence of a tool like SWA, require a heroicmore » effort to overcome the computational bottleneck associated with genome analysis. The current version of SWA includes a user account management system, a web based user interface, and a backend process that generates the files necessary for the Internet scientific community to submit a ScalaBLAST parallel processing job on a dedicated cluster.« less
Survey of MapReduce frame operation in bioinformatics.
Zou, Quan; Li, Xu-Bin; Jiang, Wen-Rui; Lin, Zi-Yu; Li, Gui-Lin; Chen, Ke
2014-07-01
Bioinformatics is challenged by the fact that traditional analysis tools have difficulty in processing large-scale data from high-throughput sequencing. The open source Apache Hadoop project, which adopts the MapReduce framework and a distributed file system, has recently given bioinformatics researchers an opportunity to achieve scalable, efficient and reliable computing performance on Linux clusters and on cloud computing services. In this article, we present MapReduce frame-based applications that can be employed in the next-generation sequencing and other biological domains. In addition, we discuss the challenges faced by this field as well as the future works on parallel computing in bioinformatics. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
An Ephemeral Burst-Buffer File System for Scientific Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Teng; Moody, Adam; Yu, Weikuan
BurstFS is a distributed file system for node-local burst buffers on high performance computing systems. BurstFS presents a shared file system space across the burst buffers so that applications that use shared files can access the highly-scalable burst buffers without changing their applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Sripathi, Vamsi; Mills, Richard T
2013-01-01
Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level where a set of designated processes participate in the I/O process by splitting themore » I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to rest of the group. This approach resulted in over 25X speedup in HDF I/O read performance and 3X speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library, SCORPIO (SCalable block-ORiented Parallel I/O) that incorporates our optimized two-phase I/O approach. The library provides a simplified higher level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5) and implements optimized I/O access patterns that can scale on larger number of processors. Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before with the added flexibility of being applicable to a wider range of I/O intensive applications.« less
High Performance Computing at NASA
NASA Technical Reports Server (NTRS)
Bailey, David H.; Cooper, D. M. (Technical Monitor)
1994-01-01
The speaker will give an overview of high performance computing in the U.S. in general and within NASA in particular, including a description of the recently signed NASA-IBM cooperative agreement. The latest performance figures of various parallel systems on the NAS Parallel Benchmarks will be presented. The speaker was one of the authors of the NAS (National Aerospace Standards) Parallel Benchmarks, which are now widely cited in the industry as a measure of sustained performance on realistic high-end scientific applications. It will be shown that significant progress has been made by the highly parallel supercomputer industry during the past year or so, with several new systems, based on high-performance RISC processors, that now deliver superior performance per dollar compared to conventional supercomputers. Various pitfalls in reporting performance will be discussed. The speaker will then conclude by assessing the general state of the high performance computing field.
75 FR 40815 - PJM Interconnection, L.L.C.; Notice of Filing
Federal Register 2010, 2011, 2012, 2013, 2014
2010-07-14
... Interconnection, L.L.C.; Notice of Filing July 7, 2010. Take notice that on July 1, 2010, PJM Interconnection, L.L.C. (PJM) filed revised sheets to Schedule 1 of the Amended and Restated Operating Agreement of PJM Interconnection, L.L.C. (Operating Agreement) and the parallel provisions of Attachment K--Appendix of the PJM...
Sawmill: A Logging File System for a High-Performance RAID Disk Array
1995-01-01
from limiting disk performance, new controller architectures connect the disks directly to the network so that data movement bypasses the file server...These developments raise two questions for file systems: how to get the best performance from a RAID, and how to use such a controller architecture ...the RAID-II storage system; this architecture provides a fast data path that moves data rapidly among the disks, high-speed controller memory, and the
MPgrafic: A parallel MPI version of Grafic-1
NASA Astrophysics Data System (ADS)
Prunet, Simon; Pichon, Christophe
2013-04-01
MPgrafic is a parallel MPI version of Grafic-1 which can produce large cosmological initial conditions on a cluster without requiring shared memory. The real Fourier transforms are carried in place using fftw while minimizing the amount of used memory (at the expense of performance) in the spirit of Grafic-1. The writing of the output file is also carried in parallel. In addition to the technical parallelization, it provides three extensions over Grafic-1: it can produce power spectra with baryon wiggles (DJ Eisenstein and W. Hu, Ap. J. 496);it has the optional ability to load a lower resolution noise map corresponding to the low frequency component which will fix the larger scale modes of the simulation (extra flag 0/1 at the end of the input process) in the spirit of Grafic-2;it can be used in conjunction with constrfield, which generates initial conditions phases from a list of local constraints on density, tidal field density gradient and velocity.
Besnier, Francois; Glover, Kevin A.
2013-01-01
This software package provides an R-based framework to make use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to those users of STRUCTURE dealing with numerous and repeated data analyses, and who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also consists of additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as it is currently the case in STRUCTURE. The package consists of two main functions: MPI_structure() and parallel_structure() as well as an example data file. We compared the performance in computing time for this example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in terms of computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/. PMID:23923012
Study of high-performance canonical molecular orbitals calculation for proteins
NASA Astrophysics Data System (ADS)
Hirano, Toshiyuki; Sato, Fumitoshi
2017-11-01
The canonical molecular orbital (CMO) calculation can help to understand chemical properties and reactions in proteins. However, it is difficult to perform the CMO calculation of proteins because of its self-consistent field (SCF) convergence problem and expensive computational cost. To certainly obtain the CMO of proteins, we work in research and development of high-performance CMO applications and perform experimental studies. We have proposed the third-generation density-functional calculation method of calculating the SCF, which is more advanced than the FILE and direct method. Our method is based on Cholesky decomposition for two-electron integrals calculation and the modified grid-free method for the pure-XC term evaluation. By using the third-generation density-functional calculation method, the Coulomb, the Fock-exchange, and the pure-XC terms can be given by simple linear algebraic procedure in the SCF loop. Therefore, we can expect to get a good parallel performance in solving the SCF problem by using a well-optimized linear algebra library such as BLAS on the distributed memory parallel computers. The third-generation density-functional calculation method is implemented to our program, ProteinDF. To achieve computing electronic structure of the large molecule, not only overcoming expensive computation cost and also good initial guess for safe SCF convergence are required. In order to prepare a precise initial guess for the macromolecular system, we have developed the quasi-canonical localized orbital (QCLO) method. The QCLO has the characteristics of both localized and canonical orbital in a certain region of the molecule. We have succeeded in the CMO calculations of proteins by using the QCLO method. For simplified and semi-automated calculation of the QCLO method, we have also developed a Python-based program, QCLObot.
Operating CFDP in the Interplanetary Internet
NASA Technical Reports Server (NTRS)
Burleigh, S.
2002-01-01
This paper examines the design elements of CCSDS File Delivery Protocol and Interplanetary Internet technologies that will simplify their integration and discusses the resulting new capabilities, such as efficient transmission of large files via multiple relay satellites operating in parallel.
V&V of MCNP 6.1.1 Beta Against Intermediate and High-Energy Experimental Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mashnik, Stepan G
This report presents a set of validation and verification (V&V) MCNP 6.1.1 beta results calculated in parallel, with MPI, obtained using its event generators at intermediate and high-energies compared against various experimental data. It also contains several examples of results using the models at energies below 150 MeV, down to 10 MeV, where data libraries are normally used. This report can be considered as the forth part of a set of MCNP6 Testing Primers, after its first, LA-UR-11-05129, and second, LA-UR-11-05627, and third, LA-UR-26944, publications, but is devoted to V&V with the latest, 1.1 beta version of MCNP6. The MCNP6more » test-problems discussed here are presented in the /VALIDATION_CEM/and/VALIDATION_LAQGSM/subdirectories in the MCNP6/Testing/directory. README files that contain short descriptions of every input file, the experiment, the quantity of interest that the experiment measures and its description in the MCNP6 output files, and the publication reference of that experiment are presented for every test problem. Templates for plotting the corresponding results with xmgrace as well as pdf files with figures representing the final results of our V&V efforts are presented. Several technical “bugs” in MCNP 6.1.1 beta were discovered during our current V&V of MCNP6 while running it in parallel with MPI using its event generators. These “bugs” are to be fixed in the following version of MCNP6. Our results show that MCNP 6.1.1 beta using its CEM03.03, LAQGSM03.03, Bertini, and INCL+ABLA, event generators describes, as a rule, reasonably well different intermediate- and high-energy measured data. This primer isn’t meant to be read from cover to cover. Readers may skip some sections and go directly to any test problem in which they are interested.« less
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. The experiments in the DNA and protein large scale data sets, which are more than 1GB files, showed that HAlign II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resource. THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gooding, Thomas M.
Distributing an executable job load file to compute nodes in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: determining, by a compute node in the parallel computer, whether the compute node is participating in a job; determining, by the compute node in the parallel computer, whether a descendant compute node is participating in the job; responsive to determining that the compute node is participating in the job or that the descendant compute node is participating in the job, communicating, by the compute node to a parent compute node, an identification of a data communications linkmore » over which the compute node receives data from the parent compute node; constructing a class route for the job, wherein the class route identifies all compute nodes participating in the job; and broadcasting the executable load file for the job along the class route for the job.« less
Parallel compression of data chunks of a shared data object using a log-structured file system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bent, John M.; Faibish, Sorin; Grider, Gary
2016-10-25
Techniques are provided for parallel compression of data chunks being written to a shared object. A client executing on a compute node or a burst buffer node in a parallel computing system stores a data chunk generated by the parallel computing system to a shared data object on a storage node by compressing the data chunk; and providing the data compressed data chunk to the storage node that stores the shared object. The client and storage node may employ Log-Structured File techniques. The compressed data chunk can be de-compressed by the client when the data chunk is read. A storagemore » node stores a data chunk as part of a shared object by receiving a compressed version of the data chunk from a compute node; and storing the compressed version of the data chunk to the shared data object on the storage node.« less
Collectively loading programs in a multiple program multiple data environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.
Techniques are disclosed for loading programs efficiently in a parallel computing system. In one embodiment, nodes of the parallel computing system receive a load description file which indicates, for each program of a multiple program multiple data (MPMD) job, nodes which are to load the program. The nodes determine, using collective operations, a total number of programs to load and a number of programs to load in parallel. The nodes further generate a class route for each program to be loaded in parallel, where the class route generated for a particular program includes only those nodes on which the programmore » needs to be loaded. For each class route, a node is selected using a collective operation to be a load leader which accesses a file system to load the program associated with a class route and broadcasts the program via the class route to other nodes which require the program.« less
Aerobraking Maneuver (ABM) Report Generator
NASA Technical Reports Server (NTRS)
Fisher, Forrest; Gladden, Roy; Khanampornpan, Teerapat
2008-01-01
abmREPORT Version 3.1 is a Perl script that extracts vital summarization information from the Mars Reconnaissance Orbiter (MRO) aerobraking ABM build process. This information facilitates sequence reviews, and provides a high-level summarization of the sequence for mission management. The script extracts information from the ENV, SSF, FRF, SCMFmax, and OPTG files and burn magnitude configuration files and presents them in a single, easy-to-check report that provides the majority of the parameters necessary for cross check and verification during the sequence review process. This means that needed information, formerly spread across a number of different files and each in a different format, is all available in this one application. This program is built on the capabilities developed in dragReport and then the scripts evolved as the two tools continued to be developed in parallel.
Global Magnetohydrodynamic Simulation Using High Performance FORTRAN on Parallel Computers
NASA Astrophysics Data System (ADS)
Ogino, T.
High Performance Fortran (HPF) is one of modern and common techniques to achieve high performance parallel computation. We have translated a 3-dimensional magnetohydrodynamic (MHD) simulation code of the Earth's magnetosphere from VPP Fortran to HPF/JA on the Fujitsu VPP5000/56 vector-parallel supercomputer and the MHD code was fully vectorized and fully parallelized in VPP Fortran. The entire performance and capability of the HPF MHD code could be shown to be almost comparable to that of VPP Fortran. A 3-dimensional global MHD simulation of the earth's magnetosphere was performed at a speed of over 400 Gflops with an efficiency of 76.5 VPP5000/56 in vector and parallel computation that permitted comparison with catalog values. We have concluded that fluid and MHD codes that are fully vectorized and fully parallelized in VPP Fortran can be translated with relative ease to HPF/JA, and a code in HPF/JA may be expected to perform comparably to the same code written in VPP Fortran.
Monitoring Data-Structure Evolution in Distributed Message-Passing Programs
NASA Technical Reports Server (NTRS)
Sarukkai, Sekhar R.; Beers, Andrew; Woodrow, Thomas S. (Technical Monitor)
1996-01-01
Monitoring the evolution of data structures in parallel and distributed programs, is critical for debugging its semantics and performance. However, the current state-of-art in tracking and presenting data-structure information on parallel and distributed environments is cumbersome and does not scale. In this paper we present a methodology that automatically tracks memory bindings (not the actual contents) of static and dynamic data-structures of message-passing C programs, using PVM. With the help of a number of examples we show that in addition to determining the impact of memory allocation overheads on program performance, graphical views can help in debugging the semantics of program execution. Scalable animations of virtual address bindings of source-level data-structures are used for debugging the semantics of parallel programs across all processors. In conjunction with light-weight core-files, this technique can be used to complement traditional debuggers on single processors. Detailed information (such as data-structure contents), on specific nodes, can be determined using traditional debuggers after the data structure evolution leading to the semantic error is observed graphically.
Zebra: A striped network file system
NASA Technical Reports Server (NTRS)
Hartman, John H.; Ousterhout, John K.
1992-01-01
The design of Zebra, a striped network file system, is presented. Zebra applies ideas from log-structured file system (LFS) and RAID research to network file systems, resulting in a network file system that has scalable performance, uses its servers efficiently even when its applications are using small files, and provides high availability. Zebra stripes file data across multiple servers, so that the file transfer rate is not limited by the performance of a single server. High availability is achieved by maintaining parity information for the file system. If a server fails its contents can be reconstructed using the contents of the remaining servers and the parity information. Zebra differs from existing striped file systems in the way it stripes file data: Zebra does not stripe on a per-file basis; instead it stripes the stream of bytes written by each client. Clients write to the servers in units called stripe fragments, which are analogous to segments in an LFS. Stripe fragments contain file blocks that were written recently, without regard to which file they belong. This method of striping has numerous advantages over per-file striping, including increased server efficiency, efficient parity computation, and elimination of parity update.
NASA Astrophysics Data System (ADS)
Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa
2017-08-01
The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing
(KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
Please Move Inactive Files Off the /projects File System | High-Performance
Computing | NREL Please Move Inactive Files Off the /projects File System Please Move Inactive Files Off the /projects File System January 11, 2018 The /projects file system is a shared resource . This year this has created a space crunch - the file system is now about 90% full and we need your help
Climate Prediction Center - Reanalysis: Atmospheric Data
files; i.e., wgrib for GRIB-2 files wgrib2mv,wgrib2ms parallel processing with wgrib2 grb1to2.pl perl US government, DOC, NWS, NCEP or CPC. All spelling errors are property of the finder. comments
NASA Astrophysics Data System (ADS)
Fang, Z.; Ward, A. L.; Fang, Y.; Yabusaki, S.
2011-12-01
High-resolution geologic models have proven effective in improving the accuracy of subsurface flow and transport predictions. However, many of the parameters in subsurface flow and transport models cannot be determined directly at the scale of interest and must be estimated through inverse modeling. A major challenge, particularly in vadose zone flow and transport, is the inversion of the highly-nonlinear, high-dimensional problem as current methods are not readily scalable for large-scale, multi-process models. In this paper we describe the implementation of a fully automated approach for addressing complex parameter optimization and sensitivity issues on massively parallel multi- and many-core systems. The approach is based on the integration of PNNL's extreme scale Subsurface Transport Over Multiple Phases (eSTOMP) simulator, which uses the Global Array toolkit, with the Beowulf-Cluster inspired parallel nonlinear parameter estimation software, BeoPEST in the MPI mode. In the eSTOMP/BeoPEST implementation, a pre-processor generates all of the PEST input files based on the eSTOMP input file. Simulation results for comparison with observations are extracted automatically at each time step eliminating the need for post-process data extractions. The inversion framework was tested with three different experimental data sets: one-dimensional water flow at Hanford Grass Site; irrigation and infiltration experiment at the Andelfingen Site; and a three-dimensional injection experiment at Hanford's Sisson and Lu Site. Good agreements are achieved in all three applications between observations and simulations in both parameter estimates and water dynamics reproduction. Results show that eSTOMP/BeoPEST approach is highly scalable and can be run efficiently with hundreds or thousands of processors. BeoPEST is fault tolerant and new nodes can be dynamically added and removed. A major advantage of this approach is the ability to use high-resolution geologic models to preserve the spatial structure in the inverse model, which leads to better parameter estimates and improved predictions when using the inverse-conditioned realizations of parameter fields.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-05-04
... the powerhouse to Sierra Pacific Power's Fort Churchill generating station, parallel to an existing... powerhouse to the PDCI and then parallel to the PDCI to the Fort Churchill generating station; and (9...
Design and Execution of make-like, distributed Analyses based on Spotify’s Pipelining Package Luigi
NASA Astrophysics Data System (ADS)
Erdmann, M.; Fischer, B.; Fischer, R.; Rieger, M.
2017-10-01
In high-energy particle physics, workflow management systems are primarily used as tailored solutions in dedicated areas such as Monte Carlo production. However, physicists performing data analyses are usually required to steer their individual workflows manually which is time-consuming and often leads to undocumented relations between particular workloads. We present a generic analysis design pattern that copes with the sophisticated demands of end-to-end HEP analyses and provides a make-like execution system. It is based on the open-source pipelining package Luigi which was developed at Spotify and enables the definition of arbitrary workloads, so-called Tasks, and the dependencies between them in a lightweight and scalable structure. Further features are multi-user support, automated dependency resolution and error handling, central scheduling, and status visualization in the web. In addition to already built-in features for remote jobs and file systems like Hadoop and HDFS, we added support for WLCG infrastructure such as LSF and CREAM job submission, as well as remote file access through the Grid File Access Library. Furthermore, we implemented automated resubmission functionality, software sandboxing, and a command line interface with auto-completion for a convenient working environment. For the implementation of a t \\overline{{{t}}} H cross section measurement, we created a generic Python interface that provides programmatic access to all external information such as datasets, physics processes, statistical models, and additional files and values. In summary, the setup enables the execution of the entire analysis in a parallelized and distributed fashion with a single command.
Schnek: A C++ library for the development of parallel simulation codes on regular grids
NASA Astrophysics Data System (ADS)
Schmitz, Holger
2018-05-01
A large number of algorithms across the field of computational physics are formulated on grids with a regular topology. We present Schnek, a library that enables fast development of parallel simulations on regular grids. Schnek contains a number of easy-to-use modules that greatly reduce the amount of administrative code for large-scale simulation codes. The library provides an interface for reading simulation setup files with a hierarchical structure. The structure of the setup file is translated into a hierarchy of simulation modules that the developer can specify. The reader parses and evaluates mathematical expressions and initialises variables or grid data. This enables developers to write modular and flexible simulation codes with minimal effort. Regular grids of arbitrary dimension are defined as well as mechanisms for defining physical domain sizes, grid staggering, and ghost cells on these grids. Ghost cells can be exchanged between neighbouring processes using MPI with a simple interface. The grid data can easily be written into HDF5 files using serial or parallel I/O.
Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis
Navarro, Javier; Nevado, Bruno; Hernández, Porfidio; Vera, Gonzalo; Ramos-Onsins, Sebastián E
2017-01-01
The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately the frequency of the variants. Instead, algorithms designed to account for the frequency of SNPs give precise results for estimating the levels and the patterns of variability. These algorithms are focused on the unbiased estimation of the variability and not on the high recovery of SNPs. Here, we implemented a fast and optimized parallel algorithm that includes the method developed by Roesti et al and Lynch, which estimates the genotype of each individual at each site, considering the possibility to call both bases from the genotype, a single one or none. This algorithm does not consider the reference and therefore is independent of biases related to the reference nucleotide specified. The pipeline starts from a BAM file converted to pileup or mpileup format and the software outputs a FASTA file. The new program not only reduces the running times but also, given the improved use of resources, it allows its usage with smaller computers and large parallel computers, expanding its benefits to a wider range of researchers. The output file can be analyzed using software for population genetics analysis, such as the R library PopGenome, the software VariScan, and the program mstatspop for analysis considering positions with missing data. PMID:28894353
An effective XML based name mapping mechanism within StoRM
NASA Astrophysics Data System (ADS)
Corso, E.; Forti, A.; Ghiselli, A.; Magnoni, L.; Zappi, R.
2008-07-01
In a Grid environment the naming capability allows users to refer to specific data resources in a physical storage system using a high level logical identifier. This logical identifier is typically organized in a file system like structure, a hierarchical tree of names. Storage Resource Manager (SRM) services map the logical identifier to the physical location of data evaluating a set of parameters as the desired quality of services and the VOMS attributes specified in the requests. StoRM is a SRM service developed by INFN and ICTP-EGRID to manage file and space on standard POSIX and high performing parallel and cluster file systems. An upcoming requirement in the Grid data scenario is the orthogonality of the logical name and the physical location of data, in order to refer, with the same identifier, to different copies of data archived in various storage areas with different quality of service. The mapping mechanism proposed in StoRM is based on a XML document that represents the different storage components managed by the service, the storage areas defined by the site administrator, the quality of service they provide and the Virtual Organization that want to use the storage area. An appropriate directory tree is realized in each storage component reflecting the XML schema. In this scenario StoRM is able to identify the physical location of a requested data evaluating the logical identifier and the specified attributes following the XML schema, without querying any database service. This paper presents the namespace schema defined, the different entities represented and the technical details of the StoRM implementation.
2013-03-15
310) 451-7002; Fax: (310) 451-6915; Email : order@rand.org The research described in this report was conducted as part of a series of previously...the office of a project sponsor at computer 11 and then sent through email from computer 11 to computer 12 over network 14. Sometimes, this file is...operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover
ImageJ: Image processing and analysis in Java
NASA Astrophysics Data System (ADS)
Rasband, W. S.
2012-06-01
ImageJ is a public domain Java image processing program inspired by NIH Image. It can display, edit, analyze, process, save and print 8-bit, 16-bit and 32-bit images. It can read many image formats including TIFF, GIF, JPEG, BMP, DICOM, FITS and "raw". It supports "stacks", a series of images that share a single window. It is multithreaded, so time-consuming operations such as image file reading can be performed in parallel with other operations.
NASA Astrophysics Data System (ADS)
Baregheh, Mandana; Mezentsev, Vladimir; Schmitz, Holger
2011-06-01
We describe a parallel multi-threaded approach for high performance modelling of wide class of phenomena in ultrafast nonlinear optics. Specific implementation has been performed using the highly parallel capabilities of a programmable graphics processor.
An Adaptable Seismic Data Format for Modern Scientific Workflows
NASA Astrophysics Data System (ADS)
Smith, J. A.; Bozdag, E.; Krischer, L.; Lefebvre, M.; Lei, W.; Podhorszki, N.; Tromp, J.
2013-12-01
Data storage, exchange, and access play a critical role in modern seismology. Current seismic data formats, such as SEED, SAC, and SEG-Y, were designed with specific applications in mind and are frequently a major bottleneck in implementing efficient workflows. We propose a new modern parallel format that can be adapted for a variety of seismic workflows. The Adaptable Seismic Data Format (ASDF) features high-performance parallel read and write support and the ability to store an arbitrary number of traces of varying sizes. Provenance information is stored inside the file so that users know the origin of the data as well as the precise operations that have been applied to the waveforms. The design of the new format is based on several real-world use cases, including earthquake seismology and seismic interferometry. The metadata is based on the proven XML schemas StationXML and QuakeML. Existing time-series analysis tool-kits are easily interfaced with this new format so that seismologists can use robust, previously developed software packages, such as ObsPy and the SAC library. ADIOS, netCDF4, and HDF5 can be used as the underlying container format. At Princeton University, we have chosen to use ADIOS as the container format because it has shown superior scalability for certain applications, such as dealing with big data on HPC systems. In the context of high-performance computing, we have implemented ASDF into the global adjoint tomography workflow on Oak Ridge National Laboratory's supercomputer Titan.
New Parallel computing framework for radiation transport codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kostin, M.A.; /Michigan State U., NSCL; Mokhov, N.V.
A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility canmore » be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.« less
NASA Astrophysics Data System (ADS)
Anantharaj, Valentine; Norman, Matthew; Evans, Katherine; Taylor, Mark; Worley, Patrick; Hack, James; Mayer, Benjamin
2014-05-01
During 2013, high-resolution climate model simulations accounted for over 100 million "core hours" using Titan at the Oak Ridge Leadership Computing Facility (OLCF). The suite of climate modeling experiments, primarily using the Community Earth System Model (CESM) at nearly 0.25 degree horizontal resolution, generated over a petabyte of data and nearly 100,000 files, ranging in sizes from 20 MB to over 100 GB. Effective utilization of leadership class resources requires careful planning and preparation. The application software, such as CESM, need to be ported, optimized and benchmarked for the target platform in order to meet the computational readiness requirements. The model configuration needs to be "tuned and balanced" for the experiments. This can be a complicated and resource intensive process, especially for high-resolution configurations using complex physics. The volume of I/O also increases with resolution; and new strategies may be required to manage I/O especially for large checkpoint and restart files that may require more frequent output for resiliency. It is also essential to monitor the application performance during the course of the simulation exercises. Finally, the large volume of data needs to be analyzed to derive the scientific results; and appropriate data and information delivered to the stakeholders. Titan is currently the largest supercomputer available for open science. The computational resources, in terms of "titan core hours" are allocated primarily via the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) and ASCR Leadership Computing Challenge (ALCC) programs, both sponsored by the U.S. Department of Energy (DOE) Office of Science. Titan is a Cray XK7 system, capable of a theoretical peak performance of over 27 PFlop/s, consists of 18,688 compute nodes, with a NVIDIA Kepler K20 GPU and a 16-core AMD Opteron CPU in every node, for a total of 299,008 Opteron cores and 18,688 GPUs offering a cumulative 560,640 equivalent cores. Scientific applications, such as CESM, are also required to demonstrate a "computational readiness capability" to efficiently scale across and utilize 20% of the entire system. The 0,25 deg configuration of the spectral element dynamical core of the Community Atmosphere Model (CAM-SE), the atmospheric component of CESM, has been demonstrated to scale efficiently across more than 5,000 nodes (80,000 CPU cores) on Titan. The tracer transport routines of CAM-SE have also been ported to take advantage of the hybrid many-core architecture of Titan using GPUs [see EGU2014-4233], yielding over 2X speedup when transporting over 100 tracers. The high throughput I/O in CESM, based on the Parallel IO Library (PIO), is being further augmented to support even higher resolutions and enhance resiliency. The application performance of the individual runs are archived in a database and routinely analyzed to identify and rectify performance degradation during the course of the experiments. The various resources available at the OLCF now support a scientific workflow to facilitate high-resolution climate modelling. A high-speed center-wide parallel file system, called ATLAS, capable of 1 TB/s, is available on Titan as well as on the clusters used for analysis (Rhea) and visualization (Lens/EVEREST). Long-term archive is facilitated by the HPSS storage system. The Earth System Grid (ESG), featuring search & discovery, is also used to deliver data. The end-to-end workflow allows OLCF users to efficiently share data and publish results in a timely manner.
Madanecki, Piotr; Bałut, Magdalena; Buckley, Patrick G; Ochocka, J Renata; Bartoszewski, Rafał; Crossman, David K; Messiaen, Ludwine M; Piotrowski, Arkadiusz
2018-01-01
High-throughput technologies generate considerable amount of data which often requires bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data that is produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expression, or command line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks including microarray and massively parallel sequencing, i.e. identification of disease predisposing variants in the next generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merge, reduction and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Flexibility, in terms of input file handling, provides long term potential functionality in high-throughput analysis pipelines, as the program is not limited by the currently existing applications and data formats. HTDP is available as the Open Source software (https://github.com/pmadanecki/htdp).
Bałut, Magdalena; Buckley, Patrick G.; Ochocka, J. Renata; Bartoszewski, Rafał; Crossman, David K.; Messiaen, Ludwine M.; Piotrowski, Arkadiusz
2018-01-01
High-throughput technologies generate considerable amount of data which often requires bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data that is produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expression, or command line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks including microarray and massively parallel sequencing, i.e. identification of disease predisposing variants in the next generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merge, reduction and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Flexibility, in terms of input file handling, provides long term potential functionality in high-throughput analysis pipelines, as the program is not limited by the currently existing applications and data formats. HTDP is available as the Open Source software (https://github.com/pmadanecki/htdp). PMID:29432475
Using Robinhood to Manage Files on Peregrine | High-Performance Computing |
: $ module load robinhood Here are some usage examples: To find all the files on /projects that you own scratch.conf -u $USER /scratch File purging on /scratch uses last access time to determine which files are
WinSCP for Windows File Transfers | High-Performance Computing | NREL
WinSCP for Windows File Transfers WinSCP for Windows File Transfers WinSCP for can used to securely transfer files between your local computer running Microsoft Windows and a remote computer running Linux
DOE Office of Scientific and Technical Information (OSTI.GOV)
Langer, S; Rotman, D; Schwegler, E
The Institutional Computing Executive Group (ICEG) review of FY05-06 Multiprogrammatic and Institutional Computing (M and IC) activities is presented in the attached report. In summary, we find that the M and IC staff does an outstanding job of acquiring and supporting a wide range of institutional computing resources to meet the programmatic and scientific goals of LLNL. The responsiveness and high quality of support given to users and the programs investing in M and IC reflects the dedication and skill of the M and IC staff. M and IC has successfully managed serial capacity, parallel capacity, and capability computing resources.more » Serial capacity computing supports a wide range of scientific projects which require access to a few high performance processors within a shared memory computer. Parallel capacity computing supports scientific projects that require a moderate number of processors (up to roughly 1000) on a parallel computer. Capability computing supports parallel jobs that push the limits of simulation science. M and IC has worked closely with Stockpile Stewardship, and together they have made LLNL a premier institution for computational and simulation science. Such a standing is vital to the continued success of laboratory science programs and to the recruitment and retention of top scientists. This report provides recommendations to build on M and IC's accomplishments and improve simulation capabilities at LLNL. We recommend that institution fully fund (1) operation of the atlas cluster purchased in FY06 to support a few large projects; (2) operation of the thunder and zeus clusters to enable 'mid-range' parallel capacity simulations during normal operation and a limited number of large simulations during dedicated application time; (3) operation of the new yana cluster to support a wide range of serial capacity simulations; (4) improvements to the reliability and performance of the Lustre parallel file system; (5) support for the new GDO petabyte-class storage facility on the green network for use in data intensive external collaborations; and (6) continued support for visualization and other methods for analyzing large simulations. We also recommend that M and IC begin planning in FY07 for the next upgrade of its parallel clusters. LLNL investments in M and IC have resulted in a world-class simulation capability leading to innovative science. We thank the LLNL management for its continued support and thank the M and IC staff for its vision and dedicated efforts to make it all happen.« less
NASA Astrophysics Data System (ADS)
Zaghi, S.
2014-07-01
OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluids, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc…) by a single object so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows: Programming LanguageOFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising the efficiency; Parallel Frameworks Supported the development of OFF has been also targeted to maximize the computational efficiency: the code is designed to run on shared-memory multi-cores workstations and distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms; Usability, Maintenance and Enhancement in order to improve the usability, maintenance and enhancement of the code also the documentation has been carefully taken into account; the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed): these comments are parsed by means of doxygen free software producing high quality html and latex documentation pages; the distributed versioning system referred as git has been adopted in order to facilitate the collaborative maintenance and improvement of the code; CopyrightsOFF is a free software that anyone can use, copy, distribute, study, change and improve under the GNU Public License version 3. The present paper is a manifesto of OFF code and presents the currently implemented features and ongoing developments. This work is focused on the computational techniques adopted and a detailed description of the main API characteristics is reported. OFF capabilities are demonstrated by means of one and two dimensional examples and a three dimensional real application.
McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; ...
2013-01-01
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility ofmore » checkpoints up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints show that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.« less
Novel 3D Compression Methods for Geometry, Connectivity and Texture
NASA Astrophysics Data System (ADS)
Siddeq, M. M.; Rodrigues, M. A.
2016-06-01
A large number of applications in medical visualization, games, engineering design, entertainment, heritage, e-commerce and so on require the transmission of 3D models over the Internet or over local networks. 3D data compression is an important requirement for fast data storage, access and transmission within bandwidth limitations. The Wavefront OBJ (object) file format is commonly used to share models due to its clear simple design. Normally each OBJ file contains a large amount of data (e.g. vertices and triangulated faces, normals, texture coordinates and other parameters) describing the mesh surface. In this paper we introduce a new method to compress geometry, connectivity and texture coordinates by a novel Geometry Minimization Algorithm (GM-Algorithm) in connection with arithmetic coding. First, each vertex ( x, y, z) coordinates are encoded to a single value by the GM-Algorithm. Second, triangle faces are encoded by computing the differences between two adjacent vertex locations, which are compressed by arithmetic coding together with texture coordinates. We demonstrate the method on large data sets achieving compression ratios between 87 and 99 % without reduction in the number of reconstructed vertices and triangle faces. The decompression step is based on a Parallel Fast Matching Search Algorithm (Parallel-FMS) to recover the structure of the 3D mesh. A comparative analysis of compression ratios is provided with a number of commonly used 3D file formats such as VRML, OpenCTM and STL highlighting the performance and effectiveness of the proposed method.
al3c: high-performance software for parameter inference using Approximate Bayesian Computation.
Stram, Alexander H; Marjoram, Paul; Chen, Gary K
2015-11-01
The development of Approximate Bayesian Computation (ABC) algorithms for parameter inference which are both computationally efficient and scalable in parallel computing environments is an important area of research. Monte Carlo rejection sampling, a fundamental component of ABC algorithms, is trivial to distribute over multiple processors but is inherently inefficient. While development of algorithms such as ABC Sequential Monte Carlo (ABC-SMC) help address the inherent inefficiencies of rejection sampling, such approaches are not as easily scaled on multiple processors. As a result, current Bayesian inference software offerings that use ABC-SMC lack the ability to scale in parallel computing environments. We present al3c, a C++ framework for implementing ABC-SMC in parallel. By requiring only that users define essential functions such as the simulation model and prior distribution function, al3c abstracts the user from both the complexities of parallel programming and the details of the ABC-SMC algorithm. By using the al3c framework, the user is able to scale the ABC-SMC algorithm in parallel computing environments for his or her specific application, with minimal programming overhead. al3c is offered as a static binary for Linux and OS-X computing environments. The user completes an XML configuration file and C++ plug-in template for the specific application, which are used by al3c to obtain the desired results. Users can download the static binaries, source code, reference documentation and examples (including those in this article) by visiting https://github.com/ahstram/al3c. astram@usc.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Next Generation Parallelization Systems for Processing and Control of PDS Image Node Assets
NASA Astrophysics Data System (ADS)
Verma, R.
2017-06-01
We present next-generation parallelization tools to help Planetary Data System (PDS) Imaging Node (IMG) better monitor, process, and control changes to nearly 650 million file assets and over a dozen machines on which they are referenced or stored.
Using the Parallel Computing Toolbox with MATLAB on the Peregrine System |
parallel pool took %g seconds.\\n', toc) % "single program multiple data" spmd fprintf('Worker %d says Hello World!\\n', labindex) end delete(gcp); % close the parallel pool exit To run the script on a compute node, create the file helloWorld.sub: #!/bin/bash #PBS -l walltime=05:00 #PBS -l nodes=1 #PBS -N
Graphical Representation of Parallel Algorithmic Processes
1990-12-01
interface with the AAARF main process . The source code for the AAARF class-common library is in the common subdi- rectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the...goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor
EOS MLS Level 2 Data Processing Software Version 3
NASA Technical Reports Server (NTRS)
Livesey, Nathaniel J.; VanSnyder, Livesey W.; Read, William G.; Schwartz, Michael J.; Lambert, Alyn; Santee, Michelle L.; Nguyen, Honghanh T.; Froidevaux, Lucien; wang, Shuhui; Manney, Gloria L.;
2011-01-01
This software accepts the EOS MLS calibrated measurements of microwave radiances products and operational meteorological data, and produces a set of estimates of atmospheric temperature and composition. This version has been designed to be as flexible as possible. The software is controlled by a Level 2 Configuration File that controls all aspects of the software: defining the contents of state and measurement vectors, defining the configurations of the various forward models available, reading appropriate a priori spectroscopic and calibration data, performing retrievals, post-processing results, computing diagnostics, and outputting results in appropriate files. In production mode, the software operates in a parallel form, with one instance of the program acting as a master, coordinating the work of multiple slave instances on a cluster of computers, each computing the results for individual chunks of data. In addition, to do conventional retrieval calculations and producing geophysical products, the Level 2 Configuration File can instruct the software to produce files of simulated radiances based on a state vector formed from a set of geophysical product files taken as input. Combining both the retrieval and simulation tasks in a single piece of software makes it far easier to ensure that identical forward model algorithms and parameters are used in both tasks. This also dramatically reduces the complexity of the code maintenance effort.
SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data
NASA Astrophysics Data System (ADS)
Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends ApacheTM Spark for scaling scientific computations. Apache Spark improves the map-reduce implementation in ApacheTM Hadoop for parallel computing on a cluster, by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, and not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate performance of the various matrix libraries in distributed pipelines, such as Nd4jTM and BreezeTM. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, parallel ingest and partitioning (sharding) of A-Train satellite observations from model grids. These solutions are encompassed in SciSpark, an open-source software framework for distributed computing on scientific data.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
NASA Astrophysics Data System (ADS)
Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram
2010-05-01
Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for management and processing of such datasets using binary large object implementations (BLOBs) in database systems versus implementation in Hadoop files using the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as bandwidth and response time performance. This requires partitioning larger files into a set of smaller files, and is accompanied by the concomitant requirement for managing large numbers of file. Storing these sub-files as blobs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these blobs and would run in parallel across the set of nodes in the cluster. On the other hand, there are both storage overheads and constraints, and software licensing dependencies created by such an implementation. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available. Another consideration is the strategy used for partitioning large data collections, and large datasets within collections, using round-robin vs hash partitioning vs range partitioning methods. Each has different characteristics in terms of spatial locality of data and resultant degree of declustering of the computations on the data. Furthermore, we have observed that, in practice, there can be large variations in the frequency of access to different parts of a large data collection and/or dataset, thereby creating "hotspots" in the data. We will evaluate the ability of different approaches for dealing effectively with such hotspots and alternative strategies for dealing with hotspots.
k-RP*{sub s}: A scalable distributed data structure for high-performance multi-attribute access
DOE Office of Scientific and Technical Information (OSTI.GOV)
Litwin, W.; Neimat, M.A.
k-RP*{sub s} is a new data structure for scalable multicomputer files with multi-attribute (k-d) keys. We discuss the k-RP*{sub s} file evolution and search algorithms. Performance analysis shows that a k-RP*{sub s} file can be much larger and orders of magnitude faster than a traditional k-d file. The speed-up is especially important for range and partial match searches that are often impractical with traditional k-d files. This opens up a new perspective for many applications.
PWHATSHAP: efficient haplotyping for future generation sequencing.
Bracciali, Andrea; Aldinucci, Marco; Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; Merelli, Ivan; Torquati, Massimo
2016-09-22
Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments. Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WHATSHAP, which increases with coverage. Due to its structure and management of the large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].
Furuta, Takuya; Sato, Tatsuhiko
2015-01-01
Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
New NAS Parallel Benchmarks Results
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)
1997-01-01
NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.
MPI, HPF or OpenMP: A Study with the NAS Benchmarks
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Hribar, Michelle; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1999-01-01
Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but the task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of HPF (High Performance Fortran, based on data parallel model) and OpenMP (based on shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to language like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation and pros and cons of different approaches will be discussed along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study,potentials of applying some of the techniques to realistic aerospace applications will be presented
MPI, HPF or OpenMP: A Study with the NAS Benchmarks
NASA Technical Reports Server (NTRS)
Jin, H.; Frumkin, M.; Hribar, M.; Waheed, A.; Yan, J.; Saini, Subhash (Technical Monitor)
1999-01-01
Porting applications to new high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly, but this task can be simplified by high level languages and would even better be automated by parallelizing tools and compilers. The definition of HPF (High Performance Fortran, based on data parallel model) and OpenMP (based on shared memory parallel model) standards has offered great opportunity in this respect. Both provide simple and clear interfaces to language like FORTRAN and simplify many tedious tasks encountered in writing message passing programs. In our study, we implemented the parallel versions of the NAS Benchmarks with HPF and OpenMP directives. Comparison of their performance with the MPI implementation and pros and cons of different approaches will be discussed along with experience of using computer-aided tools to help parallelize these benchmarks. Based on the study, potentials of applying some of the techniques to realistic aerospace applications will be presented.
Application of Open Source Technologies for Oceanographic Data Analysis
NASA Astrophysics Data System (ADS)
Huang, T.; Gangl, M.; Quach, N. T.; Wilson, B. D.; Chang, G.; Armstrong, E. M.; Chin, T. M.; Greguska, F.
2015-12-01
NEXUS is a data-intensive analysis solution developed with a new approach for handling science data that enables large-scale data analysis by leveraging open source technologies such as Apache Cassandra, Apache Spark, Apache Solr, and Webification. NEXUS has been selected to provide on-the-fly time-series and histogram generation for the Soil Moisture Active Passive (SMAP) mission for Level 2 and Level 3 Active, Passive, and Active Passive products. It also provides an on-the-fly data subsetting capability. NEXUS is designed to scale horizontally, enabling it to handle massive amounts of data in parallel. It takes a new approach on managing time and geo-referenced array data by dividing data artifacts into chunks and stores them in an industry-standard, horizontally scaled NoSQL database. This approach enables the development of scalable data analysis services that can infuse and leverage the elastic computing infrastructure of the Cloud. It is equipped with a high-performance geospatial and indexed data search solution, coupled with a high-performance data Webification solution free from file I/O bottlenecks, as well as a high-performance, in-memory data analysis engine. In this talk, we will focus on the recently funded AIST 2014 project by using NEXUS as the core for oceanographic anomaly detection service and web portal. We call it, OceanXtremes
NASA Astrophysics Data System (ADS)
Yan, Hui; Wang, K. G.; Jones, Jim E.
2016-06-01
A parallel algorithm for large-scale three-dimensional phase-field simulations of phase coarsening is developed and implemented on high-performance architectures. From the large-scale simulations, a new kinetics in phase coarsening in the region of ultrahigh volume fraction is found. The parallel implementation is capable of harnessing the greater computer power available from high-performance architectures. The parallelized code enables increase in three-dimensional simulation system size up to a 5123 grid cube. Through the parallelized code, practical runtime can be achieved for three-dimensional large-scale simulations, and the statistical significance of the results from these high resolution parallel simulations are greatly improved over those obtainable from serial simulations. A detailed performance analysis on speed-up and scalability is presented, showing good scalability which improves with increasing problem size. In addition, a model for prediction of runtime is developed, which shows a good agreement with actual run time from numerical tests.
ABM Drag_Pass Report Generator
NASA Technical Reports Server (NTRS)
Fisher, Forest; Gladden, Roy; Khanampornpan, Teerapat
2008-01-01
dragREPORT software was developed in parallel with abmREPORT, which is described in the preceding article. Both programs were built on the capabilities created during that process. This tool generates a drag_pass report that summarizes vital information from the MRO aerobreaking drag_pass build process to facilitate both sequence reviews and provide a high-level summarization of the sequence for mission management. The script extracts information from the ENV, SSF, FRF, SCMFmax, and OPTG files, presenting them in a single, easy-to-check report providing the majority of parameters needed for cross check and verification as part of the sequence review process. Prior to dragReport, all the needed information was spread across a number of different files, each in a different format. This software is a Perl script that extracts vital summarization information and build-process details from a number of source files into a single, concise report format used to aid the MPST sequence review process and to provide a high-level summarization of the sequence for mission management reference. This software could be adapted for future aerobraking missions to provide similar reports, review and summarization information.
NASA Astrophysics Data System (ADS)
Capone, V.; Esposito, R.; Pardi, S.; Taurino, F.; Tortone, G.
2012-12-01
Over the last few years we have seen an increasing number of services and applications needed to manage and maintain cloud computing facilities. This is particularly true for computing in high energy physics, which often requires complex configurations and distributed infrastructures. In this scenario a cost effective rationalization and consolidation strategy is the key to success in terms of scalability and reliability. In this work we describe an IaaS (Infrastructure as a Service) cloud computing system, with high availability and redundancy features, which is currently in production at INFN-Naples and ATLAS Tier-2 data centre. The main goal we intended to achieve was a simplified method to manage our computing resources and deliver reliable user services, reusing existing hardware without incurring heavy costs. A combined usage of virtualization and clustering technologies allowed us to consolidate our services on a small number of physical machines, reducing electric power costs. As a result of our efforts we developed a complete solution for data and computing centres that can be easily replicated using commodity hardware. Our architecture consists of 2 main subsystems: a clustered storage solution, built on top of disk servers running GlusterFS file system, and a virtual machines execution environment. GlusterFS is a network file system able to perform parallel writes on multiple disk servers, providing this way live replication of data. High availability is also achieved via a network configuration using redundant switches and multiple paths between hypervisor hosts and disk servers. We also developed a set of management scripts to easily perform basic system administration tasks such as automatic deployment of new virtual machines, adaptive scheduling of virtual machines on hypervisor hosts, live migration and automated restart in case of hypervisor failures.
Stocco, Andrea; Yamasaki, Brianna L; Prat, Chantel S
2018-04-01
This article describes the data analyzed in the paper "Individual differences in the Simon effect are underpinned by differences in the competitive dynamics in the basal ganglia: An experimental verification and a computational model" (Stocco et al., 2017) [1]. The data includes behavioral results from participants performing three cognitive tasks (Probabilistic Stimulus Selection (Frank et al., 2004) [2], Simon task (Craft and Simon, 1970) [3], and Automated Operation Span (Unsworth et al., 2005) [4]), as well as simulationed traces generated by a computational neurocognitive model that accounts for individual variations in human performance across the tasks. The experimental data encompasses individual data files (in both preprocessed and native output format) as well as group-level summary files. The simulation data includes the entire model code, the results of a full-grid search of the model's parameter space, and the code used to partition the model space and parallelize the simulations. Finally, the repository includes the R scripts used to carry out the statistical analyses reported in the original paper.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischer, G.A.
2011-07-01
Document available in abstract form only, full text of document follows: The dosimetry from the H. B. Robinson Unit 2 Pressure Vessel Benchmark is analyzed with a suite of Westinghouse-developed codes and data libraries. The radiation transport from the reactor core to the surveillance capsule and ex-vessel locations is performed by RAPTOR-M3G, a parallel deterministic radiation transport code that calculates high-resolution neutron flux information in three dimensions. The cross-section library used in this analysis is the ALPAN library, an Evaluated Nuclear Data File (ENDF)/B-VII.0-based library designed for reactor dosimetry and fluence analysis applications. Dosimetry is evaluated with the industry-standard SNLRMLmore » reactor dosimetry cross-section data library. (authors)« less
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
smallWig: parallel compression of RNA-seq WIG files.
Wang, Zhiying; Weissman, Tsachy; Milenkovic, Olgica
2016-01-15
We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms. The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates. We tested different variants of the smallWig compression algorithm on a number of integer-and real- (floating point) valued RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55% represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times. The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/. zhiyingw@stanford.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
High-Performance Compute Infrastructure in Astronomy: 2020 Is Only Months Away
NASA Astrophysics Data System (ADS)
Berriman, B.; Deelman, E.; Juve, G.; Rynge, M.; Vöckler, J. S.
2012-09-01
By 2020, astronomy will be awash with as much as 60 PB of public data. Full scientific exploitation of such massive volumes of data will require high-performance computing on server farms co-located with the data. Development of this computing model will be a community-wide enterprise that has profound cultural and technical implications. Astronomers must be prepared to develop environment-agnostic applications that support parallel processing. The community must investigate the applicability and cost-benefit of emerging technologies such as cloud computing to astronomy, and must engage the Computer Science community to develop science-driven cyberinfrastructure such as workflow schedulers and optimizers. We report here the results of collaborations between a science center, IPAC, and a Computer Science research institute, ISI. These collaborations may be considered pathfinders in developing a high-performance compute infrastructure in astronomy. These collaborations investigated two exemplar large-scale science-driver workflow applications: 1) Calculation of an infrared atlas of the Galactic Plane at 18 different wavelengths by placing data from multiple surveys on a common plate scale and co-registering all the pixels; 2) Calculation of an atlas of periodicities present in the public Kepler data sets, which currently contain 380,000 light curves. These products have been generated with two workflow applications, written in C for performance and designed to support parallel processing on multiple environments and platforms, but with different compute resource needs: the Montage image mosaic engine is I/O-bound, and the NASA Star and Exoplanet Database periodogram code is CPU-bound. Our presentation will report cost and performance metrics and lessons-learned for continuing development. Applicability of Cloud Computing: Commercial Cloud providers generally charge for all operations, including processing, transfer of input and output data, and for storage of data, and so the costs of running applications vary widely according to how they use resources. The cloud is well suited to processing CPU-bound (and memory bound) workflows such as the periodogram code, given the relatively low cost of processing in comparison with I/O operations. I/O-bound applications such as Montage perform best on high-performance clusters with fast networks and parallel file-systems. Science-driven Cyberinfrastructure: Montage has been widely used as a driver application to develop workflow management services, such as task scheduling in distributed environments, designing fault tolerance techniques for job schedulers, and developing workflow orchestration techniques. Running Parallel Applications Across Distributed Cloud Environments: Data processing will eventually take place in parallel distributed across cyber infrastructure environments having different architectures. We have used the Pegasus Work Management System (WMS) to successfully run applications across three very different environments: TeraGrid, OSG (Open Science Grid), and FutureGrid. Provisioning resources across different grids and clouds (also referred to as Sky Computing), involves establishing a distributed environment, where issues of, e.g, remote job submission, data management, and security need to be addressed. This environment also requires building virtual machine images that can run in different environments. Usually, each cloud provides basic images that can be customized with additional software and services. In most of our work, we provisioned compute resources using a custom application, called Wrangler. Pegasus WMS abstracts the architectures of the compute environments away from the end-user, and can be considered a first-generation tool suitable for scientists to run their applications on disparate environments.
NASA Technical Reports Server (NTRS)
Ferrara, Jeffrey; Calk, William; Atwell, William; Tsui, Tina
2013-01-01
MPISS is an automatic file transfer system that implements a combination of standard and mission-unique transfer protocols required by the Global Precipitation Measurement Mission (GPM) Precipitation Processing System (PPS) to control the flow of data between the MOC and the PPS. The primary features of MPISS are file transfers (both with and without PPS specific protocols), logging of file transfer and system events to local files and a standard messaging bus, short term storage of data files to facilitate retransmissions, and generation of file transfer accounting reports. The system includes a graphical user interface (GUI) to control the system, allow manual operations, and to display events in real time. The PPS specific protocols are an enhanced version of those that were developed for the Tropical Rainfall Measuring Mission (TRMM). All file transfers between the MOC and the PPS use the SSH File Transfer Protocol (SFTP). For reports and data files generated within the MOC, no additional protocols are used when transferring files to the PPS. For observatory data files, an additional handshaking protocol of data notices and data receipts is used. MPISS generates and sends to the PPS data notices containing data start and stop times along with a checksum for the file for each observatory data file transmitted. MPISS retrieves the PPS generated data receipts that indicate the success or failure of the PPS to ingest the data file and/or notice. MPISS retransmits the appropriate files as indicated in the receipt when required. MPISS also automatically retrieves files from the PPS. The unique feature of this software is the use of both standard and PPS specific protocols in parallel. The advantage of this capability is that it supports users that require the PPS protocol as well as those that do not require it. The system is highly configurable to accommodate the needs of future users.
Addressing the Tension Between Strong Perimeter Control an Usability
NASA Technical Reports Server (NTRS)
Hinke, Thomas H.; Kolano, Paul Z.; Keller, Chris
2006-01-01
This paper describes a strong perimeter control system for a general purpose processing system, with the perimeter control system taking significant steps to address usability issues, thus mitigating the tension between strong perimeter protection and usability. A secure front end enforces two-factor authentication for all interactive access to an enclave that contains a large supercomputer and various associated systems, with each requiring their own authentication. Usability is addressed through a design in which the user has to perform two-factor authentication at the secure front end in order to gain access to the enclave, while an agent transparently performs public key authentication as needed to authenticate to specific systems within the enclave. The paper then describes a proxy system that allows users to transfer files into the enclave under script control, when the user is not present to perform two-factor authentication. This uses a pre-authorization approach based on public key technology, which is still strongly tied to both two-factor authentication and strict control over where files can be transferred on the target system. Finally the paper describes an approach to support network applications and systems such as grids or parallel file transfer protocols that require the use of many ports through the perimeter. The paper describes a least privilege approach that dynamically opens ports on a host-specific, if-authorized, as-needed, just-in-time basis.
Compression of high-density EMG signals for trapezius and gastrocnemius muscles.
Itiki, Cinthia; Furuie, Sergio S; Merletti, Roberto
2014-03-10
New technologies for data transmission and multi-electrode arrays increased the demand for compressing high-density electromyography (HD EMG) signals. This article aims the compression of HD EMG signals recorded by two-dimensional electrode matrices at different muscle-contraction forces. It also shows methodological aspects of compressing HD EMG signals for non-pinnate (upper trapezius) and pinnate (medial gastrocnemius) muscles, using image compression techniques. HD EMG signals were placed in image rows, according to two distinct electrode orders: parallel and perpendicular to the muscle longitudinal axis. For the lossless case, the images obtained from single-differential signals as well as their differences in time were compressed. For the lossy algorithm, the images associated to the recorded monopolar or single-differential signals were compressed for different compression levels. Lossless compression provided up to 59.3% file-size reduction (FSR), with lower contraction forces associated to higher FSR. For lossy compression, a 90.8% reduction on the file size was attained, while keeping the signal-to-noise ratio (SNR) at 21.19 dB. For a similar FSR, higher contraction forces corresponded to higher SNR CONCLUSIONS: The computation of signal differences in time improves the performance of lossless compression while the selection of signals in the transversal order improves the lossy compression of HD EMG, for both pinnate and non-pinnate muscles.
Compression of high-density EMG signals for trapezius and gastrocnemius muscles
2014-01-01
Background New technologies for data transmission and multi-electrode arrays increased the demand for compressing high-density electromyography (HD EMG) signals. This article aims the compression of HD EMG signals recorded by two-dimensional electrode matrices at different muscle-contraction forces. It also shows methodological aspects of compressing HD EMG signals for non-pinnate (upper trapezius) and pinnate (medial gastrocnemius) muscles, using image compression techniques. Methods HD EMG signals were placed in image rows, according to two distinct electrode orders: parallel and perpendicular to the muscle longitudinal axis. For the lossless case, the images obtained from single-differential signals as well as their differences in time were compressed. For the lossy algorithm, the images associated to the recorded monopolar or single-differential signals were compressed for different compression levels. Results Lossless compression provided up to 59.3% file-size reduction (FSR), with lower contraction forces associated to higher FSR. For lossy compression, a 90.8% reduction on the file size was attained, while keeping the signal-to-noise ratio (SNR) at 21.19 dB. For a similar FSR, higher contraction forces corresponded to higher SNR Conclusions The computation of signal differences in time improves the performance of lossless compression while the selection of signals in the transversal order improves the lossy compression of HD EMG, for both pinnate and non-pinnate muscles. PMID:24612604
Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
Cheng, Yinhe; Tzeng, Tzy-Hwa Kathy
2016-01-01
This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing data. The sam2bam is especially efficient on single-node multi-core large-memory systems. It can reduce the runtime of data pre-processing in marking duplicate reads on a single node system by 156–186x compared with de facto standard tools. The sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and hardware compression accelerators, if available. The sam2bam provides file format conversion between well-known genome file formats, from SAM to BAM, as a basic feature. Additional features such as analyzing, filtering, and converting input data are provided by using plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at runtime. We demonstrated that sam2bam could significantly reduce the runtime of next generation sequencing (NGS) data pre-processing from about two hours to about one minute for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. The sam2bam could reduce the runtime of NGS data pre-processing from about 20 hours to about nine minutes for a whole-genome sequencing data set on the same system using up to 711 GB of memory. PMID:27861637
Parallel line analysis: multifunctional software for the biomedical sciences
NASA Technical Reports Server (NTRS)
Swank, P. R.; Lewis, M. L.; Damron, K. L.; Morrison, D. R.
1990-01-01
An easy to use, interactive FORTRAN program for analyzing the results of parallel line assays is described. The program is menu driven and consists of five major components: data entry, data editing, manual analysis, manual plotting, and automatic analysis and plotting. Data can be entered from the terminal or from previously created data files. The data editing portion of the program is used to inspect and modify data and to statistically identify outliers. The manual analysis component is used to test the assumptions necessary for parallel line assays using analysis of covariance techniques and to determine potency ratios with confidence limits. The manual plotting component provides a graphic display of the data on the terminal screen or on a standard line printer. The automatic portion runs through multiple analyses without operator input. Data may be saved in a special file to expedite input at a future time.
OceanXtremes: Scalable Anomaly Detection in Oceanographic Time-Series
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Armstrong, E. M.; Chin, T. M.; Gill, K. M.; Greguska, F. R., III; Huang, T.; Jacob, J. C.; Quach, N.
2016-12-01
The oceanographic community must meet the challenge to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we are developing an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of 15 to 30-year ocean science datasets.Our parallel analytics engine is extending the NEXUS system and exploits multiple open-source technologies: Apache Cassandra as a distributed spatial "tile" cache, Apache Spark for in-memory parallel computation, and Apache Solr for spatial search and storing pre-computed tile statistics and other metadata. OceanXtremes provides these key capabilities: Parallel generation (Spark on a compute cluster) of 15 to 30-year Ocean Climatologies (e.g. sea surface temperature or SST) in hours or overnight, using simple pixel averages or customizable Gaussian-weighted "smoothing" over latitude, longitude, and time; Parallel pre-computation, tiling, and caching of anomaly fields (daily variables minus a chosen climatology) with pre-computed tile statistics; Parallel detection (over the time-series of tiles) of anomalies or phenomena by regional area-averages exceeding a specified threshold (e.g. high SST in El Nino or SST "blob" regions), or more complex, custom data mining algorithms; Shared discovery and exploration of ocean phenomena and anomalies (facet search using Solr), along with unexpected correlations between key measured variables; Scalable execution for all capabilities on a hybrid Cloud, using our on-premise OpenStack Cloud cluster or at Amazon. The key idea is that the parallel data-mining operations will be run "near" the ocean data archives (a local "network" hop) so that we can efficiently access the thousands of files making up a three decade time-series. The presentation will cover the architecture of OceanXtremes, parallelization of the climatology computation and anomaly detection algorithms using Spark, example results for SST and other time-series, and parallel performance metrics.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fang, Chin; Corttrell, R. A.
This Technical Note provides an overview of high-performance parallel Big Data transfers with and without encryption for data in-transit over multiple network channels. It shows that with the parallel approach, it is feasible to carry out high-performance parallel "encrypted" Big Data transfers without serious impact to throughput. But other impacts, e.g. the energy-consumption part should be investigated. It also explains our rationales of using a statistics-based approach for gaining understanding from test results and for improving the system. The presentation is of high-level nature. Nevertheless, at the end we will pose some questions and identify potentially fruitful directions for futuremore » work.« less
NASA Technical Reports Server (NTRS)
Welch, J. D.
1975-01-01
The preliminary design of an experiment for landmark recognition and tracking from the Shuttle/Advanced Technology Laboratory is described. It makes use of parallel coherent optical processing to perform correlation tests between landmarks observed passively with a telescope and previously made holographic matched filters. The experimental equipment including the optics, the low power laser, the random access file of matched filters and the electro-optical readout device are described. A real time optically excited liquid crystal device is recommended for performing the input non-coherent optical to coherent optical interface function. A development program leading to a flight experiment in 1981 is outlined.
A Parallel Pipelined Renderer for the Time-Varying Volume Data
NASA Technical Reports Server (NTRS)
Chiueh, Tzi-Cker; Ma, Kwan-Liu
1997-01-01
This paper presents a strategy for efficiently rendering time-varying volume data sets on a distributed-memory parallel computer. Time-varying volume data take large storage space and visualizing them requires reading large files continuously or periodically throughout the course of the visualization process. Instead of using all the processors to collectively render one volume at a time, a pipelined rendering process is formed by partitioning processors into groups to render multiple volumes concurrently. In this way, the overall rendering time may be greatly reduced because the pipelined rendering tasks are overlapped with the I/O required to load each volume into a group of processors; moreover, parallelization overhead may be reduced as a result of partitioning the processors. We modify an existing parallel volume renderer to exploit various levels of rendering parallelism and to study how the partitioning of processors may lead to optimal rendering performance. Two factors which are important to the overall execution time are re-source utilization efficiency and pipeline startup latency. The optimal partitioning configuration is the one that balances these two factors. Tests on Intel Paragon computers show that in general optimal partitionings do exist for a given rendering task and result in 40-50% saving in overall rendering time.
Comparison of working length control consistency between hand K-files and Mtwo NiTi rotary system.
Krajczár, Károly; Varga, Enikő; Marada, Gyula; Jeges, Sára; Tóth, Vilmos
2016-04-01
The purpose of this study was to investigate the consistency of working length control between hand instrumentation in comparison to engine driven Mtwo nickel-titanium rotary files. Forty extracted maxillary molars were selected and divided onto two parallel groups. The working lengths of the mesiobuccal root canals were estimated. The teeth were fixed in a phantom head. The root canal preparation was carried out group 1 (n=20) with hand K-files, (VDW, Munich, Germany) and group 2 (n=20) with Mtwo instruments (VDW, Munich, Germany). Vestibulo-oral and mesio-distal directional x-ray images were taken before the preparation with #10 K-file, inserted into the mesiobuccal root canal to the working length, and after preparation with #25, #30 and #40 files. Working lenght changes were detected with measurements between the radiological apex and the instrument tips. In the Mtwo group a difference in the working competency (p<0.05) could be noticed only in the vestibulo-oral direction from #10 to #40 file. The hand instrument group showed a significant difference in working length competency for each larger file size (p<0.05) (ANOVA). Regression analysis in the hand instrumentation group indicated a working length decrease with a mean of 0,2 mm after each consecutive file size (p<0.01). The outcome of our trial indicated a high consistency in working length control for root canal preparation under simulated clinical condition using Mtwo rotary files. Mtwo NiTi rotary file did therefore proved to be more accurate in comparison to the conventional hand instrumentation. Working length, Mtwo, nickel-titanium, hand preparation, engine driven preparation.
Comparison of working length control consistency between hand K-files and Mtwo NiTi rotary system
Krajczár, Károly; Varga, Enikő; Jeges, Sára; Tóth, Vilmos
2016-01-01
Background The purpose of this study was to investigate the consistency of working length control between hand instrumentation in comparison to engine driven Mtwo nickel-titanium rotary files. Material and Methods Forty extracted maxillary molars were selected and divided onto two parallel groups. The working lengths of the mesiobuccal root canals were estimated. The teeth were fixed in a phantom head. The root canal preparation was carried out group 1 (n=20) with hand K-files, (VDW, Munich, Germany) and group 2 (n=20) with Mtwo instruments (VDW, Munich, Germany). Vestibulo-oral and mesio-distal directional x-ray images were taken before the preparation with #10 K-file, inserted into the mesiobuccal root canal to the working length, and after preparation with #25, #30 and #40 files. Working lenght changes were detected with measurements between the radiological apex and the instrument tips. Results In the Mtwo group a difference in the working competency (p<0.05) could be noticed only in the vestibulo-oral direction from #10 to #40 file. The hand instrument group showed a significant difference in working length competency for each larger file size (p<0.05) (ANOVA). Regression analysis in the hand instrumentation group indicated a working length decrease with a mean of 0,2 mm after each consecutive file size (p<0.01). Conclusions The outcome of our trial indicated a high consistency in working length control for root canal preparation under simulated clinical condition using Mtwo rotary files. Mtwo NiTi rotary file did therefore proved to be more accurate in comparison to the conventional hand instrumentation. Key words:Working length, Mtwo, nickel-titanium, hand preparation, engine driven preparation. PMID:27034752
NASA Technical Reports Server (NTRS)
Morgan, Philip E.
2004-01-01
This final report contains reports of research related to the tasks "Scalable High Performance Computing: Direct and Lark-Eddy Turbulent FLow Simulations Using Massively Parallel Computers" and "Devleop High-Performance Time-Domain Computational Electromagnetics Capability for RCS Prediction, Wave Propagation in Dispersive Media, and Dual-Use Applications. The discussion of Scalable High Performance Computing reports on three objectives: validate, access scalability, and apply two parallel flow solvers for three-dimensional Navier-Stokes flows; develop and validate a high-order parallel solver for Direct Numerical Simulations (DNS) and Large Eddy Simulation (LES) problems; and Investigate and develop a high-order Reynolds averaged Navier-Stokes turbulence model. The discussion of High-Performance Time-Domain Computational Electromagnetics reports on five objectives: enhancement of an electromagnetics code (CHARGE) to be able to effectively model antenna problems; utilize lessons learned in high-order/spectral solution of swirling 3D jets to apply to solving electromagnetics project; transition a high-order fluids code, FDL3DI, to be able to solve Maxwell's Equations using compact-differencing; develop and demonstrate improved radiation absorbing boundary conditions for high-order CEM; and extend high-order CEM solver to address variable material properties. The report also contains a review of work done by the systems engineer.
Accounting Data to Web Interface Using PERL
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hargeaves, C
2001-08-13
This document will explain the process to create a web interface for the accounting information generated by the High Performance Storage Systems (HPSS) accounting report feature. The accounting report contains useful data but it is not easily accessed in a meaningful way. The accounting report is the only way to see summarized storage usage information. The first step is to take the accounting data, make it meaningful and store the modified data in persistent databases. The second step is to generate the various user interfaces, HTML pages, that will be used to access the data. The third step is tomore » transfer all required files to the web server. The web pages pass parameters to Common Gateway Interface (CGI) scripts that generate dynamic web pages and graphs. The end result is a web page with specific information presented in text with or without graphs. The accounting report has a specific format that allows the use of regular expressions to verify if a line is storage data. Each storage data line is stored in a detailed database file with a name that includes the run date. The detailed database is used to create a summarized database file that also uses run date in its name. The summarized database is used to create the group.html web page that includes a list of all storage users. Scripts that query the database folder to build a list of available databases generate two additional web pages. A master script that is run monthly as part of a cron job, after the accounting report has completed, manages all of these individual scripts. All scripts are written in the PERL programming language. Whenever possible data manipulation scripts are written as filters. All scripts are written to be single source, which means they will function properly on both the open and closed networks at LLNL. The master script handles the command line inputs for all scripts, file transfers to the web server and records run information in a log file. The rest of the scripts manipulate the accounting data or use the files created to generate HTML pages. Each script will be described in detail herein. The following is a brief description of HPSS taken directly from an HPSS web site. ''HPSS is a major development project, which began in 1993 as a Cooperative Research and Development Agreement (CRADA) between government and industry. The primary objective of HPSS is to move very large data objects between high performance computers, workstation clusters, and storage libraries at speeds many times faster than is possible with today's software systems. For example, HPSS can manage parallel data transfers from multiple network-connected disk arrays at rates greater than 1 Gbyte per second, making it possible to access high definition digitized video in real time.'' The HPSS accounting report is a canned report whose format is controlled by the HPSS developers.« less
Beating the tyranny of scale with a private cloud configured for Big Data
NASA Astrophysics Data System (ADS)
Lawrence, Bryan; Bennett, Victoria; Churchill, Jonathan; Juckes, Martin; Kershaw, Philip; Pepler, Sam; Pritchard, Matt; Stephens, Ag
2015-04-01
The Joint Analysis System, JASMIN, consists of a five significant hardware components: a batch computing cluster, a hypervisor cluster, bulk disk storage, high performance disk storage, and access to a tape robot. Each of the computing clusters consists of a heterogeneous set of servers, supporting a range of possible data analysis tasks - and a unique network environment makes it relatively trivial to migrate servers between the two clusters. The high performance disk storage will include the world's largest (publicly visible) deployment of the Panasas parallel disk system. Initially deployed in April 2012, JASMIN has already undergone two major upgrades, culminating in a system which by April 2015, will have in excess of 16 PB of disk and 4000 cores. Layered on the basic hardware are a range of services, ranging from managed services, such as the curated archives of the Centre for Environmental Data Archival or the data analysis environment for the National Centres for Atmospheric Science and Earth Observation, to a generic Infrastructure as a Service (IaaS) offering for the UK environmental science community. Here we present examples of some of the big data workloads being supported in this environment - ranging from data management tasks, such as checksumming 3 PB of data held in over one hundred million files, to science tasks, such as re-processing satellite observations with new algorithms, or calculating new diagnostics on petascale climate simulation outputs. We will demonstrate how the provision of a cloud environment closely coupled to a batch computing environment, all sharing the same high performance disk system allows massively parallel processing without the necessity to shuffle data excessively - even as it supports many different virtual communities, each with guaranteed performance. We will discuss the advantages of having a heterogeneous range of servers with available memory from tens of GB at the low end to (currently) two TB at the high end. There are some limitations of the JASMIN environment, the high performance disk environment is not fully available in the IaaS environment, and a planned ability to burst compute heavy jobs into the public cloud is not yet fully available. There are load balancing and performance issues that need to be understood. We will conclude with projections for future usage, and our plans to meet those requirements.
High-energy physics software parallelization using database techniques
NASA Astrophysics Data System (ADS)
Argante, E.; van der Stok, P. D. V.; Willers, I.
1997-02-01
A programming model for software parallelization, called CoCa, is introduced that copes with problems caused by typical features of high-energy physics software. By basing CoCa on the database transaction paradimg, the complexity induced by the parallelization is for a large part transparent to the programmer, resulting in a higher level of abstraction than the native message passing software. CoCa is implemented on a Meiko CS-2 and on a SUN SPARCcenter 2000 parallel computer. On the CS-2, the performance is comparable with the performance of native PVM and MPI.
Scalable Unix commands for parallel processors : a high-performance implementation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ong, E.; Lusk, E.; Gropp, W.
2001-06-22
We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
Optoelectronic associative recall using motionless-head parallel readout optical disk
NASA Astrophysics Data System (ADS)
Marchand, P. J.; Krishnamoorthy, A. V.; Ambs, P.; Esener, S. C.
1990-12-01
High data rates, low retrieval times, and simple implementation are presently shown to be obtainable by means of a motionless-head 2D parallel-readout system for optical disks. Since the optical disk obviates mechanical head motions for access, focusing, and tracking, addressing is performed exclusively through the disk's rotation. Attention is given to a high-performance associative memory system configuration which employs a parallel readout disk.
NASA Technical Reports Server (NTRS)
Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HALF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR) Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP/2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/ programming model (HPF and MPI (message passing interface)) combinations will be compared, based on latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF based NPB results will be compared with MPI based NPB results to provide perspective on performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition we would also present NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz) NEC SX-4/32, SGI/CRAY T3E, SGI Origin2000.
Parallelized direct execution simulation of message-passing parallel programs
NASA Technical Reports Server (NTRS)
Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.
1994-01-01
As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
Purge-Alert | High-Performance Computing | NREL
and maintain if you want notifications for files that will be removed within 7 days. Simply add a single email address to this file: /scratch/$USER/.notify-email The email you will receive will provide instructions as to where the files are located. You will only receive emails if there are files that will be
Component Framework for Loosely Coupled High Performance Integrated Plasma Simulations
NASA Astrophysics Data System (ADS)
Elwasif, W. R.; Bernholdt, D. E.; Shet, A. G.; Batchelor, D. B.; Foley, S.
2010-11-01
We present the design and implementation of a component-based simulation framework for the execution of coupled time-dependent plasma modeling codes. The Integrated Plasma Simulator (IPS) provides a flexible lightweight component model that streamlines the integration of stand alone codes into coupled simulations. Standalone codes are adapted to the IPS component interface specification using a thin wrapping layer implemented in the Python programming language. The framework provides services for inter-component method invocation, configuration, task, and data management, asynchronous event management, simulation monitoring, and checkpoint/restart capabilities. Services are invoked, as needed, by the computational components to coordinate the execution of different aspects of coupled simulations on Massive parallel Processing (MPP) machines. A common plasma state layer serves as the foundation for inter-component, file-based data exchange. The IPS design principles, implementation details, and execution model will be presented, along with an overview of several use cases.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Staley, Martin
2017-09-20
This high-performance ray tracing library provides very fast rendering; compact code; type flexibility through C++ "generic programming" techniques; and ease of use via an application programming interface (API) that operates independently of any GUI, on-screen display, or other enclosing application. Kip supports constructive solid geometry (CSG) models based on a wide variety of built-in shapes and logical operators, and also allows for user-defined shapes and operators to be provided. Additional features include basic texturing; input/output of models using a simple human-readable file format and with full error checking and detailed diagnostics; and support for shared data parallelism. Kip is writtenmore » in pure, ANSI standard C++; is entirely platform independent; and is very easy to use. As a C++ "header only" library, it requires no build system, configuration or installation scripts, wizards, non-C++ preprocessing, makefiles, shell scripts, or external libraries.« less
Federal Register 2010, 2011, 2012, 2013, 2014
2011-04-13
... on the Exchange's Internet Web site at http://www.directedge.com . \\3\\ A Member is any registered... strategy to the ROUD/ROUE routing strategies is Parallel D or Parallel 2D with the DRT (Dark routing... one method. The Commission will post all comments on the Commission's Internet Web site ( http://www...
ASDF: An Adaptable Seismic Data Format with Full Provenance
NASA Astrophysics Data System (ADS)
Smith, J. A.; Krischer, L.; Tromp, J.; Lefebvre, M. P.
2015-12-01
In order for seismologists to maximize their knowledge of how the Earth works, they must extract the maximum amount of useful information from all recorded seismic data available for their research. This requires assimilating large sets of waveform data, keeping track of vast amounts of metadata, using validated standards for quality control, and automating the workflow in a careful and efficient manner. In addition, there is a growing gap between CPU/GPU speeds and disk access speeds that leads to an I/O bottleneck in seismic workflows. This is made even worse by existing seismic data formats that were not designed for performance and are limited to a few fixed headers for storing metadata.The Adaptable Seismic Data Format (ASDF) is a new data format for seismology that solves the problems with existing seismic data formats and integrates full provenance into the definition. ASDF is a self-describing format that features parallel I/O using the parallel HDF5 library. This makes it a great choice for use on HPC clusters. The format integrates the standards QuakeML for seismic sources and StationXML for receivers. ASDF is suitable for storing earthquake data sets, where all waveforms for a single earthquake are stored in a one file, ambient noise cross-correlations, and adjoint sources. The format comes with a user-friendly Python reader and writer that gives seismologists access to a full set of Python tools for seismology. There is also a faster C/Fortran library for integrating ASDF into performance-focused numerical wave solvers, such as SPECFEM3D_GLOBE. Finally, a GUI tool designed for visually exploring the format exists that provides a flexible interface for both research and educational applications. ASDF is a new seismic data format that offers seismologists high-performance parallel processing, organized and validated contents, and full provenance tracking for automated seismological workflows.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Collins, Benjamin S.
The Futility package contains the following: 1) Definition of the size of integers and real numbers; 2) A generic Unit test harness; 3) Definitions for some basic extensions to the Fortran language: arbitrary length strings, a parameter list construct, exception handlers, command line processor, timers; 4) Geometry definitions: point, line, plane, box, cylinder, polyhedron; 5) File wrapper functions: standard Fortran input/output files, Fortran binary files, HDF5 files; 6) Parallel wrapper functions: MPI, and Open MP abstraction layers, partitioning algorithms; 7) Math utilities: BLAS, Matrix and Vector definitions, Linear Solver methods and wrappers for other TPLs (PETSC, MKL, etc), preconditioner classes;more » 8) Misc: random number generator, water saturation properties, sorting algorithms.« less
Distributed File System Utilities to Manage Large DatasetsVersion 0.5
DOE Office of Scientific and Technical Information (OSTI.GOV)
2014-05-21
FileUtils provides a suite of tools to manage large datasets typically created by large parallel MPI applications. They are written in C and use standard POSIX I/Ocalls. The current suite consists of tools to copy, compare, remove, and list. The tools provide dramatic speedup over existing Linux tools, which often run as a single process.
76 FR 68194 - Submission for OMB Review; Comment Request
Federal Register 2010, 2011, 2012, 2013, 2014
2011-11-03
... Layouts for HPBS Work Measures. OMB No.: 0970-0230. Description: There is no longer a High Performance... information on States' performance. The Transmission File Layouts form provides the format that States will... are not requesting any changes to the Transmission File Layouts form. Respondents: Respondents may...
Abreu, Rui Mv; Froufe, Hugo Jc; Queiroz, Maria João Rp; Ferreira, Isabel Cfr
2010-10-28
Virtual screening of small molecules using molecular docking has become an important tool in drug discovery. However, large scale virtual screening is time demanding and usually requires dedicated computer clusters. There are a number of software tools that perform virtual screening using AutoDock4 but they require access to dedicated Linux computer clusters. Also no software is available for performing virtual screening with Vina using computer clusters. In this paper we present MOLA, an easy-to-use graphical user interface tool that automates parallel virtual screening using AutoDock4 and/or Vina in bootable non-dedicated computer clusters. MOLA automates several tasks including: ligand preparation, parallel AutoDock4/Vina jobs distribution and result analysis. When the virtual screening project finishes, an open-office spreadsheet file opens with the ligands ranked by binding energy and distance to the active site. All results files can automatically be recorded on an USB-flash drive or on the hard-disk drive using VirtualBox. MOLA works inside a customized Live CD GNU/Linux operating system, developed by us, that bypass the original operating system installed on the computers used in the cluster. This operating system boots from a CD on the master node and then clusters other computers as slave nodes via ethernet connections. MOLA is an ideal virtual screening tool for non-experienced users, with a limited number of multi-platform heterogeneous computers available and no access to dedicated Linux computer clusters. When a virtual screening project finishes, the computers can just be restarted to their original operating system. The originality of MOLA lies on the fact that, any platform-independent computer available can he added to the cluster, without ever using the computer hard-disk drive and without interfering with the installed operating system. With a cluster of 10 processors, and a potential maximum speed-up of 10x, the parallel algorithm of MOLA performed with a speed-up of 8,64× using AutoDock4 and 8,60× using Vina.
Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schmalz, Mark S
2011-07-24
Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation {und G}more » for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G {yields} {und G}, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.« less
Grider, Gary A.; Poole, Stephen W.
2015-09-01
Collective buffering and data pattern solutions are provided for storage, retrieval, and/or analysis of data in a collective parallel processing environment. For example, a method can be provided for data storage in a collective parallel processing environment. The method comprises receiving data to be written for a plurality of collective processes within a collective parallel processing environment, extracting a data pattern for the data to be written for the plurality of collective processes, generating a representation describing the data pattern, and saving the data and the representation.
Optimizing CyberShake Seismic Hazard Workflows for Large HPC Resources
NASA Astrophysics Data System (ADS)
Callaghan, S.; Maechling, P. J.; Juve, G.; Vahi, K.; Deelman, E.; Jordan, T. H.
2014-12-01
The CyberShake computational platform is a well-integrated collection of scientific software and middleware that calculates 3D simulation-based probabilistic seismic hazard curves and hazard maps for the Los Angeles region. Currently each CyberShake model comprises about 235 million synthetic seismograms from about 415,000 rupture variations computed at 286 sites. CyberShake integrates large-scale parallel and high-throughput serial seismological research codes into a processing framework in which early stages produce files used as inputs by later stages. Scientific workflow tools are used to manage the jobs, data, and metadata. The Southern California Earthquake Center (SCEC) developed the CyberShake platform using USC High Performance Computing and Communications systems and open-science NSF resources.CyberShake calculations were migrated to the NSF Track 1 system NCSA Blue Waters when it became operational in 2013, via an interdisciplinary team approach including domain scientists, computer scientists, and middleware developers. Due to the excellent performance of Blue Waters and CyberShake software optimizations, we reduced the makespan (a measure of wallclock time-to-solution) of a CyberShake study from 1467 to 342 hours. We will describe the technical enhancements behind this improvement, including judicious introduction of new GPU software, improved scientific software components, increased workflow-based automation, and Blue Waters-specific workflow optimizations.Our CyberShake performance improvements highlight the benefits of scientific workflow tools. The CyberShake workflow software stack includes the Pegasus Workflow Management System (Pegasus-WMS, which includes Condor DAGMan), HTCondor, and Globus GRAM, with Pegasus-mpi-cluster managing the high-throughput tasks on the HPC resources. The workflow tools handle data management, automatically transferring about 13 TB back to SCEC storage.We will present performance metrics from the most recent CyberShake study, executed on Blue Waters. We will compare the performance of CPU and GPU versions of our large-scale parallel wave propagation code, AWP-ODC-SGT. Finally, we will discuss how these enhancements have enabled SCEC to move forward with plans to increase the CyberShake simulation frequency to 1.0 Hz.
Compact 0-complete trees: A new method for searching large files
DOE Office of Scientific and Technical Information (OSTI.GOV)
Orlandic, R.; Pfaltz, J.L.
1988-01-26
In this report, a novel approach to ordered retrieval in very large files is developed. The method employs a B-tree like search algorithm that is independent of key type or key length because all keys in index blocks are encoded by a 1 byte surrogate. The replacement of actual key sequences by the 1 byte surrogate ensures a maximal possible fan out and greatly reduces the storage overhead of maintaining access indices. Initially, retrieval in binary trie structure is developed. With the aid of a fairly complex recurrence relation, the rather scraggly binary trie is transformed into compact multi-way searchmore » tree. Then the recurrence relation itself is replaced by an unusually simple search algorithm. Then implementation details and empirical performance results are presented. Reduction of index size by 50%--75% opens up the possibility of replicating system-wide indices for parallel access in distributed databases. 23 figs.« less
A high performance linear equation solver on the VPP500 parallel supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi
1994-12-31
This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of VPP500--(1) scalability for an arbitrary number of processors up to 222 processors, (2) flexible data transfer among processors provided by a crossbar interconnection network, (3) vector processing capability on each processor, and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LIN-PACK Highly Parallel Computing benchmark.
A high performance parallel algorithm for 1-D FFT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Agarwal, R.C.; Gustavson, F.G.; Zubair, M.
1994-12-31
In this paper the authors propose a parallel high performance FFT algorithm based on a multi-dimensional formulation. They use this to solve a commonly encountered FFT based kernel on a distributed memory parallel machine, the IBM scalable parallel system, SP1. The kernel requires a forward FFT computation of an input sequence, multiplication of the transformed data by a coefficient array, and finally an inverse FFT computation of the resultant data. They show that the multi-dimensional formulation helps in reducing the communication costs and also improves the single node performance by effectively utilizing the memory system of the node. They implementedmore » this kernel on the IBM SP1 and observed a performance of 1.25 GFLOPS on a 64-node machine.« less
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Rauniyar, Navin
2015-01-01
The parallel reaction monitoring (PRM) assay has emerged as an alternative method of targeted quantification. The PRM assay is performed in a high resolution and high mass accuracy mode on a mass spectrometer. This review presents the features that make PRM a highly specific and selective method for targeted quantification using quadrupole-Orbitrap hybrid instruments. In addition, this review discusses the label-based and label-free methods of quantification that can be performed with the targeted approach. PMID:26633379
DOE Office of Scientific and Technical Information (OSTI.GOV)
Busby, L. E.
Fortran modules tend to serialize compilation of large Fortran projects, by introducing dependencies among the source files. If file A depends on file B, (A uses a module defined by B), you must finish compiling B before you can begin compiling A. Some Fortran compilers (Intel ifort, GNU gfortran and IBM xlf, at least) offer an option to ‘‘verify syntax’’, with the side effect of also producing any associated Fortran module files. As it happens, this option usually runs much faster than the object code generation and optimization phases. For some projects on some machines, it can be advantageous tomore » compile in two passes: The first pass generates the module files, quickly; the second pass produces the object files, in parallel. We achieve a 3.8× speedup in the case study below.« less
NaviSE: superenhancer navigator integrating epigenomics signal algebra.
Ascensión, Alex M; Arrospide-Elgarresta, Mikel; Izeta, Ander; Araúzo-Bravo, Marcos J
2017-06-06
Superenhancers are crucial structural genomic elements determining cell fate, and they are also involved in the determination of several diseases, such as cancer or neurodegeneration. Although there are pipelines which use independent pieces of software to predict the presence of superenhancers from genome-wide chromatin marks or DNA-interaction protein binding sites, there is not yet an integrated software tool that processes automatically algebra combinations of raw data sequencing into a comprehensive final annotated report of predicted superenhancers. We have developed NaviSE, a user-friendly streamlined tool which performs a fully-automated parallel processing of genome-wide epigenomics data from sequencing files into a final report, built with a comprehensive set of annotated files that are navigated through a graphic user interface dynamically generated by NaviSE. NaviSE also implements an 'epigenomics signal algebra' that allows the combination of multiple activation and repression epigenomics signals. NaviSE provides an interactive chromosomal landscaping of the locations of superenhancers, which can be navigated to obtain annotated information about superenhancer signal profile, associated genes, gene ontology enrichment analysis, motifs of transcription factor binding sites enriched in superenhancers, graphs of the metrics evaluating the superenhancers quality, protein-protein interaction networks and enriched metabolic pathways among other features. We have parallelised the most time-consuming tasks achieving a reduction up to 30% for a 15 CPUs machine. We have optimized the default parameters of NaviSE to facilitate its use. NaviSE allows different entry levels of data processing, from sra-fastq files to bed files; and unifies the processing of multiple replicates. NaviSE outperforms the more time-consuming processes required in a non-integrated pipeline. Alongside its high performance, NaviSE is able to provide biological insights, predicting cell type specific markers, such as SOX2 and ZIC3 in embryonic stem cells, CDK5R1 and REST in neurons and CD86 and TLR2 in monocytes. NaviSE is a user-friendly streamlined solution for superenhancer analysis, annotation and navigation, requiring only basic computer and next generation sequencing knowledge. NaviSE binaries and documentation are available at: https://sourceforge.net/projects/navise-superenhancer/ .
Central Satellite Data Repository Supporting Research and Development
NASA Astrophysics Data System (ADS)
Han, W.; Brust, J.
2015-12-01
Near real-time satellite data is critical to many research and development activities of atmosphere, land, and ocean processes. Acquiring and managing huge volumes of satellite data without (or with less) latency in an organization is always a challenge in the big data age. An organization level data repository is a practical solution to meeting this challenge. The STAR (Center for Satellite Applications and Research of NOAA) Central Data Repository (SCDR) is a scalable, stable, and reliable repository to acquire, manipulate, and disseminate various types of satellite data in an effective and efficient manner. SCDR collects more than 200 data products, which are commonly used by multiple groups in STAR, from NOAA, GOES, Metop, Suomi NPP, Sentinel, Himawari, and other satellites. The processes of acquisition, recording, retrieval, organization, and dissemination are performed in parallel. Multiple data access interfaces, like FTP, FTPS, HTTP, HTTPS, and RESTful, are supported in the SCDR to obtain satellite data from their providers through high speed internet. The original satellite data in various raster formats can be parsed in the respective adapter to retrieve data information. The data information is ingested to the corresponding partitioned tables in the central database. All files are distributed equally on the Network File System (NFS) disks to balance the disk load. SCDR provides consistent interfaces (including Perl utility, portal, and RESTful Web service) to locate files of interest easily and quickly and access them directly by over 200 compute servers via NFS. SCDR greatly improves collection and integration of near real-time satellite data, addresses satellite data requirements of scientists and researchers, and facilitates their primary research and development activities.
Ellingson, Sally R; Dakshanamurthy, Sivanesan; Brown, Milton; Smith, Jeremy C; Baudry, Jerome
2014-04-25
In this paper we give the current state of high-throughput virtual screening. We describe a case study of using a task-parallel MPI (Message Passing Interface) version of Autodock4 [1], [2] to run a virtual high-throughput screen of one-million compounds on the Jaguar Cray XK6 Supercomputer at Oak Ridge National Laboratory. We include a description of scripts developed to increase the efficiency of the predocking file preparation and postdocking analysis. A detailed tutorial, scripts, and source code for this MPI version of Autodock4 are available online at http://www.bio.utk.edu/baudrylab/autodockmpi.htm.
Parallel integer sorting with medium and fine-scale parallelism
NASA Technical Reports Server (NTRS)
Dagum, Leonardo
1993-01-01
Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-04-14
... on the Exchange's Internet Web site at http://www.directedge.com . \\3\\ A Member is any registered... Parallel D or Parallel 2D with the DRT (Dark routing technique) option on BZX. BZX charges $0.0020 per... the DRT (Dark routing technique) option on BZX or SCAN/STGY on Nasdaq OMX Exchange (``Nasdaq.'') BATS...
TU-AB-BRC-12: Optimized Parallel MonteCarlo Dose Calculations for Secondary MU Checks
DOE Office of Scientific and Technical Information (OSTI.GOV)
French, S; Nazareth, D; Bellor, M
Purpose: Secondary MU checks are an important tool used during a physics review of a treatment plan. Commercial software packages offer varying degrees of theoretical dose calculation accuracy, depending on the modality involved. Dose calculations of VMAT plans are especially prone to error due to the large approximations involved. Monte Carlo (MC) methods are not commonly used due to their long run times. We investigated two methods to increase the computational efficiency of MC dose simulations with the BEAMnrc code. Distributed computing resources, along with optimized code compilation, will allow for accurate and efficient VMAT dose calculations. Methods: The BEAMnrcmore » package was installed on a high performance computing cluster accessible to our clinic. MATLAB and PYTHON scripts were developed to convert a clinical VMAT DICOM plan into BEAMnrc input files. The BEAMnrc installation was optimized by running the VMAT simulations through profiling tools which indicated the behavior of the constituent routines in the code, e.g. the bremsstrahlung splitting routine, and the specified random number generator. This information aided in determining the most efficient compiling parallel configuration for the specific CPU’s available on our cluster, resulting in the fastest VMAT simulation times. Our method was evaluated with calculations involving 10{sup 8} – 10{sup 9} particle histories which are sufficient to verify patient dose using VMAT. Results: Parallelization allowed the calculation of patient dose on the order of 10 – 15 hours with 100 parallel jobs. Due to the compiler optimization process, further speed increases of 23% were achieved when compared with the open-source compiler BEAMnrc packages. Conclusion: Analysis of the BEAMnrc code allowed us to optimize the compiler configuration for VMAT dose calculations. In future work, the optimized MC code, in conjunction with the parallel processing capabilities of BEAMnrc, will be applied to provide accurate and efficient secondary MU checks.« less
Development of EnergyPlus Utility to Batch Simulate Building Energy Performance on a National Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Valencia, Jayson F.; Dirks, James A.
2008-08-29
EnergyPlus is a simulation program that requires a large number of details to fully define and model a building. Hundreds or even thousands of lines in a text file are needed to run the EnergyPlus simulation depending on the size of the building. To manually create these files is a time consuming process that would not be practical when trying to create input files for thousands of buildings needed to simulate national building energy performance. To streamline the process needed to create the input files for EnergyPlus, two methods were created to work in conjunction with the National Renewable Energymore » Laboratory (NREL) Preprocessor; this reduced the hundreds of inputs needed to define a building in EnergyPlus to a small set of high-level parameters. The first method uses Java routines to perform all of the preprocessing on a Windows machine while the second method carries out all of the preprocessing on the Linux cluster by using an in-house built utility called Generalized Parametrics (GPARM). A comma delimited (CSV) input file is created to define the high-level parameters for any number of buildings. Each method then takes this CSV file and uses the data entered for each parameter to populate an extensible markup language (XML) file used by the NREL Preprocessor to automatically prepare EnergyPlus input data files (idf) using automatic building routines and macro templates. Using a Linux utility called “make”, the idf files can then be automatically run through the Linux cluster and the desired data from each building can be aggregated into one table to be analyzed. Creating a large number of EnergyPlus input files results in the ability to batch simulate building energy performance and scale the result to national energy consumption estimates.« less
Web Based Parallel Programming Workshop for Undergraduate Education.
ERIC Educational Resources Information Center
Marcus, Robert L.; Robertson, Douglass
Central State University (Ohio), under a contract with Nichols Research Corporation, has developed a World Wide web based workshop on high performance computing entitled "IBN SP2 Parallel Programming Workshop." The research is part of the DoD (Department of Defense) High Performance Computing Modernization Program. The research…
Darabara, Myrsini; Bourithis, Lefteris; Zinelis, Spiros; Papadimitriou, George D
2004-07-01
The purpose of this study was to determine the elemental composition, microstructure, and hardness of commercially available reamers, K files, and H files. Five instruments of each type from different manufacturers (Antaeos, FKG, Maillefer, Mani, and Micromega) were embedded in epoxy resin along their longitudinal axis. After metallographic grinding and polishing, the specimens were chemically etched and their microstructure investigated under an incident light microscope. The specimens were studied under a scanning electron microscope, and their elemental compositions were determined by energy dispersive X-ray microanalysis. The same surfaces were repolished and X-ray diffraction was performed. The same specimen surface was used for the assessment of the Vickers hardness (HV200) by using a microhardness tester with a 200-g load and 20-s contact time. The hardness results were statistically analyzed with two-way ANOVA and Tukey's test (a = 0.05). All files demonstrated extensively elongated grains parallel to longitudinal file axis because of cold drawing. The elemental composition of Maillefer and Mani reamers, Antaeos K files, and Mani H files were found in the range of AISI 303 SS, whereas all the rest were determined as AISI 304 SS. Two different phases (austenite SSt and martensite SSt) were identified with X-ray diffraction for all files tested. The results of hardness classified reamers in the following decreasing order (HMV200): Micromega = 673 +/- 29, Mani = 662 +/- 24, Maillefer = 601 +/- 34, Antaeos = 586 +/- 18, FKG = 557 +/- 19, and the K files (HV200): FKG = 673 +/- 16, Mani = 647 +/- 19, Maillefer = 603 +/- 41, Antaeos = 566 +/- 21, Micromega = 555 +/- 15, and the H files (HMV200): Mani = 640 +/- 12, FKG = 583 +/- 31, Maillefer = 581 +/- 5, Antaeos = 573 +/- 3, Micromega = 546 +/- 14. Although only two stainless steel alloys were used for the production of endodontic files, the differences in hardness are independent to the alloys used, implying that other factors, such as the production method or the thermomechanical history of the alloys, play an important role on the mechanical properties of endodontic files.
A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator
Engelmann, Christian; Naughton, III, Thomas J.
2016-03-22
Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures and the performance impact of different HPC architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The overhead introduced by a simulation tool is an important performance and productivity aspect. This paper documents two improvements to xSim: (1)~a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead and (2)~a new simulated MPI message matchingmore » algorithm to reduce the oversubscription management overhead. The results clearly show a significant performance improvement. The simulation overhead for running the NAS Parallel Benchmark suite was reduced from 102% to 0% for the embarrassingly parallel (EP) benchmark and from 1,020% to 238% for the conjugate gradient (CG) benchmark. xSim offers a highly accurate simulation mode for better tracking of injected MPI process failures. Furthermore, with highly accurate simulation, the overhead was reduced from 3,332% to 204% for EP and from 37,511% to 13,808% for CG.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bailey, David H.
The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym 'NAS' originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, althoughmore » the acronym remains 'NAS.' The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, LeoDagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation. In part to draw attention to some of the performance reporting abuses prevalent at the time, the present author wrote a humorous essay 'Twelve Ways to Fool the Masses,' which described in a light-hearted way a number of the questionable ways in which both vendor marketing people and scientists were inflating and distorting their performance results. All of this underscored the need for an objective and scientifically defensible measure to compare performance on these systems.« less
High Performance Databases For Scientific Applications
NASA Technical Reports Server (NTRS)
French, James C.; Grimshaw, Andrew S.
1997-01-01
The goal for this task is to develop an Extensible File System (ELFS). ELFS attacks the problem of the following: 1. Providing high bandwidth performance architectures; 2. Reducing the cognitive burden faced by applications programmers when they attempt to optimize; and 3. Seamlessly managing the proliferation of data formats and architectural differences. The approach for ELFS solution consists of language and run-time system support that permits the specification on a hierarchy of file classes.
ROOT: A C++ framework for petabyte data storage, statistical analysis and visualization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Antcheva, I.; /CERN; Ballintijn, M.
2009-01-01
ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web or a number of different shared file systems. In order to analyze this data, the user can chose outmore » of a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, the RooFit package allows the user to perform complex data modeling and fitting while the RooStats library provides abstractions and implementations for advanced statistical tools. Multivariate classification methods based on machine learning techniques are available via the TMVA package. A central piece in these analysis tools are the histogram classes which provide binning of one- and multi-dimensional data. Results can be saved in high-quality graphical formats like Postscript and PDF or in bitmap formats like JPG or GIF. The result can also be stored into ROOT macros that allow a full recreation and rework of the graphics. Users typically create their analysis macros step by step, making use of the interactive C++ interpreter CINT, while running over small data samples. Once the development is finished, they can run these macros at full compiled speed over large data sets, using on-the-fly compilation, or by creating a stand-alone batch program. Finally, if processing farms are available, the user can reduce the execution time of intrinsically parallel tasks - e.g. data mining in HEP - by using PROOF, which will take care of optimally distributing the work over the available resources in a transparent way.« less
NASA Astrophysics Data System (ADS)
Poat, M. D.; Lauret, J.; Betts, W.
2015-12-01
The STAR online computing infrastructure has become an intensive dynamic system used for first-hand data collection and analysis resulting in a dense collection of data output. As we have transitioned to our current state, inefficient, limited storage systems have become an impediment to fast feedback to online shift crews. Motivation for a centrally accessible, scalable and redundant distributed storage system had become a necessity in this environment. OpenStack Swift Object Storage and Ceph Object Storage are two eye-opening technologies as community use and development have led to success elsewhere. In this contribution, OpenStack Swift and Ceph have been put to the test with single and parallel I/O tests, emulating real world scenarios for data processing and workflows. The Ceph file system storage, offering a POSIX compliant file system mounted similarly to an NFS share was of particular interest as it aligned with our requirements and was retained as our solution. I/O performance tests were run against the Ceph POSIX file system and have presented surprising results indicating true potential for fast I/O and reliability. STAR'S online compute farm historical use has been for job submission and first hand data analysis. The goal of reusing the online compute farm to maintain a storage cluster and job submission will be an efficient use of the current infrastructure.
Oak Ridge Institutional Cluster Autotune Test Drive Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jibonananda, Sanyal; New, Joshua Ryan
2014-02-01
The Oak Ridge Institutional Cluster (OIC) provides general purpose computational resources for the ORNL staff to run computation heavy jobs that are larger than desktop applications but do not quite require the scale and power of the Oak Ridge Leadership Computing Facility (OLCF). This report details the efforts made and conclusions derived in performing a short test drive of the cluster resources on Phase 5 of the OIC. EnergyPlus was used in the analysis as a candidate user program and the overall software environment was evaluated against anticipated challenges experienced with resources such as the shared memory-Nautilus (JICS) and Titanmore » (OLCF). The OIC performed within reason and was found to be acceptable in the context of running EnergyPlus simulations. The number of cores per node and the availability of scratch space per node allow non-traditional desktop focused applications to leverage parallel ensemble execution. Although only individual runs of EnergyPlus were executed, the software environment on the OIC appeared suitable to run ensemble simulations with some modifications to the Autotune workflow. From a standpoint of general usability, the system supports common Linux libraries, compilers, standard job scheduling software (Torque/Moab), and the OpenMPI library (the only MPI library) for MPI communications. The file system is a Panasas file system which literature indicates to be an efficient file system.« less
Boutsioukis, C; Verhaagen, B; Walmsley, A D; Versluis, M; van der Sluis, L W M
2013-11-01
(i) To quantify in a simulated root canal model the file-to-wall contact during ultrasonic activation of an irrigant and to evaluate the effect of root canal size, file insertion depth, ultrasonic power, root canal level and previous training, (ii) To investigate the effect of file-to-wall contact on file oscillation. File-to-wall contact was measured during ultrasonic activation of the irrigant performed by 15 trained and 15 untrained participants in two metal root canal models. Results were analyzed by two 5-way mixed-design anovas. The level of significance was set at P < 0.05. Additionally, high-speed visualizations, laser-vibrometer measurements and numerical simulations of the file oscillation were conducted. File-to-wall contact occurred in all cases during 20% of the activation time. Contact time was significantly shorter at high power (P < 0.001), when the file was positioned away from working length (P < 0.001), in the larger root canal (P < 0.001) and from coronal towards apical third of the root canal (P < 0.002), in most of the cases studied. Previous training did not show a consistent significant effect. File oscillation was affected by contact during 94% of the activation time. During wall contact, the file bounced back and forth against the wall at audible frequencies (ca. 5 kHz), but still performed the original 30 kHz oscillations. Travelling waves were identified on the file. The file oscillation was not dampened completely due to the contact and hydrodynamic cavitation was detected. Considerable file-to-wall contact occur-red during irrigant activation. Therefore, the term 'Passive Ultrasonic Irrigation' should be amended to 'Ultrasonically Activated Irrigation'. © 2013 International Endodontic Journal. Published by John Wiley & Sons Ltd.
NASA Technical Reports Server (NTRS)
Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry
1998-01-01
Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using some parallelization tools and compilers. In this paper, we compare the performance of the hand written NAB Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools: an interactive computer aided parallelization too] that generates message passing code, 2) the Portland Group's HPF compiler and 3) using compiler directives with the native FORTAN77 compiler on the SGI Origin2000.
NIF Ignition Target 3D Point Design
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, O; Marinak, M; Milovich, J
2008-11-05
We have developed an input file for running 3D NIF hohlraums that is optimized such that it can be run in 1-2 days on parallel computers. We have incorporated increasing levels of automation into the 3D input file: (1) Configuration controlled input files; (2) Common file for 2D and 3D, different types of capsules (symcap, etc.); and (3) Can obtain target dimensions, laser pulse, and diagnostics settings automatically from NIF Campaign Management Tool. Using 3D Hydra calculations to investigate different problems: (1) Intrinsic 3D asymmetry; (2) Tolerance to nonideal 3D effects (e.g. laser power balance, pointing errors); and (3) Syntheticmore » diagnostics.« less
A Parallel Rendering Algorithm for MIMD Architectures
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.; Orloff, Tobias
1991-01-01
Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
High Performance Data Transfer for Distributed Data Intensive Sciences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fang, Chin; Cottrell, R 'Les' A.; Hanushevsky, Andrew B.
We report on the development of ZX software providing high performance data transfer and encryption. The design scales in: computation power, network interfaces, and IOPS while carefully balancing the available resources. Two U.S. patent-pending algorithms help tackle data sets containing lots of small files and very large files, and provide insensitivity to network latency. It has a cluster-oriented architecture, using peer-to-peer technologies to ease deployment, operation, usage, and resource discovery. Its unique optimizations enable effective use of flash memory. Using a pair of existing data transfer nodes at SLAC and NERSC, we compared its performance to that of bbcp andmore » GridFTP and determined that they were comparable. With a proof of concept created using two four-node clusters with multiple distributed multi-core CPUs, network interfaces and flash memory, we achieved 155Gbps memory-to-memory over a 2x100Gbps link aggregated channel and 70Gbps file-to-file with encryption over a 5000 mile 100Gbps link.« less
Nuclide Depletion Capabilities in the Shift Monte Carlo Code
Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...
2017-12-21
A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.
DAKOTA Design Analysis Kit for Optimization and Terascale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, Brian M.; Dalbey, Keith R.; Eldred, Michael S.
2010-02-24
The DAKOTA (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a flexible and extensible interface between simulation codes (computational models) and iterative analysis methods. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the DAKOTA toolkit provides a flexible and extensible problem-solving environment for design and analysis of computational models on high performance computers.A user provides a set of DAKOTA commands in an input file and launches DAKOTA. DAKOTA invokes instances of the computational models, collects their results, and performs systems analyses. DAKOTA contains algorithms for optimization with gradient and nongradient-basedmore » methods; uncertainty quantification with sampling, reliability, polynomial chaos, stochastic collocation, and epistemic methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as hybrid optimization, surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. Services for parallel computing, simulation interfacing, approximation modeling, fault tolerance, restart, and graphics are also included.« less
Development and performance of a new version of the OASIS coupler, OASIS3-MCT_3.0
NASA Astrophysics Data System (ADS)
Craig, Anthony; Valcke, Sophie; Coquart, Laure
2017-09-01
OASIS is coupling software developed primarily for use in the climate community. It provides the ability to couple different models with low implementation and performance overhead. OASIS3-MCT is the latest version of OASIS. It includes several improvements compared to OASIS3, including elimination of a separate hub coupler process, parallelization of the coupling communication and run-time grid interpolation, and the ability to easily reuse mapping weight files. OASIS3-MCT_3.0 is the latest release and includes the ability to couple between components running sequentially on the same set of tasks as well as to couple within a single component between different grids or decompositions such as physics, dynamics, and I/O. OASIS3-MCT has been tested with different configurations on up to 32 000 processes, with components running on high-resolution grids with up to 1.5 million grid cells, and with over 10 000 2-D coupling fields. Several new features will be available in OASIS3-MCT_4.0, and some of those are also described.
Parkhurst, David L.; Kipp, Kenneth L.; Engesgaard, Peter; Charlton, Scott R.
2004-01-01
The computer program PHAST simulates multi-component, reactive solute transport in three-dimensional saturated ground-water flow systems. PHAST is a versatile ground-water flow and solute-transport simulator with capabilities to model a wide range of equilibrium and kinetic geochemical reactions. The flow and transport calculations are based on a modified version of HST3D that is restricted to constant fluid density and constant temperature. The geochemical reactions are simulated with the geochemical model PHREEQC, which is embedded in PHAST. PHAST is applicable to the study of natural and contaminated ground-water systems at a variety of scales ranging from laboratory experiments to local and regional field scales. PHAST can be used in studies of migration of nutrients, inorganic and organic contaminants, and radionuclides; in projects such as aquifer storage and recovery or engineered remediation; and in investigations of the natural rock-water interactions in aquifers. PHAST is not appropriate for unsaturated-zone flow, multiphase flow, density-dependent flow, or waters with high ionic strengths. A variety of boundary conditions are available in PHAST to simulate flow and transport, including specified-head, flux, and leaky conditions, as well as the special cases of rivers and wells. Chemical reactions in PHAST include (1) homogeneous equilibria using an ion-association thermodynamic model; (2) heterogeneous equilibria between the aqueous solution and minerals, gases, surface complexation sites, ion exchange sites, and solid solutions; and (3) kinetic reactions with rates that are a function of solution composition. The aqueous model (elements, chemical reactions, and equilibrium constants), minerals, gases, exchangers, surfaces, and rate expressions may be defined or modified by the user. A number of options are available to save results of simulations to output files. The data may be saved in three formats: a format suitable for viewing with a text editor; a format suitable for exporting to spreadsheets and post-processing programs; or in Hierarchical Data Format (HDF), which is a compressed binary format. Data in the HDF file can be visualized on Windows computers with the program Model Viewer and extracted with the utility program PHASTHDF; both programs are distributed with PHAST. Operator splitting of the flow, transport, and geochemical equations is used to separate the three processes into three sequential calculations. No iterations between transport and reaction calculations are implemented. A three-dimensional Cartesian coordinate system and finite-difference techniques are used for the spatial and temporal discretization of the flow and transport equations. The non-linear chemical equilibrium equations are solved by a Newton-Raphson method, and the kinetic reaction equations are solved by a Runge-Kutta or an implicit method for integrating ordinary differential equations. The PHAST simulator may require large amounts of memory and long Central Processing Unit (CPU) times. To reduce the long CPU times, a parallel version of PHAST has been developed that runs on a multiprocessor computer or on a collection of computers that are networked. The parallel version requires Message Passing Interface, which is currently (2004) freely available. The parallel version is effective in reducing simulation times. This report documents the use of the PHAST simulator, including running the simulator, preparing the input files, selecting the output files, and visualizing the results. It also presents four examples that verify the numerical method and demonstrate the capabilities of the simulator. PHAST requires three input files. Only the flow and transport file is described in detail in this report. The other two files, the chemistry data file and the database file, are identical to PHREEQC files and the detailed description of these files is found in the PHREEQC documentation.
Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P
2014-10-30
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work stealing runtime, and report performance results for long-range HF energy calculation of large molecule/high quality basis running on up to 1024 cores of a high performance cluster machine. Copyright © 2014 Wiley Periodicals, Inc.
A computer program for converting rectangular coordinates to latitude-longitude coordinates
Rutledge, A.T.
1989-01-01
A computer program was developed for converting the coordinates of any rectangular grid on a map to coordinates on a grid that is parallel to lines of equal latitude and longitude. Using this program in conjunction with groundwater flow models, the user can extract data and results from models with varying grid orientations and place these data into grid structure that is oriented parallel to lines of equal latitude and longitude. All cells in the rectangular grid must have equal dimensions, and all cells in the latitude-longitude grid measure one minute by one minute. This program is applicable if the map used shows lines of equal latitude as arcs and lines of equal longitude as straight lines and assumes that the Earth 's surface can be approximated as a sphere. The program user enters the row number , column number, and latitude and longitude of the midpoint of the cell for three test cells on the rectangular grid. The latitude and longitude of boundaries of the rectangular grid also are entered. By solving sets of simultaneous linear equations, the program calculates coefficients that are used for making the conversion. As an option in the program, the user may build a groundwater model file based on a grid that is parallel to lines of equal latitude and longitude. The program reads a data file based on the rectangular coordinates and automatically forms the new data file. (USGS)
Parallel-vector solution of large-scale structural analysis problems on supercomputers
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.; Nguyen, Duc T.; Agarwal, Tarun K.
1989-01-01
A direct linear equation solution method based on the Choleski factorization procedure is presented which exploits both parallel and vector features of supercomputers. The new equation solver is described, and its performance is evaluated by solving structural analysis problems on three high-performance computers. The method has been implemented using Force, a generic parallel FORTRAN language.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Architecture and performances of the AGILE Telemetry Preprocessing System (TMPPS)
NASA Astrophysics Data System (ADS)
Trifoglio, M.; Bulgarelli, A.; Gianotti, F.; Lazzarotto, F.; Di Cocco, G.; Fuschino, F.; Tavani, M.
2008-07-01
AGILE is an Italian Space Agency (ASI) satellite dedicated to high energy Astrophysics. It was launched successfully on 23 April 2007, and it has been operated by the AGILE Ground Segment, consisting of the Ground Station located in Malindi (Kenia), the Mission Operations Centre (MOC) and the AGILE Data Centre (ADC) established in Italy, at Telespazio in Fucino and at the ASI Science Data Centre (ASDC) in Frascati respectively. Due to the low equatorial orbit at ~ 530 Km. with inclination angle of ~ 2.5°, the satellite passes over the Ground Station every ~ 100'. During the visibility period of . ~ 12', the Telemetry (TM) is down linked through two separated virtual channels, VC0 and VC1. The former is devoted to the real time TM generated during the pass at the average rate of 50 Kbit/s and is directly relayed to the Control Centre. The latter is used to downlink TM data collected on the satellite on-board mass memory during the non visibility period. This generates at the Ground Station a raw TM file of up to 37 MByte. Within 20' after the end of the contact, both the real time and mass memory TM arrive at ADC through the dedicated VPN ASINet. Here they are automatically detected and ingested by the TMPPS pipeline in less than 5 minutes. The TMPPS archives each TM file and sorts its packets into one stream for each of the different TM layout. Each stream is processed in parallel in order to unpack the various telemetry field and archive them into suitable FITS files. Each operation is tracked into a MySQL data base which interfaces the TMPPS pipeline to the rest of the scientific pipeline running at ADC. In this paper the architecture and the performance of the TMPPS will be described and discussed.
Simulation of a Rotorcraft in Turbulent Flows
1991-09-01
Knot) Aircraft Parallel Aircraft Parallel Aircraft Parallel To Ship’s To Port-To-Star- To Starboard- Centerline board Landing To-Port Landing Lineup ...Line Lineup Line 345 to 015/35 340 to 005/45 345 to 005145 016 t,) 040/30 006 to 035!35 006 to 025/40 041 to 180/45 036 to 050/30 026 to 040/30 181 to...WIND /FRA3 LOW REYNOLD’S NUMBER AERODYNAMICS FOR NACA0012 AIRFOIL REQUIRES DS/DM NACA0012/AIRFOIL NO SEQUENTIAL FILES REQUIRED INPUT FOR FORCE FRA3
Image Harvest: an open-source platform for high-throughput plant image processing and analysis
Knecht, Avi C.; Campbell, Malachy T.; Caprez, Adam; Swanson, David R.; Walia, Harkamal
2016-01-01
High-throughput plant phenotyping is an effective approach to bridge the genotype-to-phenotype gap in crops. Phenomics experiments typically result in large-scale image datasets, which are not amenable for processing on desktop computers, thus creating a bottleneck in the image-analysis pipeline. Here, we present an open-source, flexible image-analysis framework, called Image Harvest (IH), for processing images originating from high-throughput plant phenotyping platforms. Image Harvest is developed to perform parallel processing on computing grids and provides an integrated feature for metadata extraction from large-scale file organization. Moreover, the integration of IH with the Open Science Grid provides academic researchers with the computational resources required for processing large image datasets at no cost. Image Harvest also offers functionalities to extract digital traits from images to interpret plant architecture-related characteristics. To demonstrate the applications of these digital traits, a rice (Oryza sativa) diversity panel was phenotyped and genome-wide association mapping was performed using digital traits that are used to describe different plant ideotypes. Three major quantitative trait loci were identified on rice chromosomes 4 and 6, which co-localize with quantitative trait loci known to regulate agronomically important traits in rice. Image Harvest is an open-source software for high-throughput image processing that requires a minimal learning curve for plant biologists to analyzephenomics datasets. PMID:27141917
Fast in-memory elastic full-waveform inversion using consumer-grade GPUs
NASA Astrophysics Data System (ADS)
Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge
2017-04-01
Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.
Performance Models for the Spike Banded Linear System Solver
Manguoglu, Murat; Saied, Faisal; Sameh, Ahmed; ...
2011-01-01
With availability of large-scale parallel platforms comprised of tens-of-thousands of processors and beyond, there is significant impetus for the development of scalable parallel sparse linear system solvers and preconditioners. An integral part of this design process is the development of performance models capable of predicting performance and providing accurate cost models for the solvers and preconditioners. There has been some work in the past on characterizing performance of the iterative solvers themselves. In this paper, we investigate the problem of characterizing performance and scalability of banded preconditioners. Recent work has demonstrated the superior convergence properties and robustness of banded preconditioners,more » compared to state-of-the-art ILU family of preconditioners as well as algebraic multigrid preconditioners. Furthermore, when used in conjunction with efficient banded solvers, banded preconditioners are capable of significantly faster time-to-solution. Our banded solver, the Truncated Spike algorithm is specifically designed for parallel performance and tolerance to deep memory hierarchies. Its regular structure is also highly amenable to accurate performance characterization. Using these characteristics, we derive the following results in this paper: (i) we develop parallel formulations of the Truncated Spike solver, (ii) we develop a highly accurate pseudo-analytical parallel performance model for our solver, (iii) we show excellent predication capabilities of our model – based on which we argue the high scalability of our solver. Our pseudo-analytical performance model is based on analytical performance characterization of each phase of our solver. These analytical models are then parameterized using actual runtime information on target platforms. An important consequence of our performance models is that they reveal underlying performance bottlenecks in both serial and parallel formulations. All of our results are validated on diverse heterogeneous multiclusters – platforms for which performance prediction is particularly challenging. Finally, we provide predict the scalability of the Spike algorithm using up to 65,536 cores with our model. In this paper we extend the results presented in the Ninth International Symposium on Parallel and Distributed Computing.« less
A high-speed linear algebra library with automatic parallelism
NASA Technical Reports Server (NTRS)
Boucher, Michael L.
1994-01-01
Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.
An approach for coupled-code multiphysics core simulations from a common input
Schmidt, Rodney; Belcourt, Kenneth; Hooper, Russell; ...
2014-12-10
This study describes an approach for coupled-code multiphysics reactor core simulations that is being developed by the Virtual Environment for Reactor Applications (VERA) project in the Consortium for Advanced Simulation of Light-Water Reactors (CASL). In this approach a user creates a single problem description, called the “VERAIn” common input file, to define and setup the desired coupled-code reactor core simulation. A preprocessing step accepts the VERAIn file and generates a set of fully consistent input files for the different physics codes being coupled. The problem is then solved using a single-executable coupled-code simulation tool applicable to the problem, which ismore » built using VERA infrastructure software tools and the set of physics codes required for the problem of interest. The approach is demonstrated by performing an eigenvalue and power distribution calculation of a typical three-dimensional 17 × 17 assembly with thermal–hydraulic and fuel temperature feedback. All neutronics aspects of the problem (cross-section calculation, neutron transport, power release) are solved using the Insilico code suite and are fully coupled to a thermal–hydraulic analysis calculated by the Cobra-TF (CTF) code. The single-executable coupled-code (Insilico-CTF) simulation tool is created using several VERA tools, including LIME (Lightweight Integrating Multiphysics Environment for coupling codes), DTK (Data Transfer Kit), Trilinos, and TriBITS. Parallel calculations are performed on the Titan supercomputer at Oak Ridge National Laboratory using 1156 cores, and a synopsis of the solution results and code performance is presented. Finally, ongoing development of this approach is also briefly described.« less
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
2014-01-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
PDB Editor: a user-friendly Java-based Protein Data Bank file editor with a GUI.
Lee, Jonas; Kim, Sung Hou
2009-04-01
The Protein Data Bank file format is the format most widely used by protein crystallographers and biologists to disseminate and manipulate protein structures. Despite this, there are few user-friendly software packages available to efficiently edit and extract raw information from PDB files. This limitation often leads to many protein crystallographers wasting significant time manually editing PDB files. PDB Editor, written in Java Swing GUI, allows the user to selectively search, select, extract and edit information in parallel. Furthermore, the program is a stand-alone application written in Java which frees users from the hassles associated with platform/operating system-dependent installation and usage. PDB Editor can be downloaded from http://sourceforge.net/projects/pdbeditorjl/.
OpenMSI: A High-Performance Web-Based Platform for Mass Spectrometry Imaging
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rubel, Oliver; Greiner, Annette; Cholia, Shreyas
Mass spectrometry imaging (MSI) enables researchers to directly probe endogenous molecules directly within the architecture of the biological matrix. Unfortunately, efficient access, management, and analysis of the data generated by MSI approaches remain major challenges to this rapidly developing field. Despite the availability of numerous dedicated file formats and software packages, it is a widely held viewpoint that the biggest challenge is simply opening, sharing, and analyzing a file without loss of information. Here we present OpenMSI, a software framework and platform that addresses these challenges via an advanced, high-performance, extensible file format and Web API for remote data accessmore » (http://openmsi.nersc.gov). The OpenMSI file format supports storage of raw MSI data, metadata, and derived analyses in a single, self-describing format based on HDF5 and is supported by a large range of analysis software (e.g., Matlab and R) and programming languages (e.g., C++, Fortran, and Python). Careful optimization of the storage layout of MSI data sets using chunking, compression, and data replication accelerates common, selective data access operations while minimizing data storage requirements and are critical enablers of rapid data I/O. The OpenMSI file format has shown to provide >2000-fold improvement for image access operations, enabling spectrum and image retrieval in less than 0.3 s across the Internet even for 50 GB MSI data sets. To make remote high-performance compute resources accessible for analysis and to facilitate data sharing and collaboration, we describe an easy-to-use yet powerful Web API, enabling fast and convenient access to MSI data, metadata, and derived analysis results stored remotely to facilitate high-performance data analysis and enable implementation of Web based data sharing, visualization, and analysis.« less
NASA Technical Reports Server (NTRS)
Runnels, Tyson D.
1993-01-01
This is a case study. It deals with the use of a 'virtual file system' (VFS) for Boeing's UNIX-based Product Standards Data System (PSDS). One of the objectives of PSDS is to store digital standards documents. The file-storage requirements are that the files must be rapidly accessible, stored for long periods of time - as though they were paper, protected from disaster, and accumulative to about 80 billion characters (80 gigabytes). This volume of data will be approached in the first two years of the project's operation. The approach chosen is to install a hierarchical file migration system using optical disk cartridges. Files are migrated from high-performance media to lower performance optical media based on a least-frequency-used algorithm. The optical media are less expensive per character stored and are removable. Vital statistics about the removable optical disk cartridges are maintained in a database. The assembly of hardware and software acts as a single virtual file system transparent to the PSDS user. The files are copied to 'backup-and-recover' media whose vital statistics are also stored in the database. Seventeen months into operation, PSDS is storing 49 gigabytes. A number of operational and performance problems were overcome. Costs are under control. New and/or alternative uses for the VFS are being considered.
2007-03-01
software level retrieve state information that can inherently contain more contextual information . As a result, such mechanisms can be applied in more...ease by which state information can be gathered for monitoring purposes. For example, we consider soft security to allow for easier state retrieval ...files are to be checked and what parameters are to be verified. The independent auditor periodically retrieves information pertaining to the files in
Dynamic performance of high speed solenoid valve with parallel coils
NASA Astrophysics Data System (ADS)
Kong, Xiaowu; Li, Shizhen
2014-07-01
The methods of improving the dynamic performance of high speed on/off solenoid valve include increasing the magnetic force of armature and the slew rate of coil current, decreasing the mass and stroke of moving parts. The increase of magnetic force usually leads to the decrease of current slew rate, which could increase the delay time of the dynamic response of solenoid valve. Using a high voltage to drive coil can solve this contradiction, but a high driving voltage can also lead to more cost and a decrease of safety and reliability. In this paper, a new scheme of parallel coils is investigated, in which the single coil of solenoid is replaced by parallel coils with same ampere turns. Based on the mathematic model of high speed solenoid valve, the theoretical formula for the delay time of solenoid valve is deduced. Both the theoretical analysis and the dynamic simulation show that the effect of dividing a single coil into N parallel sub-coils is close to that of driving the single coil with N times of the original driving voltage as far as the delay time of solenoid valve is concerned. A specific test bench is designed to measure the dynamic performance of high speed on/off solenoid valve. The experimental results also prove that both the delay time and switching time of the solenoid valves can be decreased greatly by adopting the parallel coil scheme. This research presents a simple and practical method to improve the dynamic performance of high speed on/off solenoid valve.
NASA Astrophysics Data System (ADS)
Rizki, Permata Nur Miftahur; Lee, Heezin; Lee, Minsu; Oh, Sangyoon
2017-01-01
With the rapid advance of remote sensing technology, the amount of three-dimensional point-cloud data has increased extraordinarily, requiring faster processing in the construction of digital elevation models. There have been several attempts to accelerate the computation using parallel methods; however, little attention has been given to investigating different approaches for selecting the most suited parallel programming model for a given computing environment. We present our findings and insights identified by implementing three popular high-performance parallel approaches (message passing interface, MapReduce, and GPGPU) on time demanding but accurate kriging interpolation. The performances of the approaches are compared by varying the size of the grid and input data. In our empirical experiment, we demonstrate the significant acceleration by all three approaches compared to a C-implemented sequential-processing method. In addition, we also discuss the pros and cons of each method in terms of usability, complexity infrastructure, and platform limitation to give readers a better understanding of utilizing those parallel approaches for gridding purposes.
VCF-Explorer: filtering and analysing whole genome VCF files.
Akgün, Mete; Demirci, Hüseyin
2017-11-01
The decreasing cost in high-throughput technologies led to a number of sequencing projects consisting of thousands of whole genomes. The paradigm shift from exome to whole genome brings a significant increase in the size of output files. Most of the existing tools which are developed to analyse exome files are not adequate for larger VCF files produced by whole genome studies. In this work we present VCF-Explorer, a variant analysis software capable of handling large files. Memory efficiency and avoiding computationally costly pre-processing step enable to carry out the analysis to be performed with ordinary computers. VCF-Explorer provides an easy to use environment where users can define various types of queries based on variant and sample genotype level annotations. VCF-Explorer can be run in different environments and computational platforms ranging from a standard laptop to a high performance server. VCF-Explorer is freely available at: http://vcfexplorer.sourceforge.net/. mete.akgun@tubitak.gov.tr. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Microfluidic integration of parallel solid-phase liquid chromatography.
Huft, Jens; Haynes, Charles A; Hansen, Carl L
2013-03-05
We report the development of a fully integrated microfluidic chromatography system based on a recently developed column geometry that allows for robust packing of high-performance separation columns in poly(dimethylsiloxane) microfluidic devices having integrated valves made by multilayer soft lithography (MSL). The combination of parallel high-performance separation columns and on-chip plumbing was used to achieve a fully integrated system for on-chip chromatography, including all steps of automated sample loading, programmable gradient generation, separation, fluorescent detection, and sample recovery. We demonstrate this system in the separation of fluorescently labeled DNA and parallel purification of reverse transcription polymerase chain reaction (RT-PCR) amplified variable regions of mouse immunoglobulin genes using a strong anion exchange (AEX) resin. Parallel sample recovery in an immiscible oil stream offers the advantage of low sample dilution and high recovery rates. The ability to perform nucleic acid size selection and recovery on subnanogram samples of DNA holds promise for on-chip genomics applications including sequencing library preparation, cloning, and sample fractionation for diagnostics.
Distributed PACS using distributed file system with hierarchical meta data servers.
Hiroyasu, Tomoyuki; Minamitani, Yoshiyuki; Miki, Mitsunori; Yokouchi, Hisatake; Yoshimi, Masato
2012-01-01
In this research, we propose a new distributed PACS (Picture Archiving and Communication Systems) which is available to integrate several PACSs that exist in each medical institution. The conventional PACS controls DICOM file into one data-base. On the other hand, in the proposed system, DICOM file is separated into meta data and image data and those are stored individually. Using this mechanism, since file is not always accessed the entire data, some operations such as finding files, changing titles, and so on can be performed in high-speed. At the same time, as distributed file system is utilized, accessing image files can also achieve high-speed access and high fault tolerant. The introduced system has a more significant point. That is the simplicity to integrate several PACSs. In the proposed system, only the meta data servers are integrated and integrated system can be constructed. This system also has the scalability of file access with along to the number of file numbers and file sizes. On the other hand, because meta-data server is integrated, the meta data server is the weakness of this system. To solve this defect, hieratical meta data servers are introduced. Because of this mechanism, not only fault--tolerant ability is increased but scalability of file access is also increased. To discuss the proposed system, the prototype system using Gfarm was implemented. For evaluating the implemented system, file search operating time of Gfarm and NFS were compared.
Nose, Atsushi; Yamazaki, Tomohiro; Katayama, Hironobu; Uehara, Shuji; Kobayashi, Masatsugu; Shida, Sayaka; Odahara, Masaki; Takamiya, Kenichi; Matsumoto, Shizunori; Miyashita, Leo; Watanabe, Yoshihiro; Izawa, Takashi; Muramatsu, Yoshinori; Nitta, Yoshikazu; Ishikawa, Masatoshi
2018-04-24
We have developed a high-speed vision chip using 3D stacking technology to address the increasing demand for high-speed vision chips in diverse applications. The chip comprises a 1/3.2-inch, 1.27 Mpixel, 500 fps (0.31 Mpixel, 1000 fps, 2 × 2 binning) vision chip with 3D-stacked column-parallel Analog-to-Digital Converters (ADCs) and 140 Giga Operation per Second (GOPS) programmable Single Instruction Multiple Data (SIMD) column-parallel PEs for new sensing applications. The 3D-stacked structure and column parallel processing architecture achieve high sensitivity, high resolution, and high-accuracy object positioning.
AP-IO: asynchronous pipeline I/O for hiding periodic output cost in CFD simulation.
Xiaoguang, Ren; Xinhai, Xu
2014-01-01
Computational fluid dynamics (CFD) simulation often needs to periodically output intermediate results to files in the form of snapshots for visualization or restart, which seriously impacts the performance. In this paper, we present asynchronous pipeline I/O (AP-IO) optimization scheme for the periodically snapshot output on the basis of asynchronous I/O and CFD application characteristics. In AP-IO, dedicated background I/O processes or threads are in charge of handling the file write in pipeline mode, therefore the write overhead can be hidden with more calculation than classic asynchronous I/O. We design the framework of AP-IO and implement it in OpenFOAM, providing CFD users with a user-friendly interface. Experimental results on the Tianhe-2 supercomputer demonstrate that AP-IO can achieve a good optimization effect for the periodical snapshot output in CFD application, and the effect is especially better for massively parallel CFD simulations, which can reduce the total execution time up to about 40%.
AESOP: A Python Library for Investigating Electrostatics in Protein Interactions.
Harrison, Reed E S; Mohan, Rohith R; Gorham, Ronald D; Kieslich, Chris A; Morikis, Dimitrios
2017-05-09
Electric fields often play a role in guiding the association of protein complexes. Such interactions can be further engineered to accelerate complex association, resulting in protein systems with increased productivity. This is especially true for enzymes where reaction rates are typically diffusion limited. To facilitate quantitative comparisons of electrostatics in protein families and to describe electrostatic contributions of individual amino acids, we previously developed a computational framework called AESOP. We now implement this computational tool in Python with increased usability and the capability of performing calculations in parallel. AESOP utilizes PDB2PQR and Adaptive Poisson-Boltzmann Solver to generate grid-based electrostatic potential files for protein structures provided by the end user. There are methods within AESOP for quantitatively comparing sets of grid-based electrostatic potentials in terms of similarity or generating ensembles of electrostatic potential files for a library of mutants to quantify the effects of perturbations in protein structure and protein-protein association. Copyright © 2017 Biophysical Society. Published by Elsevier Inc. All rights reserved.
AP-IO: Asynchronous Pipeline I/O for Hiding Periodic Output Cost in CFD Simulation
Xiaoguang, Ren; Xinhai, Xu
2014-01-01
Computational fluid dynamics (CFD) simulation often needs to periodically output intermediate results to files in the form of snapshots for visualization or restart, which seriously impacts the performance. In this paper, we present asynchronous pipeline I/O (AP-IO) optimization scheme for the periodically snapshot output on the basis of asynchronous I/O and CFD application characteristics. In AP-IO, dedicated background I/O processes or threads are in charge of handling the file write in pipeline mode, therefore the write overhead can be hidden with more calculation than classic asynchronous I/O. We design the framework of AP-IO and implement it in OpenFOAM, providing CFD users with a user-friendly interface. Experimental results on the Tianhe-2 supercomputer demonstrate that AP-IO can achieve a good optimization effect for the periodical snapshot output in CFD application, and the effect is especially better for massively parallel CFD simulations, which can reduce the total execution time up to about 40%. PMID:24955390
Method for position emission mammography image reconstruction
Smith, Mark Frederick
2004-10-12
An image reconstruction method comprising accepting coincidence datat from either a data file or in real time from a pair of detector heads, culling event data that is outside a desired energy range, optionally saving the desired data for each detector position or for each pair of detector pixels on the two detector heads, and then reconstructing the image either by backprojection image reconstruction or by iterative image reconstruction. In the backprojection image reconstruction mode, rays are traced between centers of lines of response (LOR's), counts are then either allocated by nearest pixel interpolation or allocated by an overlap method and then corrected for geometric effects and attenuation and the data file updated. If the iterative image reconstruction option is selected, one implementation is to compute a grid Siddon retracing, and to perform maximum likelihood expectation maiximization (MLEM) computed by either: a) tracing parallel rays between subpixels on opposite detector heads; or b) tracing rays between randomized endpoint locations on opposite detector heads.
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors
NASA Astrophysics Data System (ADS)
Yi, Hongsuk
2014-03-01
Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
a Hadoop-Based Algorithm of Generating dem Grid from Point Cloud Data
NASA Astrophysics Data System (ADS)
Jian, X.; Xiao, X.; Chengfang, H.; Zhizhong, Z.; Zhaohui, W.; Dengzhong, Z.
2015-04-01
Airborne LiDAR technology has proven to be the most powerful tools to obtain high-density, high-accuracy and significantly detailed surface information of terrain and surface objects within a short time, and from which the Digital Elevation Model of high quality can be extracted. Point cloud data generated from the pre-processed data should be classified by segmentation algorithms, so as to differ the terrain points from disorganized points, then followed by a procedure of interpolating the selected points to turn points into DEM data. The whole procedure takes a long time and huge computing resource due to high-density, that is concentrated on by a number of researches. Hadoop is a distributed system infrastructure developed by the Apache Foundation, which contains a highly fault-tolerant distributed file system (HDFS) with high transmission rate and a parallel programming model (Map/Reduce). Such a framework is appropriate for DEM generation algorithms to improve efficiency. Point cloud data of Dongting Lake acquired by Riegl LMS-Q680i laser scanner was utilized as the original data to generate DEM by a Hadoop-based algorithms implemented in Linux, then followed by another traditional procedure programmed by C++ as the comparative experiment. Then the algorithm's efficiency, coding complexity, and performance-cost ratio were discussed for the comparison. The results demonstrate that the algorithm's speed depends on size of point set and density of DEM grid, and the non-Hadoop implementation can achieve a high performance when memory is big enough, but the multiple Hadoop implementation can achieve a higher performance-cost ratio, while point set is of vast quantities on the other hand.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cohen, J; Dossa, D; Gokhale, M
Critical data science applications requiring frequent access to storage perform poorly on today's computing architectures. This project addresses efficient computation of data-intensive problems in national security and basic science by exploring, advancing, and applying a new form of computing called storage-intensive supercomputing (SISC). Our goal is to enable applications that simply cannot run on current systems, and, for a broad range of data-intensive problems, to deliver an order of magnitude improvement in price/performance over today's data-intensive architectures. This technical report documents much of the work done under LDRD 07-ERD-063 Storage Intensive Supercomputing during the period 05/07-09/07. The following chapters describe:more » (1) a new file I/O monitoring tool iotrace developed to capture the dynamic I/O profiles of Linux processes; (2) an out-of-core graph benchmark for level-set expansion of scale-free graphs; (3) an entity extraction benchmark consisting of a pipeline of eight components; and (4) an image resampling benchmark drawn from the SWarp program in the LSST data processing pipeline. The performance of the graph and entity extraction benchmarks was measured in three different scenarios: data sets residing on the NFS file server and accessed over the network; data sets stored on local disk; and data sets stored on the Fusion I/O parallel NAND Flash array. The image resampling benchmark compared performance of software-only to GPU-accelerated. In addition to the work reported here, an additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification. The n-gram application will be presented at SC07 at the High Performance Reconfigurable Computing Technologies and Applications Workshop. The graph and entity extraction benchmarks were run on a Supermicro server housing the NAND Flash 40GB parallel disk array, the Fusion-io. The Fusion system specs are as follows: SuperMicro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that showed greater that 50% of the time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit to boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolative memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.« less
Expediting Scientific Data Analysis with Reorganization of Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Byna, Surendra; Wu, Kesheng
2013-08-19
Data producers typically optimize the layout of data files to minimize the write time. In most cases, data analysis tasks read these files in access patterns different from the write patterns causing poor read performance. In this paper, we introduce Scientific Data Services (SDS), a framework for bridging the performance gap between writing and reading scientific data. SDS reorganizes data to match the read patterns of analysis tasks and enables transparent data reads from the reorganized data. We implemented a HDF5 Virtual Object Layer (VOL) plugin to redirect the HDF5 dataset read calls to the reorganized data. To demonstrate themore » effectiveness of SDS, we applied two parallel data organization techniques: a sort-based organization on a plasma physics data and a transpose-based organization on mass spectrometry imaging data. We also extended the HDF5 data access API to allow selection of data based on their values through a query interface, called SDS Query. We evaluated the execution time in accessing various subsets of data through existing HDF5 Read API and SDS Query. We showed that reading the reorganized data using SDS is up to 55X faster than reading the original data.« less
Paucke, Madlen; Oppermann, Frank; Koch, Iring; Jescheniak, Jörg D
2015-12-01
Previous dual-task picture-naming studies suggest that lexical processes require capacity-limited processes and prevent other tasks to be carried out in parallel. However, studies involving the processing of multiple pictures suggest that parallel lexical processing is possible. The present study investigated the specific costs that may arise when such parallel processing occurs. We used a novel dual-task paradigm by presenting 2 visual objects associated with different tasks and manipulating between-task similarity. With high similarity, a picture-naming task (T1) was combined with a phoneme-decision task (T2), so that lexical processes were shared across tasks. With low similarity, picture-naming was combined with a size-decision T2 (nonshared lexical processes). In Experiment 1, we found that a manipulation of lexical processes (lexical frequency of T1 object name) showed an additive propagation with low between-task similarity and an overadditive propagation with high between-task similarity. Experiment 2 replicated this differential forward propagation of the lexical effect and showed that it disappeared with longer stimulus onset asynchronies. Moreover, both experiments showed backward crosstalk, indexed as worse T1 performance with high between-task similarity compared with low similarity. Together, these findings suggest that conditions of high between-task similarity can lead to parallel lexical processing in both tasks, which, however, does not result in benefits but rather in extra performance costs. These costs can be attributed to crosstalk based on the dual-task binding problem arising from parallel processing. Hence, the present study reveals that capacity-limited lexical processing can run in parallel across dual tasks but only at the expense of extraordinary high costs. (c) 2015 APA, all rights reserved).
FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences.
Waldmann, Jost; Gerken, Jan; Hankeln, Wolfgang; Schweer, Timmy; Glöckner, Frank Oliver
2014-06-14
Advances in sequencing technologies challenge the efficient importing and validation of FASTA formatted sequence data which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy is hampered. FastaValidator represents a platform-independent, standardized, light-weight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software which needs to parse quickly and accurately large amounts of sequence data. For end-users FastaValidator includes an interactive out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines. The accuracy and performance of the FastaValidator library qualifies it for large data sets such as those commonly produced by massive parallel (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.
Computational Issues in Damping Identification for Large Scale Problems
NASA Technical Reports Server (NTRS)
Pilkey, Deborah L.; Roe, Kevin P.; Inman, Daniel J.
1997-01-01
Two damping identification methods are tested for efficiency in large-scale applications. One is an iterative routine, and the other a least squares method. Numerical simulations have been performed on multiple degree-of-freedom models to test the effectiveness of the algorithm and the usefulness of parallel computation for the problems. High Performance Fortran is used to parallelize the algorithm. Tests were performed using the IBM-SP2 at NASA Ames Research Center. The least squares method tested incurs high communication costs, which reduces the benefit of high performance computing. This method's memory requirement grows at a very rapid rate meaning that larger problems can quickly exceed available computer memory. The iterative method's memory requirement grows at a much slower pace and is able to handle problems with 500+ degrees of freedom on a single processor. This method benefits from parallelization, and significant speedup can he seen for problems of 100+ degrees-of-freedom.
NASA Technical Reports Server (NTRS)
Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)
1993-01-01
A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
VisIVO: A Library and Integrated Tools for Large Astrophysical Dataset Exploration
NASA Astrophysics Data System (ADS)
Becciani, U.; Costa, A.; Ersotelos, N.; Krokos, M.; Massimino, P.; Petta, C.; Vitello, F.
2012-09-01
VisIVO provides an integrated suite of tools and services that can be used in many scientific fields. VisIVO development starts in the Virtual Observatory framework. VisIVO allows users to visualize meaningfully highly-complex, large-scale datasets and create movies of these visualizations based on distributed infrastructures. VisIVO supports high-performance, multi-dimensional visualization of large-scale astrophysical datasets. Users can rapidly obtain meaningful visualizations while preserving full and intuitive control of the relevant parameters. VisIVO consists of VisIVO Desktop - a stand-alone application for interactive visualization on standard PCs, VisIVO Server - a platform for high performance visualization, VisIVO Web - a custom designed web portal, VisIVOSmartphone - an application to exploit the VisIVO Server functionality and the latest VisIVO features: VisIVO Library allows a job running on a computational system (grid, HPC, etc.) to produce movies directly with the code internal data arrays without the need to produce intermediate files. This is particularly important when running on large computational facilities, where the user wants to have a look at the results during the data production phase. For example, in grid computing facilities, images can be produced directly in the grid catalogue while the user code is running in a system that cannot be directly accessed by the user (a worker node). The deployment of VisIVO on the DG and gLite is carried out with the support of EDGI and EGI-Inspire projects. Depending on the structure and size of datasets under consideration, the data exploration process could take several hours of CPU for creating customized views and the production of movies could potentially last several days. For this reason an MPI parallel version of VisIVO could play a fundamental role in increasing performance, e.g. it could be automatically deployed on nodes that are MPI aware. A central concept in our development is thus to produce unified code that can run either on serial nodes or in parallel by using HPC oriented grid nodes. Another important aspect, to obtain as high performance as possible, is the integration of VisIVO processes with grid nodes where GPUs are available. We have selected CUDA for implementing a range of computationally heavy modules. VisIVO is supported by EGI-Inspire, EDGI and SCI-BUS projects.
Parallel, Distributed Scripting with Python
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, P J
2002-05-24
Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI librarymore » gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider the a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.« less
Computer architecture evaluation for structural dynamics computations: Project summary
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1989-01-01
The intent of the proposed effort is the examination of the impact of the elements of parallel architectures on the performance realized in a parallel computation. To this end, three major projects are developed: a language for the expression of high level parallelism, a statistical technique for the synthesis of multicomputer interconnection networks based upon performance prediction, and a queueing model for the analysis of shared memory hierarchies.
Terascale Cluster for Advanced Turbulent Combustion Simulations
2008-07-25
the system We have given the name CATS (for Combustion And Turbulence Simulator) to the terascale system that was obtained through this grant. CATS ...lnfiniBand interconnect. CATS includes an interactive login node and a file server, each holding in excess of 1 terabyte of file storage. The 35 active...compute nodes of CATS enable us to run up to 140-core parallel MPI batch jobs; one node is reserved to run the scheduler. CATS is operated and
Processing Benefits of Resonance Acoustic Mixing on High Performance Propellants and Explosives
2012-02-01
slightly greater stress Modulus similar Dewetting Distribution Statement A: Approved for Public Release Tensile Comparison File: NAVAIR Brief 18...greater stress Modulus similar Dewetting Distribution Statement A: Approved for Public Release Resodyn Mixed Explosive 19 File: NAVAIR Brief
Parallel Scaling Characteristics of Selected NERSC User ProjectCodes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Skinner, David; Verdier, Francesca; Anand, Harsh
This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems.more » An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.« less
Designing for Peta-Scale in the LSST Database
NASA Astrophysics Data System (ADS)
Kantor, J.; Axelrod, T.; Becla, J.; Cook, K.; Nikolaev, S.; Gray, J.; Plante, R.; Nieto-Santisteban, M.; Szalay, A.; Thakar, A.
2007-10-01
The Large Synoptic Survey Telescope (LSST), a proposed ground-based 8.4 m telescope with a 10 deg^2 field of view, will generate 15 TB of raw images every observing night. When calibration and processed data are added, the image archive, catalogs, and meta-data will grow 15 PB yr^{-1} on average. The LSST Data Management System (DMS) must capture, process, store, index, replicate, and provide open access to this data. Alerts must be triggered within 30 s of data acquisition. To do this in real-time at these data volumes will require advances in data management, database, and file system techniques. This paper describes the design of the LSST DMS and emphasizes features for peta-scale data. The LSST DMS will employ a combination of distributed database and file systems, with schema, partitioning, and indexing oriented for parallel operations. Image files are stored in a distributed file system with references to, and meta-data from, each file stored in the databases. The schema design supports pipeline processing, rapid ingest, and efficient query. Vertical partitioning reduces disk input/output requirements, horizontal partitioning allows parallel data access using arrays of servers and disks. Indexing is extensive, utilizing both conventional RAM-resident indexes and column-narrow, row-deep tag tables/covering indices that are extracted from tables that contain many more attributes. The DMS Data Access Framework is encapsulated in a middleware framework to provide a uniform service interface to all framework capabilities. This framework will provide the automated work-flow, replication, and data analysis capabilities necessary to make data processing and data quality analysis feasible at this scale.
A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing.
Li, Zhixin; Su, Dandan; Zhu, Haijiang; Li, Wei; Zhang, Fan; Li, Ruirui
2017-01-08
Synthetic Aperture Radar (SAR) raw data simulation is a fundamental problem in radar system design and imaging algorithm research. The growth of surveying swath and resolution results in a significant increase in data volume and simulation period, which can be considered to be a comprehensive data intensive and computing intensive issue. Although several high performance computing (HPC) methods have demonstrated their potential for accelerating simulation, the input/output (I/O) bottleneck of huge raw data has not been eased. In this paper, we propose a cloud computing based SAR raw data simulation algorithm, which employs the MapReduce model to accelerate the raw data computing and the Hadoop distributed file system (HDFS) for fast I/O access. The MapReduce model is designed for the irregular parallel accumulation of raw data simulation, which greatly reduces the parallel efficiency of graphics processing unit (GPU) based simulation methods. In addition, three kinds of optimization strategies are put forward from the aspects of programming model, HDFS configuration and scheduling. The experimental results show that the cloud computing based algorithm achieves 4_ speedup over the baseline serial approach in an 8-node cloud environment, and each optimization strategy can improve about 20%. This work proves that the proposed cloud algorithm is capable of solving the computing intensive and data intensive issues in SAR raw data simulation, and is easily extended to large scale computing to achieve higher acceleration.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mercier, C.W.
The Network File System (NFS) will be the user interface to a High-Performance Data System (HPDS) being developed at Los Alamos National Laboratory (LANL). HPDS will manage high-capacity, high-performance storage systems connected directly to a high-speed network from distributed workstations. NFS will be modified to maximize performance and to manage massive amounts of data. 6 refs., 3 figs.
Photonic content-addressable memory system that uses a parallel-readout optical disk
NASA Astrophysics Data System (ADS)
Krishnamoorthy, Ashok V.; Marchand, Philippe J.; Yayla, Gökçe; Esener, Sadik C.
1995-11-01
We describe a high-performance associative-memory system that can be implemented by means of an optical disk modified for parallel readout and a custom-designed silicon integrated circuit with parallel optical input. The system can achieve associative recall on 128 \\times 128 bit images and also on variable-size subimages. The system's behavior and performance are evaluated on the basis of experimental results on a motionless-head parallel-readout optical-disk system, logic simulations of the very-large-scale integrated chip, and a software emulation of the overall system.
Scalable parallel communications
NASA Technical Reports Server (NTRS)
Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.
1992-01-01
Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.
Concurrent Probabilistic Simulation of High Temperature Composite Structural Response
NASA Technical Reports Server (NTRS)
Abdi, Frank
1996-01-01
A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data, (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of Phase-2 effort provides a good basis for continuation and warrants Phase-3 government, and industry partnership.
Multi-Modulator for Bandwidth-Efficient Communication
NASA Technical Reports Server (NTRS)
Gray, Andrew; Lee, Dennis; Lay, Norman; Cheetham, Craig; Fong, Wai; Yeh, Pen-Shu; King, Robin; Ghuman, Parminder; Hoy, Scott; Fisher, Dave
2009-01-01
A modulator circuit board has recently been developed to be used in conjunction with a vector modulator to generate any of a large number of modulations for bandwidth-efficient radio transmission of digital data signals at rates than can exceed 100 Mb/s. The modulations include quadrature phaseshift keying (QPSK), offset quadrature phase-shift keying (OQPSK), Gaussian minimum-shift keying (GMSK), and octonary phase-shift keying (8PSK) with square-root raised-cosine pulse shaping. The figure is a greatly simplified block diagram showing the relationship between the modulator board and the rest of the transmitter. The role of the modulator board is to encode the incoming data stream and to shape the resulting pulses, which are fed as inputs to the vector modulator. The combination of encoding and pulse shaping in a given application is chosen to maximize the bandwidth efficiency. The modulator board includes gallium arsenide serial-to-parallel converters at its input end. A complementary metal oxide/semiconductor (CMOS) field-programmable gate array (FPGA) performs the coding and modulation computations and utilizes parallel processing in doing so. The results of the parallel computation are combined and converted to pulse waveforms by use of gallium arsenide parallel-to-serial converters integrated with digital-to-analog converters. Without changing the hardware, one can configure the modulator to produce any of the designed combinations of coding and modulation by loading the appropriate bit configuration file into the FPGA.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sublet, J.-Ch.; Koning, A.J.; Forrest, R.A.
The reasons for the conversion of the European Activation File, EAF into ENDF-6 format are threefold. First, it significantly enhances the JEFF-3.0 release by the addition of an activation file. Second, to considerably increase its usage by using a recognized, official file format, allowing existing plug-in processes to be effective; and third, to move towards a universal nuclear data file in contrast to the current separate general and special-purpose files. The format chosen for the JEFF-3.0/A file uses reaction cross sections (MF-3), cross sections (MF-10), and multiplicities (MF-9). Having the data in ENDF-6 format allows the ENDF suite of utilitiesmore » and checker codes to be used alongside many other utility, visualizing, and processing codes. It is based on the EAF activation file used for many applications from fission to fusion, including dosimetry, inventories, depletion-transmutation, and geophysics. JEFF-3.0/A takes advantage of four generations of EAF files. Extensive benchmarking activities on these files provide feedback and validation with integral measurements. These, in parallel with a detailed graphical analysis based on EXFOR, have been applied stimulating new measurements, significantly increasing the quality of this activation file. The next step is to include the EAF uncertainty data for all channels into JEFF-3.0/A.« less
NASA Astrophysics Data System (ADS)
Timchenko, Leonid; Yarovyi, Andrii; Kokriatskaya, Nataliya; Nakonechna, Svitlana; Abramenko, Ludmila; Ławicki, Tomasz; Popiel, Piotr; Yesmakhanova, Laura
2016-09-01
The paper presents a method of parallel-hierarchical transformations for rapid recognition of dynamic images using GPU technology. Direct parallel-hierarchical transformations based on cluster CPU-and GPU-oriented hardware platform. Mathematic models of training of the parallel hierarchical (PH) network for the transformation are developed, as well as a training method of the PH network for recognition of dynamic images. This research is most topical for problems on organizing high-performance computations of super large arrays of information designed to implement multi-stage sensing and processing as well as compaction and recognition of data in the informational structures and computer devices. This method has such advantages as high performance through the use of recent advances in parallelization, possibility to work with images of ultra dimension, ease of scaling in case of changing the number of nodes in the cluster, auto scan of local network to detect compute nodes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of manymore » computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.« less
O'Donnell, Michael
2015-01-01
State-and-transition simulation modeling relies on knowledge of vegetation composition and structure (states) that describe community conditions, mechanistic feedbacks such as fire that can affect vegetation establishment, and ecological processes that drive community conditions as well as the transitions between these states. However, as the need for modeling larger and more complex landscapes increase, a more advanced awareness of computing resources becomes essential. The objectives of this study include identifying challenges of executing state-and-transition simulation models, identifying common bottlenecks of computing resources, developing a workflow and software that enable parallel processing of Monte Carlo simulations, and identifying the advantages and disadvantages of different computing resources. To address these objectives, this study used the ApexRMS® SyncroSim software and embarrassingly parallel tasks of Monte Carlo simulations on a single multicore computer and on distributed computing systems. The results demonstrated that state-and-transition simulation models scale best in distributed computing environments, such as high-throughput and high-performance computing, because these environments disseminate the workloads across many compute nodes, thereby supporting analysis of larger landscapes, higher spatial resolution vegetation products, and more complex models. Using a case study and five different computing environments, the top result (high-throughput computing versus serial computations) indicated an approximate 96.6% decrease of computing time. With a single, multicore compute node (bottom result), the computing time indicated an 81.8% decrease relative to using serial computations. These results provide insight into the tradeoffs of using different computing resources when research necessitates advanced integration of ecoinformatics incorporating large and complicated data inputs and models. - See more at: http://aimspress.com/aimses/ch/reader/view_abstract.aspx?file_no=Environ2015030&flag=1#sthash.p1XKDtF8.dpuf
Visualization and Tracking of Parallel CFD Simulations
NASA Technical Reports Server (NTRS)
Vaziri, Arsi; Kremenetsky, Mark
1995-01-01
We describe a system for interactive visualization and tracking of a 3-D unsteady computational fluid dynamics (CFD) simulation on a parallel computer. CM/AVS, a distributed, parallel implementation of a visualization environment (AVS) runs on the CM-5 parallel supercomputer. A CFD solver is run as a CM/AVS module on the CM-5. Data communication between the solver, other parallel visualization modules, and a graphics workstation, which is running AVS, are handled by CM/AVS. Partitioning of the visualization task, between CM-5 and the workstation, can be done interactively in the visual programming environment provided by AVS. Flow solver parameters can also be altered by programmable interactive widgets. This system partially removes the requirement of storing large solution files at frequent time steps, a characteristic of the traditional 'simulate (yields) store (yields) visualize' post-processing approach.
Arkas: Rapid reproducible RNAseq analysis
Colombo, Anthony R.; J. Triche Jr, Timothy; Ramsingh, Giridharan
2017-01-01
The recently introduced Kallisto pseudoaligner has radically simplified the quantification of transcripts in RNA-sequencing experiments. We offer cloud-scale RNAseq pipelines Arkas-Quantification, and Arkas-Analysis available within Illumina’s BaseSpace cloud application platform which expedites Kallisto preparatory routines, reliably calculates differential expression, and performs gene-set enrichment of REACTOME pathways . Due to inherit inefficiencies of scale, Illumina's BaseSpace computing platform offers a massively parallel distributive environment improving data management services and data importing. Arkas-Quantification deploys Kallisto for parallel cloud computations and is conveniently integrated downstream from the BaseSpace Sequence Read Archive (SRA) import/conversion application titled SRA Import. Arkas-Analysis annotates the Kallisto results by extracting structured information directly from source FASTA files with per-contig metadata, calculates the differential expression and gene-set enrichment analysis on both coding genes and transcripts. The Arkas cloud pipeline supports ENSEMBL transcriptomes and can be used downstream from the SRA Import facilitating raw sequencing importing, SRA FASTQ conversion, RNA quantification and analysis steps. PMID:28868134
Portable multi-node LQCD Monte Carlo simulations using OpenACC
NASA Astrophysics Data System (ADS)
Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to be adopted to maximize performances, we then describe selected relevant details of the code, and finally measure the level of performance and scaling-performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performances for this application, but also compares with results measured on other processors.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Chase Qishi
A number of Department of Energy (DOE) science applications, involving exascale computing systems and large experimental facilities, are expected to generate large volumes of data, in the range of petabytes to exabytes, which will be transported over wide-area networks for the purpose of storage, visualization, and analysis. To support such capabilities, significant progress has been made in various components including the deployment of 100 Gbps networks with future 1 Tbps bandwidth, increases in end-host capabilities with multiple cores and buses, capacity improvements in large disk arrays, and deployment of parallel file systems such as Lustre and GPFS. High-performance source-to-sink datamore » flows must be composed of these component systems, which requires significant optimizations of the storage-to-host data and execution paths to match the edge and long-haul network connections. In particular, end systems are currently supported by 10-40 Gbps Network Interface Cards (NIC) and 8-32 Gbps storage Host Channel Adapters (HCAs), which carry the individual flows that collectively must reach network speeds of 100 Gbps and higher. Indeed, such data flows must be synthesized using multicore, multibus hosts connected to high-performance storage systems on one side and to the network on the other side. Current experimental results show that the constituent flows must be optimally composed and preserved from storage systems, across the hosts and the networks with minimal interference. Furthermore, such a capability must be made available transparently to the science users without placing undue demands on them to account for the details of underlying systems and networks. And, this task is expected to become even more complex in the future due to the increasing sophistication of hosts, storage systems, and networks that constitute the high-performance flows. The objectives of this proposal are to (1) develop and test the component technologies and their synthesis methods to achieve source-to-sink high-performance flows, and (2) develop tools that provide these capabilities through simple interfaces to users and applications. In terms of the former, we propose to develop (1) optimization methods that align and transition multiple storage flows to multiple network flows on multicore, multibus hosts; and (2) edge and long-haul network path realization and maintenance using advanced provisioning methods including OSCARS and OpenFlow. We also propose synthesis methods that combine these individual technologies to compose high-performance flows using a collection of constituent storage-network flows, and realize them across the storage and local network connections as well as long-haul connections. We propose to develop automated user tools that profile the hosts, storage systems, and network connections; compose the source-to-sink complex flows; and set up and maintain the needed network connections. These solutions will be tested using (1) 100 Gbps connection(s) between Oak Ridge National Laboratory (ORNL) and Argonne National Laboratory (ANL) with storage systems supported by Lustre and GPFS file systems with an asymmetric connection to University of Memphis (UM); (2) ORNL testbed with multicore and multibus hosts, switches with OpenFlow capabilities, and network emulators; and (3) 100 Gbps connections from ESnet and their Openflow testbed, and other experimental connections. This proposal brings together the expertise and facilities of the two national laboratories, ORNL and ANL, and UM. It also represents a collaboration between DOE and the Department of Defense (DOD) projects at ORNL by sharing technical expertise and personnel costs, and leveraging the existing DOD Extreme Scale Systems Center (ESSC) facilities at ORNL.« less
Profiling and Improving I/O Performance of a Large-Scale Climate Scientific Application
NASA Technical Reports Server (NTRS)
Liu, Zhuo; Wang, Bin; Wang, Teng; Tian, Yuan; Xu, Cong; Wang, Yandong; Yu, Weikuan; Cruz, Carlos A.; Zhou, Shujia; Clune, Tom;
2013-01-01
Exascale computing systems are soon to emerge, which will pose great challenges on the huge gap between computing and I/O performance. Many large-scale scientific applications play an important role in our daily life. The huge amounts of data generated by such applications require highly parallel and efficient I/O management policies. In this paper, we adopt a mission-critical scientific application, GEOS-5, as a case to profile and analyze the communication and I/O issues that are preventing applications from fully utilizing the underlying parallel storage systems. Through in-detail architectural and experimental characterization, we observe that current legacy I/O schemes incur significant network communication overheads and are unable to fully parallelize the data access, thus degrading applications' I/O performance and scalability. To address these inefficiencies, we redesign its I/O framework along with a set of parallel I/O techniques to achieve high scalability and performance. Evaluation results on the NASA discover cluster show that our optimization of GEOS-5 with ADIOS has led to significant performance improvements compared to the original GEOS-5 implementation.
Compute Server Performance Results
NASA Technical Reports Server (NTRS)
Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)
1994-01-01
Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single processor rate of any vendor. However, if the price-performance ration (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPR's of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite,
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David
2006-05-01
The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has soon introduced new processing challenges. The price paid for the wealth spatial and spectral information available from hyperspectral sensors is the enormous amounts of data that they generate. Several applications exist, however, where having the desired information calculated quickly enough for practical use is highly desirable. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed-up computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques. Techniques include four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, and (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in selection of parallel hyperspectral algorithms for specific applications.
Reliable file sharing in distributed operating system using web RTC
NASA Astrophysics Data System (ADS)
Dukiya, Rajesh
2017-12-01
Since, the evolution of distributed operating system, distributed file system is come out to be important part in operating system. P2P is a reliable way in Distributed Operating System for file sharing. It was introduced in 1999, later it became a high research interest topic. Peer to Peer network is a type of network, where peers share network workload and other load related tasks. A P2P network can be a period of time connection, where a bunch of computers connected by a USB (Universal Serial Bus) port to transfer or enable disk sharing i.e. file sharing. Currently P2P requires special network that should be designed in P2P way. Nowadays, there is a big influence of browsers in our life. In this project we are going to study of file sharing mechanism in distributed operating system in web browsers, where we will try to find performance bottlenecks which our research will going to be an improvement in file sharing by performance and scalability in distributed file systems. Additionally, we will discuss the scope of Web Torrent file sharing and free-riding in peer to peer networks.
Zhu, Xinjie; Zhang, Qiang; Ho, Eric Dun; Yu, Ken Hung-On; Liu, Chris; Huang, Tim H; Cheng, Alfred Sze-Lok; Kao, Ben; Lo, Eric; Yip, Kevin Y
2017-09-22
A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions. Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/ ), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples. Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.
Image Harvest: an open-source platform for high-throughput plant image processing and analysis.
Knecht, Avi C; Campbell, Malachy T; Caprez, Adam; Swanson, David R; Walia, Harkamal
2016-05-01
High-throughput plant phenotyping is an effective approach to bridge the genotype-to-phenotype gap in crops. Phenomics experiments typically result in large-scale image datasets, which are not amenable for processing on desktop computers, thus creating a bottleneck in the image-analysis pipeline. Here, we present an open-source, flexible image-analysis framework, called Image Harvest (IH), for processing images originating from high-throughput plant phenotyping platforms. Image Harvest is developed to perform parallel processing on computing grids and provides an integrated feature for metadata extraction from large-scale file organization. Moreover, the integration of IH with the Open Science Grid provides academic researchers with the computational resources required for processing large image datasets at no cost. Image Harvest also offers functionalities to extract digital traits from images to interpret plant architecture-related characteristics. To demonstrate the applications of these digital traits, a rice (Oryza sativa) diversity panel was phenotyped and genome-wide association mapping was performed using digital traits that are used to describe different plant ideotypes. Three major quantitative trait loci were identified on rice chromosomes 4 and 6, which co-localize with quantitative trait loci known to regulate agronomically important traits in rice. Image Harvest is an open-source software for high-throughput image processing that requires a minimal learning curve for plant biologists to analyzephenomics datasets. © The Author 2016. Published by Oxford University Press on behalf of the Society for Experimental Biology.
Multi-Zone Liquid Thrust Chamber Performance Code with Domain Decomposition for Parallel Processing
NASA Technical Reports Server (NTRS)
Navaz, Homayun K.
2002-01-01
Computational Fluid Dynamics (CFD) has considerably evolved in the last decade. There are many computer programs that can perform computations on viscous internal or external flows with chemical reactions. CFD has become a commonly used tool in the design and analysis of gas turbines, ramjet combustors, turbo-machinery, inlet ducts, rocket engines, jet interaction, missile, and ramjet nozzles. One of the problems of interest to NASA has always been the performance prediction for rocket and air-breathing engines. Due to the complexity of flow in these engines it is necessary to resolve the flowfield into a fine mesh to capture quantities like turbulence and heat transfer. However, calculation on a high-resolution grid is associated with a prohibitively increasing computational time that can downgrade the value of the CFD for practical engineering calculations. The Liquid Thrust Chamber Performance (LTCP) code was developed for NASA/MSFC (Marshall Space Flight Center) to perform liquid rocket engine performance calculations. This code is a 2D/axisymmetric full Navier-Stokes (NS) solver with fully coupled finite rate chemistry and Eulerian treatment of liquid fuel and/or oxidizer droplets. One of the advantages of this code has been the resemblance of its input file to the JANNAF (Joint Army Navy NASA Air Force Interagency Propulsion Committee) standard TDK code, and its automatic grid generation for JANNAF defined combustion chamber wall geometry. These options minimize the learning effort for TDK users, and make the code a good candidate for performing engineering calculations. Although the LTCP code was developed for liquid rocket engines, it is a general-purpose code and has been used for solving many engineering problems. However, the single zone formulation of the LTCP has limited the code to be applicable to problems with complex geometry. Furthermore, the computational time becomes prohibitively large for high-resolution problems with chemistry, two-equation turbulence model, and two-phase flow. To overcome these limitations, the LTCP code is rewritten to include the multi-zone capability with domain decomposition that makes it suitable for parallel processing, i.e., enabling the code to run every zone or sub-domain on a separate processor. This can reduce the run time by a factor of 6 to 8, depending on the problem.
High Performance Programming Using Explicit Shared Memory Model on Cray T3D1
NASA Technical Reports Server (NTRS)
Simon, Horst D.; Saini, Subhash; Grassi, Charles
1994-01-01
The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under Cray Research adaptive Fortran (CRAFT) model four programming methods (data parallel, work sharing, message-passing using PVM, and explicit shared memory model) are available to the users. However, at this time data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that the performance of neither standard PVM nor CRI s PVM exploits the hardware capabilities of the T3D. The reasons for the bad performance of PVM as a native message-passing library are presented. This is illustrated by the performance of NAS Parallel Benchmarks (NPB) programmed in explicit shared memory model on Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than obtained by using explicit shared memory model. This degradation in performance is also seen on CM-5 where the performance of applications using native message-passing library CMMD on CM-5 is also about 4 to 5 times less than using data parallel methods. The issues involved (such as barriers, synchronization, invalidating data cache, aligning data cache etc.) while programming in explicit shared memory model are discussed. Comparative performance of NPB using explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, IBM-SP1, etc. is presented.
Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.
2014-01-01
Abstract. Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
Tankam, Patrice; Santhanam, Anand P; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P
2014-07-01
Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing.
Summary of Documentation for DYNA3D-ParaDyn's Software Quality Assurance Regression Test Problems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zywicz, Edward
The Software Quality Assurance (SQA) regression test suite for DYNA3D (Zywicz and Lin, 2015) and ParaDyn (DeGroot, et al., 2015) currently contains approximately 600 problems divided into 21 suites, and is a required component of ParaDyn’s SQA plan (Ferencz and Oliver, 2013). The regression suite allows developers to ensure that software modifications do not unintentionally alter the code response. The entire regression suite is run prior to permanently incorporating any software modification or addition. When code modifications alter test problem results, the specific cause must be determined and fully understood before the software changes and revised test answers can bemore » incorporated. The regression suite is executed on LLNL platforms using a Python script and an associated data file. The user specifies the DYNA3D or ParaDyn executable, number of processors to use, test problems to run, and other options to the script. The data file details how each problem and its answer extraction scripts are executed. For each problem in the regression suite there exists an input deck, an eight-processor partition file, an answer file, and various extraction scripts. These scripts assemble a temporary answer file in a specific format from the simulation results. The temporary and stored answer files are compared to a specific level of numerical precision, and when differences are detected the test problem is flagged as failed. Presently, numerical results are stored and compared to 16 digits. At this accuracy level different processor types, compilers, number of partitions, etc. impact the results to various degrees. Thus, for consistency purposes the regression suite is run with ParaDyn using 8 processors on machines with a specific processor type (currently the Intel Xeon E5530 processor). For non-parallel regression problems, i.e., the two XFEM problems, DYNA3D is used instead. When environments or platforms change, executables using the current source code and the new resource are created and the regression suite is run. If differences in answers arise, the new answers are retained provided that the differences are inconsequential. This bootstrap approach allows the test suite answers to evolve in a controlled manner with a high level of confidence. Developers also run the entire regression suite with (serial) DYNA3D. While these results normally differ from the stored (parallel) answers, abnormal termination or wildly different values are strong indicators of potential issues.« less
NASA Astrophysics Data System (ADS)
Dykstra, D.; Bockelman, B.; Blomer, J.; Herner, K.; Levshina, T.; Slyz, M.
2015-12-01
A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data is auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Relatively speaking, conditions data also tends to be relatively small per job where both event data and auxiliary data are larger per job. Unlike event data, auxiliary data comes from a limited working set of shared files. Since there is spatial locality of the auxiliary data access, the use case appears to be identical to that of the CernVM- Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called "alien cache" to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally on the site with a convenient POSIX interface. This paper discusses the details of the architecture and reports performance measurements.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dykstra, D.; Bockelman, B.; Blomer, J.
A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data is auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Relatively speaking, conditions data also tends to be relatively small per job where both event data and auxiliary data are larger per job. Unlike event data, auxiliarymore » data comes from a limited working set of shared files. Since there is spatial locality of the auxiliary data access, the use case appears to be identical to that of the CernVM- Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called 'alien cache' to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid and cached locally on the site with a convenient POSIX interface. This paper discusses the details of the architecture and reports performance measurements.« less
Benchmark Comparison of Cloud Analytics Methods Applied to Earth Observations
NASA Astrophysics Data System (ADS)
Lynnes, C.; Little, M. M.; Huang, T.; Jacob, J. C.; Yang, C. P.; Kuo, K. S.
2016-12-01
Cloud computing has the potential to bring high performance computing capabilities to the average science researcher. However, in order to take full advantage of cloud capabilities, the science data used in the analysis must often be reorganized. This typically involves sharding the data across multiple nodes to enable relatively fine-grained parallelism. This can be either via cloud-based filesystems or cloud-enabled databases such as Cassandra, Rasdaman or SciDB. Since storing an extra copy of data leads to increased cost and data management complexity, NASA is interested in determining the benefits and costs of various cloud analytics methods for real Earth Observation cases. Accordingly, NASA's Earth Science Technology Office and Earth Science Data and Information Systems project have teamed with cloud analytics practitioners to run a benchmark comparison on cloud analytics methods using the same input data and analysis algorithms. We have particularly looked at analysis algorithms that work over long time series, because these are particularly intractable for many Earth Observation datasets which typically store data with one or just a few time steps per file. This post will present side-by-side cost and performance results for several common Earth observation analysis operations.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
Sanders, David M.; Decker, Derek E.
1999-01-01
Optical patterns and lithographic techniques are used as part of a process to embed parallel and evenly spaced conductors in the non-planar surfaces of an insulator to produce high gradient insulators. The approach extends the size that high gradient insulating structures can be fabricated as well as improves the performance of those insulators by reducing the scale of the alternating parallel lines of insulator and conductor along the surface. This fabrication approach also substantially decreases the cost required to produce high gradient insulators.
Status of the calibration and alignment framework at the Belle II experiment
NASA Astrophysics Data System (ADS)
Dossett, D.; Sevior, M.; Ritter, M.; Kuhr, T.; Bilka, T.; Yaschenko, S.;
2017-10-01
The Belle II detector at the Super KEKB e+e-collider plans to take first collision data in 2018. The monetary and CPU time costs associated with storing and processing the data mean that it is crucial for the detector components at Belle II to be calibrated quickly and accurately. A fast and accurate calibration system would allow the high level trigger to increase the efficiency of event selection, and can give users analysis-quality reconstruction promptly. A flexible framework to automate the fast production of calibration constants is being developed in the Belle II Analysis Software Framework (basf2). Detector experts only need to create two components from C++ base classes in order to use the automation system. The first collects data from Belle II event data files and outputs much smaller files to pass to the second component. This runs the main calibration algorithm to produce calibration constants ready for upload into the conditions database. A Python framework coordinates the input files, order of processing, and submission of jobs. Splitting the operation into collection and algorithm processing stages allows the framework to optionally parallelize the collection stage on a batch system.
High Performance Fortran for Aerospace Applications
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Zima, Hans; Bushnell, Dennis M. (Technical Monitor)
2000-01-01
This paper focuses on the use of High Performance Fortran (HPF) for important classes of algorithms employed in aerospace applications. HPF is a set of Fortran extensions designed to provide users with a high-level interface for programming data parallel scientific applications, while delegating to the compiler/runtime system the task of generating explicitly parallel message-passing programs. We begin by providing a short overview of the HPF language. This is followed by a detailed discussion of the efficient use of HPF for applications involving multiple structured grids such as multiblock and adaptive mesh refinement (AMR) codes as well as unstructured grid codes. We focus on the data structures and computational structures used in these codes and on the high-level strategies that can be expressed in HPF to optimally exploit the parallelism in these algorithms.
NASA Astrophysics Data System (ADS)
Moon, Hongsik
What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited by the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared with the performance using benchmark software and the metric was FLoting-point Operations Per Seconds (FLOPS) which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore system? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPs and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the changing computer hardware platforms in order to provide fast, accurate and efficient solutions to large, complex electromagnetic problems. The research in this dissertation proves that the performance of parallel code is intimately related to the configuration of the computer hardware and can be maximized for different hardware platforms. To benchmark and optimize the performance of parallel CEM software, a variety of large, complex projects are created and executed on a variety of computer platforms. The computer platforms used in this research are detailed in this dissertation. The projects run as benchmarks are also described in detail and results are presented. The parameters that affect parallel CEM software on High Performance Computing Clusters (HPCC) are investigated. This research demonstrates methods to maximize the performance of parallel CEM software code.
NASA Technical Reports Server (NTRS)
Bailey, D. H.; Barszcz, E.; Barton, J. T.; Carter, R. L.; Lasinski, T. A.; Browning, D. S.; Dagum, L.; Fatoohi, R. A.; Frederickson, P. O.; Schreiber, R. S.
1991-01-01
A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers in the framework of the NASA Ames Numerical Aerodynamic Simulation (NAS) Program. These consist of five 'parallel kernel' benchmarks and three 'simulated application' benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification-all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
Portability and Cross-Platform Performance of an MPI-Based Parallel Polygon Renderer
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1999-01-01
Visualizing the results of computations performed on large-scale parallel computers is a challenging problem, due to the size of the datasets involved. One approach is to perform the visualization and graphics operations in place, exploiting the available parallelism to obtain the necessary rendering performance. Over the past several years, we have been developing algorithms and software to support visualization applications on NASA's parallel supercomputers. Our results have been incorporated into a parallel polygon rendering system called PGL. PGL was initially developed on tightly-coupled distributed-memory message-passing systems, including Intel's iPSC/860 and Paragon, and IBM's SP2. Over the past year, we have ported it to a variety of additional platforms, including the HP Exemplar, SGI Origin2OOO, Cray T3E, and clusters of Sun workstations. In implementing PGL, we have had two primary goals: cross-platform portability and high performance. Portability is important because (1) our manpower resources are limited, making it difficult to develop and maintain multiple versions of the code, and (2) NASA's complement of parallel computing platforms is diverse and subject to frequent change. Performance is important in delivering adequate rendering rates for complex scenes and ensuring that parallel computing resources are used effectively. Unfortunately, these two goals are often at odds. In this paper we report on our experiences with portability and performance of the PGL polygon renderer across a range of parallel computing platforms.
NASA Technical Reports Server (NTRS)
Hribar, Michelle R.; Frumkin, Michael; Jin, Haoqiang; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
Over the past decade, high performance computing has evolved rapidly; systems based on commodity microprocessors have been introduced in quick succession from at least seven vendors/families. Porting codes to every new architecture is a difficult problem; in particular, here at NASA, there are many large CFD applications that are very costly to port to new machines by hand. The LCM ("Legacy Code Modernization") Project is the development of an integrated parallelization environment (IPE) which performs the automated mapping of legacy CFD (Fortran) applications to state-of-the-art high performance computers. While most projects to port codes focus on the parallelization of the code, we consider porting to be an iterative process consisting of several steps: 1) code cleanup, 2) serial optimization,3) parallelization, 4) performance monitoring and visualization, 5) intelligent tools for automated tuning using performance prediction and 6) machine specific optimization. The approach for building this parallelization environment is to build the components for each of the steps simultaneously and then integrate them together. The demonstration will exhibit our latest research in building this environment: 1. Parallelizing tools and compiler evaluation. 2. Code cleanup and serial optimization using automated scripts 3. Development of a code generator for performance prediction 4. Automated partitioning 5. Automated insertion of directives. These demonstrations will exhibit the effectiveness of an automated approach for all the steps involved with porting and tuning a legacy code application for a new architecture.
A Queue Simulation Tool for a High Performance Scientific Computing Center
NASA Technical Reports Server (NTRS)
Spear, Carrie; McGalliard, James
2007-01-01
The NASA Center for Computational Sciences (NCCS) at the Goddard Space Flight Center provides high performance highly parallel processors, mass storage, and supporting infrastructure to a community of computational Earth and space scientists. Long running (days) and highly parallel (hundreds of CPUs) jobs are common in the workload. NCCS management structures batch queues and allocates resources to optimize system use and prioritize workloads. NCCS technical staff use a locally developed discrete event simulation tool to model the impacts of evolving workloads, potential system upgrades, alternative queue structures and resource allocation policies.
Bilingual parallel programming
DOE Office of Scientific and Technical Information (OSTI.GOV)
Foster, I.; Overbeek, R.
1990-01-01
Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date -- whether parallelizing compilers, language extensions, or new concurrent languages -- seem to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach providesmore » and effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.« less
Visualising Earth's Mantle based on Global Adjoint Tomography
NASA Astrophysics Data System (ADS)
Bozdag, E.; Pugmire, D.; Lefebvre, M. P.; Hill, J.; Komatitsch, D.; Peter, D. B.; Podhorszki, N.; Tromp, J.
2017-12-01
Recent advances in 3D wave propagation solvers and high-performance computing have enabled regional and global full-waveform inversions. Interpretation of tomographic models is often done on visually. Robust and efficient visualization tools are necessary to thoroughly investigate large model files, particularly at the global scale. In collaboration with Oak Ridge National Laboratory (ORNL), we have developed effective visualization tools and used for visualization of our first-generation global model, GLAD-M15 (Bozdag et al. 2016). VisIt (https://wci.llnl.gov/simulation/computer-codes/visit/) is used for initial exploration of the models and for extraction of seismological features. The broad capability of VisIt, and its demonstrated scalability proved valuable for experimenting with different visualization techniques, and in the creation of timely results. Utilizing VisIt's plugin-architecture, a data reader plugin was developed, which reads the ADIOS (https://www.olcf.ornl.gov/center-projects/adios/) format of our model files. Blender (https://www.blender.org) is used for the setup of lighting, materials, camera paths and rendering of geometry. Python scripting was used to control the orchestration of different geometries, as well as camera animation for 3D movies. While we continue producing 3D contour plots and movies for various seismic parameters to better visualize plume- and slab-like features as well as anisotropy throughout the mantle, our aim is to make visualization an integral part of our global adjoint tomography workflow to routinely produce various 2D cross-sections to facilitate examination of our models after each iteration. This will ultimately form the basis for use of pattern recognition techniques in our investigations. Simulations for global adjoint tomography are performed on ORNL's Titan system and visualization is done in parallel on ORNL's post-processing cluster Rhea.
Improvements to Nuclear Data and Its Uncertainties by Theoretical Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Danon, Yaron; Nazarewicz, Witold; Talou, Patrick
2013-02-18
This project addresses three important gaps in existing evaluated nuclear data libraries that represent a significant hindrance against highly advanced modeling and simulation capabilities for the Advanced Fuel Cycle Initiative (AFCI). This project will: Develop advanced theoretical tools to compute prompt fission neutrons and gamma-ray characteristics well beyond average spectra and multiplicity, and produce new evaluated files of U and Pu isotopes, along with some minor actinides; Perform state-of-the-art fission cross-section modeling and calculations using global and microscopic model input parameters, leading to truly predictive fission cross-sections capabilities. Consistent calculations for a suite of Pu isotopes will be performed; Implementmore » innovative data assimilation tools, which will reflect the nuclear data evaluation process much more accurately, and lead to a new generation of uncertainty quantification files. New covariance matrices will be obtained for Pu isotopes and compared to existing ones. The deployment of a fleet of safe and efficient advanced reactors that minimize radiotoxic waste and are proliferation-resistant is a clear and ambitious goal of AFCI. While in the past the design, construction and operation of a reactor were supported through empirical trials, this new phase in nuclear energy production is expected to rely heavily on advanced modeling and simulation capabilities. To be truly successful, a program for advanced simulations of innovative reactors will have to develop advanced multi-physics capabilities, to be run on massively parallel super- computers, and to incorporate adequate and precise underlying physics. And all these areas have to be developed simultaneously to achieve those ambitious goals. Of particular interest are reliable fission cross-section uncertainty estimates (including important correlations) and evaluations of prompt fission neutrons and gamma-ray spectra and uncertainties.« less
Network survivability performance (computer diskette)
NASA Astrophysics Data System (ADS)
1993-11-01
File characteristics: Data file; 1 file. Physical description: 1 computer diskette; 3 1/2 in.; high density; 2.0MB. System requirements: Mac; Word. This technical report has been developed to address the survivability of telecommunications networks including services. It responds to the need for a common understanding of, and assessment techniques for network survivability, availability, integrity, and reliability. It provides a basis for designing and operating telecommunication networks to user expectations for network survivability.
High Performance Data Distribution for Scientific Community
NASA Astrophysics Data System (ADS)
Tirado, Juan M.; Higuero, Daniel; Carretero, Jesus
2010-05-01
Institutions such as NASA, ESA or JAXA find solutions to distribute data from their missions to the scientific community, and their long term archives. This is a complex problem, as it includes a vast amount of data, several geographically distributed archives, heterogeneous architectures with heterogeneous networks, and users spread around the world. We propose a novel architecture (HIDDRA) that solves this problem aiming to reduce user intervention in data acquisition and processing. HIDDRA is a modular system that provides a highly efficient parallel multiprotocol download engine, using a publish/subscribe policy which helps the final user to obtain data of interest transparently. Our system can deal simultaneously with multiple protocols (HTTP,HTTPS, FTP, GridFTP among others) to obtain the maximum bandwidth, reducing the workload in data server and increasing flexibility. It can also provide high reliability and fault tolerance, as several sources of data can be used to perform one file download. HIDDRA architecture can be arranged into a data distribution network deployed on several sites that can cooperate to provide former features. HIDDRA has been addressed by the 2009 e-IRG Report on Data Management as a promising initiative for data interoperability. Our first prototype has been evaluated in collaboration with the ESAC centre in Villafranca del Castillo (Spain) that shows a high scalability and performance, opening a wide spectrum of opportunities. Some preliminary results have been published in the Journal of Astrophysics and Space Science [1]. [1] D. Higuero, J.M. Tirado, J. Carretero, F. Félix, and A. de La Fuente. HIDDRA: a highly independent data distribution and retrieval architecture for space observation missions. Astrophysics and Space Science, 321(3):169-175, 2009
Synthesizing parallel imaging applications using the CAP (computer-aided parallelization) tool
NASA Astrophysics Data System (ADS)
Gennart, Benoit A.; Mazzariol, Marc; Messerli, Vincent; Hersch, Roger D.
1997-12-01
Imaging applications such as filtering, image transforms and compression/decompression require vast amounts of computing power when applied to large data sets. These applications would potentially benefit from the use of parallel processing. However, dedicated parallel computers are expensive and their processing power per node lags behind that of the most recent commodity components. Furthermore, developing parallel applications remains a difficult task: writing and debugging the application is difficult (deadlocks), programs may not be portable from one parallel architecture to the other, and performance often comes short of expectations. In order to facilitate the development of parallel applications, we propose the CAP computer-aided parallelization tool which enables application programmers to specify at a high-level of abstraction the flow of data between pipelined-parallel operations. In addition, the CAP tool supports the programmer in developing parallel imaging and storage operations. CAP enables combining efficiently parallel storage access routines and image processing sequential operations. This paper shows how processing and I/O intensive imaging applications must be implemented to take advantage of parallelism and pipelining between data access and processing. This paper's contribution is (1) to show how such implementations can be compactly specified in CAP, and (2) to demonstrate that CAP specified applications achieve the performance of custom parallel code. The paper analyzes theoretically the performance of CAP specified applications and demonstrates the accuracy of the theoretical analysis through experimental measurements.
NASA Astrophysics Data System (ADS)
Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian
2018-01-01
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
Zhu, Xiang; Zhang, Dianwen
2013-01-01
We present a fast, accurate and robust parallel Levenberg-Marquardt minimization optimizer, GPU-LMFit, which is implemented on graphics processing unit for high performance scalable parallel model fitting processing. GPU-LMFit can provide a dramatic speed-up in massive model fitting analyses to enable real-time automated pixel-wise parametric imaging microscopy. We demonstrate the performance of GPU-LMFit for the applications in superresolution localization microscopy and fluorescence lifetime imaging microscopy. PMID:24130785
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
An efficient parallel algorithm for matrix-vector multiplication
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hendrickson, B.; Leland, R.; Plimpton, S.
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/[radical]p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in themore » well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.« less
Federal Register 2010, 2011, 2012, 2013, 2014
2013-02-11
... PDCI to the Ponderosa substation, or (ii) the Bonneville Power Administration (BPA) existing transmission line corridor and then running parallel to the BPA line to the Ponderosa substation; and (9...
Non-volatile main memory management methods based on a file system.
Oikawa, Shuichi
2014-01-01
There are upcoming non-volatile (NV) memory technologies that provide byte addressability and high performance. PCM, MRAM, and STT-RAM are such examples. Such NV memory can be used as storage because of its data persistency without power supply while it can be used as main memory because of its high performance that matches up with DRAM. There are a number of researches that investigated its uses for main memory and storage. They were, however, conducted independently. This paper presents the methods that enables the integration of the main memory and file system management for NV memory. Such integration makes NV memory simultaneously utilized as both main memory and storage. The presented methods use a file system as their basis for the NV memory management. We implemented the proposed methods in the Linux kernel, and performed the evaluation on the QEMU system emulator. The evaluation results show that 1) the proposed methods can perform comparably to the existing DRAM memory allocator and significantly better than the page swapping, 2) their performance is affected by the internal data structures of a file system, and 3) the data structures appropriate for traditional hard disk drives do not always work effectively for byte addressable NV memory. We also performed the evaluation of the effects caused by the longer access latency of NV memory by cycle-accurate full-system simulation. The results show that the effect on page allocation cost is limited if the increase of latency is moderate.
NASA Astrophysics Data System (ADS)
Georgiev, K.; Zlatev, Z.
2010-11-01
The Danish Eulerian Model (DEM) is an Eulerian model for studying the transport of air pollutants on large scale. Originally, the model was developed at the National Environmental Research Institute of Denmark. The model computational domain covers Europe and some neighbour parts belong to the Atlantic Ocean, Asia and Africa. If DEM model is to be applied by using fine grids, then its discretization leads to a huge computational problem. This implies that such a model as DEM must be run only on high-performance computer architectures. The implementation and tuning of such a complex large-scale model on each different computer is a non-trivial task. Here, some comparison results of running of this model on different kind of vector (CRAY C92A, Fujitsu, etc.), parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, Macintosh G4 clusters, etc.), parallel computers with shared memory (SGI Origin, SUN, etc.) and parallel computers with two levels of parallelism (IBM SMP, IBM BlueGene/P, clusters of multiprocessor nodes, etc.) will be presented. The main idea in the parallel version of DEM is domain partitioning approach. Discussions according to the effective use of the cache and hierarchical memories of the modern computers as well as the performance, speed-ups and efficiency achieved will be done. The parallel code of DEM, created by using MPI standard library, appears to be highly portable and shows good efficiency and scalability on different kind of vector and parallel computers. Some important applications of the computer model output are presented in short.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sewell, Christopher Meyer
This is a set of slides from a guest lecture for a class at the University of Texas, El Paso on visualization and data analysis for high-performance computing. The topics covered are the following: trends in high-performance computing; scientific visualization, such as OpenGL, ray tracing and volume rendering, VTK, and ParaView; data science at scale, such as in-situ visualization, image databases, distributed memory parallelism, shared memory parallelism, VTK-m, "big data", and then an analysis example.
A Domain Decomposition Parallelization of the Fast Marching Method
NASA Technical Reports Server (NTRS)
Herrmann, M.
2003-01-01
In this paper, the first domain decomposition parallelization of the Fast Marching Method for level sets has been presented. Parallel speedup has been demonstrated in both the optimal and non-optimal domain decomposition case. The parallel performance of the proposed method is strongly dependent on load balancing separately the number of nodes on each side of the interface. A load imbalance of nodes on either side of the domain leads to an increase in communication and rollback operations. Furthermore, the amount of inter-domain communication can be reduced by aligning the inter-domain boundaries with the interface normal vectors. In the case of optimal load balancing and aligned inter-domain boundaries, the proposed parallel FMM algorithm is highly efficient, reaching efficiency factors of up to 0.98. Future work will focus on the extension of the proposed parallel algorithm to higher order accuracy. Also, to further enhance parallel performance, the coupling of the domain decomposition parallelization to the G(sub 0)-based parallelization will be investigated.
Parallel, Asynchronous Executive (PAX): System concepts, facilities, and architecture
NASA Technical Reports Server (NTRS)
Jones, W. H.
1983-01-01
The Parallel, Asynchronous Executive (PAX) is a software operating system simulation that allows many computers to work on a single problem at the same time. PAX is currently implemented on a UNIVAC 1100/42 computer system. Independent UNIVAC runstreams are used to simulate independent computers. Data are shared among independent UNIVAC runstreams through shared mass-storage files. PAX has achieved the following: (1) applied several computing processes simultaneously to a single, logically unified problem; (2) resolved most parallel processor conflicts by careful work assignment; (3) resolved by means of worker requests to PAX all conflicts not resolved by work assignment; (4) provided fault isolation and recovery mechanisms to meet the problems of an actual parallel, asynchronous processing machine. Additionally, one real-life problem has been constructed for the PAX environment. This is CASPER, a collection of aerodynamic and structural dynamic problem simulation routines. CASPER is not discussed in this report except to provide examples of parallel-processing techniques.
Performance of a 300 Mbps 1:16 serial/parallel optoelectronic receiver module
NASA Technical Reports Server (NTRS)
Richard, M. A.; Claspy, P. C.; Bhasin, K. B.; Bendett, M. B.
1990-01-01
Optical interconnects are being considered for the high speed distribution of multiplexed control signals in GaAs monolithic microwave integrated circuit (MMIC) based phased array antennas. The performance of a hybrid GaAs optoelectronic integrated circuit (OEIC) is described, as well as its design and fabrication. The OEIC converts a 16-bit serial optical input to a 16 parallel line electrical output using an on-board 1:16 demultiplexer and operates at data rates as high as 30b Mbps. The performance characteristics and potential applications of the device are presented.
NASA Technical Reports Server (NTRS)
Schutz, Bob E.; Baker, Gregory A.
1997-01-01
The recovery of a high resolution geopotential from satellite gradiometer observations motivates the examination of high performance computational techniques. The primary subject matter addresses specifically the use of satellite gradiometer and GPS observations to form and invert the normal matrix associated with a large degree and order geopotential solution. Memory resident and out-of-core parallel linear algebra techniques along with data parallel batch algorithms form the foundation of the least squares application structure. A secondary topic includes the adoption of object oriented programming techniques to enhance modularity and reusability of code. Applications implementing the parallel and object oriented methods successfully calculate the degree variance for a degree and order 110 geopotential solution on 32 processors of the Cray T3E. The memory resident gradiometer application exhibits an overall application performance of 5.4 Gflops, and the out-of-core linear solver exhibits an overall performance of 2.4 Gflops. The combination solution derived from a sun synchronous gradiometer orbit produce average geoid height variances of 17 millimeters.
NASA Astrophysics Data System (ADS)
Baker, Gregory Allen
The recovery of a high resolution geopotential from satellite gradiometer observations motivates the examination of high performance computational techniques. The primary subject matter addresses specifically the use of satellite gradiometer and GPS observations to form and invert the normal matrix associated with a large degree and order geopotential solution. Memory resident and out-of-core parallel linear algebra techniques along with data parallel batch algorithms form the foundation of the least squares application structure. A secondary topic includes the adoption of object oriented programming techniques to enhance modularity and reusability of code. Applications implementing the parallel and object oriented methods successfully calculate the degree variance for a degree and order 110 geopotential solution on 32 processors of the Cray T3E. The memory resident gradiometer application exhibits an overall application performance of 5.4 Gflops, and the out-of-core linear solver exhibits an overall performance of 2.4 Gflops. The combination solution derived from a sun synchronous gradiometer orbit produce average geoid height variances of 17 millimeters.
Efficient implementation of parallel three-dimensional FFT on clusters of PCs
NASA Astrophysics Data System (ADS)
Takahashi, Daisuke
2003-05-01
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
Choi, Hyungsuk; Choi, Woohyuk; Quan, Tran Minh; Hildebrand, David G C; Pfister, Hanspeter; Jeong, Won-Ki
2014-12-01
As the size of image data from microscopes and telescopes increases, the need for high-throughput processing and visualization of large volumetric data has become more pressing. At the same time, many-core processors and GPU accelerators are commonplace, making high-performance distributed heterogeneous computing systems affordable. However, effectively utilizing GPU clusters is difficult for novice programmers, and even experienced programmers often fail to fully leverage the computing power of new parallel architectures due to their steep learning curve and programming complexity. In this paper, we propose Vivaldi, a new domain-specific language for volume processing and visualization on distributed heterogeneous computing systems. Vivaldi's Python-like grammar and parallel processing abstractions provide flexible programming tools for non-experts to easily write high-performance parallel computing code. Vivaldi provides commonly used functions and numerical operators for customized visualization and high-throughput image processing applications. We demonstrate the performance and usability of Vivaldi on several examples ranging from volume rendering to image segmentation.
A parallel implementation of a multisensor feature-based range-estimation method
NASA Technical Reports Server (NTRS)
Suorsa, Raymond E.; Sridhar, Banavar
1993-01-01
There are many proposed vision based methods to perform obstacle detection and avoidance for autonomous or semi-autonomous vehicles. All methods, however, will require very high processing rates to achieve real time performance. A system capable of supporting autonomous helicopter navigation will need to extract obstacle information from imagery at rates varying from ten frames per second to thirty or more frames per second depending on the vehicle speed. Such a system will need to sustain billions of operations per second. To reach such high processing rates using current technology, a parallel implementation of the obstacle detection/ranging method is required. This paper describes an efficient and flexible parallel implementation of a multisensor feature-based range-estimation algorithm, targeted for helicopter flight, realized on both a distributed-memory and shared-memory parallel computer.
A design methodology for portable software on parallel computers
NASA Technical Reports Server (NTRS)
Nicol, David M.; Miller, Keith W.; Chrisman, Dan A.
1993-01-01
This final report for research that was supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. Trying to keep one's software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem of the proposed research is to create a design methodology that helps designers to more precisely control both portability and hardware-specific programming details. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks are specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation. We have implemented and measured the performance of a portion of this subsystem on the Intel iPSC/2 parallel computer. These results are provided in section four. Our future work is summarized in section five, our acknowledgements are stated in section six, and references for published papers associated with NAG-1-995 are provided in section seven.
NASA Technical Reports Server (NTRS)
1991-01-01
Various papers on supercomputing are presented. The general topics addressed include: program analysis/data dependence, memory access, distributed memory code generation, numerical algorithms, supercomputer benchmarks, latency tolerance, parallel programming, applications, processor design, networks, performance tools, mapping and scheduling, characterization affecting performance, parallelism packaging, computing climate change, combinatorial algorithms, hardware and software performance issues, system issues. (No individual items are abstracted in this volume)
New algorithms for processing time-series big EEG data within mobile health monitoring systems.
Serhani, Mohamed Adel; Menshawy, Mohamed El; Benharref, Abdelghani; Harous, Saad; Navaz, Alramzana Nujum
2017-10-01
Recent advances in miniature biomedical sensors, mobile smartphones, wireless communications, and distributed computing technologies provide promising techniques for developing mobile health systems. Such systems are capable of monitoring epileptic seizures reliably, which are classified as chronic diseases. Three challenging issues raised in this context with regard to the transformation, compression, storage, and visualization of big data, which results from a continuous recording of epileptic seizures using mobile devices. In this paper, we address the above challenges by developing three new algorithms to process and analyze big electroencephalography data in a rigorous and efficient manner. The first algorithm is responsible for transforming the standard European Data Format (EDF) into the standard JavaScript Object Notation (JSON) and compressing the transformed JSON data to decrease the size and time through the transfer process and to increase the network transfer rate. The second algorithm focuses on collecting and storing the compressed files generated by the transformation and compression algorithm. The collection process is performed with respect to the on-the-fly technique after decompressing files. The third algorithm provides relevant real-time interaction with signal data by prospective users. It particularly features the following capabilities: visualization of single or multiple signal channels on a smartphone device and query data segments. We tested and evaluated the effectiveness of our approach through a software architecture model implementing a mobile health system to monitor epileptic seizures. The experimental findings from 45 experiments are promising and efficiently satisfy the approach's objectives in a price of linearity. Moreover, the size of compressed JSON files and transfer times are reduced by 10% and 20%, respectively, while the average total time is remarkably reduced by 67% through all performed experiments. Our approach successfully develops efficient algorithms in terms of processing time, memory usage, and energy consumption while maintaining a high scalability of the proposed solution. Our approach efficiently supports data partitioning and parallelism relying on the MapReduce platform, which can help in monitoring and automatic detection of epileptic seizures. Copyright © 2017 Elsevier B.V. All rights reserved.
ParallABEL: an R library for generalized parallelization of genome-wide association studies.
Sangket, Unitsa; Mahasirimongkol, Surakameth; Chantratita, Wasun; Tandayya, Pichaya; Aulchenko, Yurii S
2010-04-29
Genome-Wide Association (GWA) analysis is a powerful method for identifying loci associated with complex traits and drug response. Parts of GWA analyses, especially those involving thousands of individuals and consuming hours to months, will benefit from parallel computation. It is arduous acquiring the necessary programming skills to correctly partition and distribute data, control and monitor tasks on clustered computers, and merge output files. Most components of GWA analysis can be divided into four groups based on the types of input data and statistical outputs. The first group contains statistics computed for a particular Single Nucleotide Polymorphism (SNP), or trait, such as SNP characterization statistics or association test statistics. The input data of this group includes the SNPs/traits. The second group concerns statistics characterizing an individual in a study, for example, the summary statistics of genotype quality for each sample. The input data of this group includes individuals. The third group consists of pair-wise statistics derived from analyses between each pair of individuals in the study, for example genome-wide identity-by-state or genomic kinship analyses. The input data of this group includes pairs of SNPs/traits. The final group concerns pair-wise statistics derived for pairs of SNPs, such as the linkage disequilibrium characterisation. The input data of this group includes pairs of individuals. We developed the ParallABEL library, which utilizes the Rmpi library, to parallelize these four types of computations. ParallABEL library is not only aimed at GenABEL, but may also be employed to parallelize various GWA packages in R. The data set from the North American Rheumatoid Arthritis Consortium (NARAC) includes 2,062 individuals with 545,080, SNPs' genotyping, was used to measure ParallABEL performance. Almost perfect speed-up was achieved for many types of analyses. For example, the computing time for the identity-by-state matrix was linearly reduced from approximately eight hours to one hour when ParallABEL employed eight processors. Executing genome-wide association analysis using the ParallABEL library on a computer cluster is an effective way to boost performance, and simplify the parallelization of GWA studies. ParallABEL is a user-friendly parallelization of GenABEL.
Modelling parallel programs and multiprocessor architectures with AXE
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Fineman, Charles E.
1991-01-01
AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate for parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior is described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Ye; Ma, Xiaosong; Liu, Qing Gary
2015-01-01
Parallel application benchmarks are indispensable for evaluating/optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time-and labor-intensive to create. Real applications themselves, while offering most accurate performance evaluation, are expensive to compile, port, reconfigure, and often plainly inaccessible due to security or ownership concerns. This work contributes APPRIME, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication-I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters tomore » create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPRIME benchmarks. They retain the original applications' performance characteristics, in particular the relative performance across platforms.« less
The INFN-CNAF Tier-1 GEMSS Mass Storage System and database facility activity
NASA Astrophysics Data System (ADS)
Ricci, Pier Paolo; Cavalli, Alessandro; Dell'Agnello, Luca; Favaro, Matteo; Gregori, Daniele; Prosperini, Andrea; Pezzi, Michele; Sapunenko, Vladimir; Zizzi, Giovanni; Vagnoni, Vincenzo
2015-05-01
The consolidation of Mass Storage services at the INFN-CNAF Tier1 Storage department that has occurred during the last 5 years, resulted in a reliable, high performance and moderately easy-to-manage facility that provides data access, archive, backup and database services to several different use cases. At present, the GEMSS Mass Storage System, developed and installed at CNAF and based upon an integration between the IBM GPFS parallel filesystem and the Tivoli Storage Manager (TSM) tape management software, is one of the largest hierarchical storage sites in Europe. It provides storage resources for about 12% of LHC data, as well as for data of other non-LHC experiments. Files are accessed using standard SRM Grid services provided by the Storage Resource Manager (StoRM), also developed at CNAF. Data access is also provided by XRootD and HTTP/WebDaV endpoints. Besides these services, an Oracle database facility is in production characterized by an effective level of parallelism, redundancy and availability. This facility is running databases for storing and accessing relational data objects and for providing database services to the currently active use cases. It takes advantage of several Oracle technologies, like Real Application Cluster (RAC), Automatic Storage Manager (ASM) and Enterprise Manager centralized management tools, together with other technologies for performance optimization, ease of management and downtime reduction. The aim of the present paper is to illustrate the state-of-the-art of the INFN-CNAF Tier1 Storage department infrastructures and software services, and to give a brief outlook to forthcoming projects. A description of the administrative, monitoring and problem-tracking tools that play a primary role in managing the whole storage framework is also given.
QoS support for end users of I/O-intensive applications using shared storage systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Davis, Marion Kei; Zhang, Xuechen; Jiang, Song
2011-01-19
I/O-intensive applications are becoming increasingly common on today's high-performance computing systems. While performance of compute-bound applications can be effectively guaranteed with techniques such as space sharing or QoS-aware process scheduling, it remains a challenge to meet QoS requirements for end users of I/O-intensive applications using shared storage systems because it is difficult to differentiate I/O services for different applications with individual quality requirements. Furthermore, it is difficult for end users to accurately specify performance goals to the storage system using I/O-related metrics such as request latency or throughput. As access patterns, request rates, and the system workload change in time,more » a fixed I/O performance goal, such as bounds on throughput or latency, can be expensive to achieve and may not lead to a meaningful performance guarantees such as bounded program execution time. We propose a scheme supporting end-users QoS goals, specified in terms of program execution time, in shared storage environments. We automatically translate the users performance goals into instantaneous I/O throughput bounds using a machine learning technique, and use dynamically determined service time windows to efficiently meet the throughput bounds. We have implemented this scheme in the PVFS2 parallel file system and have conducted an extensive evaluation. Our results show that this scheme can satisfy realistic end-user QoS requirements by making highly efficient use of the I/O resources. The scheme seeks to balance programs attainment of QoS requirements, and saves as much of the remaining I/O capacity as possible for best-effort programs.« less
High-performance metadata indexing and search in petascale data storage systems
NASA Astrophysics Data System (ADS)
Leung, A. W.; Shao, M.; Bisson, T.; Pasupathy, S.; Miller, E. L.
2008-07-01
Large-scale storage systems used for scientific applications can store petabytes of data and billions of files, making the organization and management of data in these systems a difficult, time-consuming task. The ability to search file metadata in a storage system can address this problem by allowing scientists to quickly navigate experiment data and code while allowing storage administrators to gather the information they need to properly manage the system. In this paper, we present Spyglass, a file metadata search system that achieves scalability by exploiting storage system properties, providing the scalability that existing file metadata search tools lack. In doing so, Spyglass can achieve search performance up to several thousand times faster than existing database solutions. We show that Spyglass enables important functionality that can aid data management for scientists and storage administrators.
Computational Performance of a Parallelized Three-Dimensional High-Order Spectral Element Toolbox
NASA Astrophysics Data System (ADS)
Bosshard, Christoph; Bouffanais, Roland; Clémençon, Christian; Deville, Michel O.; Fiétier, Nicolas; Gruber, Ralf; Kehtari, Sohrab; Keller, Vincent; Latt, Jonas
In this paper, a comprehensive performance review of an MPI-based high-order three-dimensional spectral element method C++ toolbox is presented. The focus is put on the performance evaluation of several aspects with a particular emphasis on the parallel efficiency. The performance evaluation is analyzed with help of a time prediction model based on a parameterization of the application and the hardware resources. A tailor-made CFD computation benchmark case is introduced and used to carry out this review, stressing the particular interest for clusters with up to 8192 cores. Some problems in the parallel implementation have been detected and corrected. The theoretical complexities with respect to the number of elements, to the polynomial degree, and to communication needs are correctly reproduced. It is concluded that this type of code has a nearly perfect speed up on machines with thousands of cores, and is ready to make the step to next-generation petaflop machines.
Newman, D M; Hawley, R W; Goeckel, D L; Crawford, R D; Abraham, S; Gallagher, N C
1993-05-10
An efficient storage format was developed for computer-generated holograms for use in electron-beam lithography. This method employs run-length encoding and Lempel-Ziv-Welch compression and succeeds in exposing holograms that were previously infeasible owing to the hologram's tremendous pattern-data file size. These holograms also require significant computation; thus the algorithm was implemented on a parallel computer, which improved performance by 2 orders of magnitude. The decompression algorithm was integrated into the Cambridge electron-beam machine's front-end processor.Although this provides much-needed ability, some hardware enhancements will be required in the future to overcome inadequacies in the current front-end processor that result in a lengthy exposure time.
SU-F-T-266: Dynalogs Based Evaluation of Different Dose Rate IMRT Using DVH and Gamma Index
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ahmed, S; Ahmed, S; Ahmed, F
2016-06-15
Purpose: This work investigates the impact of low and high dose rate on IMRT through Dynalogs by evaluating Gamma Index and Dose Volume Histogram. Methods: The Eclipse™ treatment planning software was used to generate plans on prostate and head and neck sites. A range of dose rates 300 MU/min and 600 MU/min were applied to each plan in order to investigate their effect on the beam ON time, efficiency and accuracy. Each plan had distinct monitor units per fraction, delivery time, mean dose rate and leaf speed. The DVH data was used in the assessment of the conformity and planmore » quality.The treatments were delivered on Varian™ Clinac 2100C accelerator equipped with 120 leaf millennium MLC. Dynalogs of each plan were analyzed by MATLAB™ program. Fluence measurements were performed using the Sun Nuclear™ 2D diode array and results were assessed, based on Gamma analysis of dose fluence maps, beam delivery statistics and Dynalogs data. Results: Minor differences found by adjusted R-squared analysis of DVH’s for all the plans with different dose rates. It has been also found that more and larger fields have greater time reduction at high dose rate and there was a sharp decrease in number of control points observed in dynalog files by switching dose rate from 300 MU/min to 600 MU/min. Gamma Analysis of all plans passes the confidence limit of ≥95% with greater number of passing points in 300 MU/min dose rate plans. Conclusion: The dynalog files are compatible tool for software based IMRT QA. It can work perfectly parallel to measurement based QA setup and stand-by procedure for pre and post delivery of treatment plan.« less
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.
Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim
2016-08-01
In this paper, we propose an GPU based Cloud system for high-performance arrhythmia detection. Pan-Tompkins algorithm is used for QRS detection and we optimized beat classification algorithm with K-Nearest Neighbor (K-NN). To support high performance beat classification on the system, we parallelized beat classification algorithm with CUDA to execute the algorithm on virtualized GPU devices on the Cloud system. MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved about 93.5% of detection rate which is comparable to previous researches while our algorithm shows 2.5 times faster execution time compared to CPU only detection algorithm.
NASA Technical Reports Server (NTRS)
Campbell, David; Wysong, Ingrid; Kaplan, Carolyn; Mott, David; Wadsworth, Dean; VanGilder, Douglas
2000-01-01
An AFRL/NRL team has recently been selected to develop a scalable, parallel, reacting, multidimensional (SUPREM) Direct Simulation Monte Carlo (DSMC) code for the DoD user community under the High Performance Computing Modernization Office (HPCMO) Common High Performance Computing Software Support Initiative (CHSSI). This paper will introduce the JANNAF Exhaust Plume community to this three-year development effort and present the overall goals, schedule, and current status of this new code.
pWeb: A High-Performance, Parallel-Computing Framework for Web-Browser-Based Medical Simulation.
Halic, Tansel; Ahn, Woojin; De, Suvranu
2014-01-01
This work presents a pWeb - a new language and compiler for parallelization of client-side compute intensive web applications such as surgical simulations. The recently introduced HTML5 standard has enabled creating unprecedented applications on the web. Low performance of the web browser, however, remains the bottleneck of computationally intensive applications including visualization of complex scenes, real time physical simulations and image processing compared to native ones. The new proposed language is built upon web workers for multithreaded programming in HTML5. The language provides fundamental functionalities of parallel programming languages as well as the fork/join parallel model which is not supported by web workers. The language compiler automatically generates an equivalent parallel script that complies with the HTML5 standard. A case study on realistic rendering for surgical simulations demonstrates enhanced performance with a compact set of instructions.
Cloud Engineering Principles and Technology Enablers for Medical Image Processing-as-a-Service.
Bao, Shunxing; Plassard, Andrew J; Landman, Bennett A; Gokhale, Aniruddha
2017-04-01
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop's distributed file system. Despite this promise, HBase's load distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). This paper makes two contributions to address these concerns by describing key cloud engineering principles and technology enhancements we made to the Apache Hadoop ecosystem for medical imaging applications. First, we propose a row-key design for HBase, which is a necessary step that is driven by the hierarchical organization of imaging data. Second, we propose a novel data allocation policy within HBase to strongly enforce collocation of hierarchically related imaging data. The proposed enhancements accelerate data processing by minimizing network usage and localizing processing to machines where the data already exist. Moreover, our approach is amenable to the traditional scan, subject, and project-level analysis procedures, and is compatible with standard command line/scriptable image processing software. Experimental results for an illustrative sample of imaging data reveals that our new HBase policy results in a three-fold time improvement in conversion of classic DICOM to NiFTI file formats when compared with the default HBase region split policy, and nearly a six-fold improvement over a commonly available network file system (NFS) approach even for relatively small file sets. Moreover, file access latency is lower than network attached storage.
Runtime Detection of C-Style Errors in UPC Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pirkelbauer, P; Liao, C; Panas, T
2011-09-29
Unified Parallel C (UPC) extends the C programming language (ISO C 99) with explicit parallel programming support for the partitioned global address space (PGAS), which provides a global memory space with localized partitions to each thread. Like its ancestor C, UPC is a low-level language that emphasizes code efficiency over safety. The absence of dynamic (and static) safety checks allows programmer oversights and software flaws that can be hard to spot. In this paper, we present an extension of a dynamic analysis tool, ROSE-Code Instrumentation and Runtime Monitor (ROSECIRM), for UPC to help programmers find C-style errors involving the globalmore » address space. Built on top of the ROSE source-to-source compiler infrastructure, the tool instruments source files with code that monitors operations and keeps track of changes to the system state. The resulting code is linked to a runtime monitor that observes the program execution and finds software defects. We describe the extensions to ROSE-CIRM that were necessary to support UPC. We discuss complications that arise from parallel code and our solutions. We test ROSE-CIRM against a runtime error detection test suite, and present performance results obtained from running error-free codes. ROSE-CIRM is released as part of the ROSE compiler under a BSD-style open source license.« less
Configuration Management File Manager Developed for Numerical Propulsion System Simulation
NASA Technical Reports Server (NTRS)
Follen, Gregory J.
1997-01-01
One of the objectives of the High Performance Computing and Communication Project's (HPCCP) Numerical Propulsion System Simulation (NPSS) is to provide a common and consistent way to manage applications, data, and engine simulations. The NPSS Configuration Management (CM) File Manager integrated with the Common Desktop Environment (CDE) window management system provides a common look and feel for the configuration management of data, applications, and engine simulations for U.S. engine companies. In addition, CM File Manager provides tools to manage a simulation. Features include managing input files, output files, textual notes, and any other material normally associated with simulation. The CM File Manager includes a generic configuration management Application Program Interface (API) that can be adapted for the configuration management repositories of any U.S. engine company.
Announcements | High-Performance Computing | NREL
12th from 12 - 3 PM. Continue reading Data Transfer Queue March 08, 2018 A new queue "data , scratch) and MSS. Continue reading Purge-Alert January 11, 2018 We now have a notification method that . Continue reading Please Move Inactive Files Off the /projects File System January 11, 2018 The /projects
Windows VPN Set Up | High-Performance Computing | NREL
it in your My Documents folder Configure the client software using that conf file Start the TEXT NEEDED Configure the Client Software Start the Endian Connect App. You'll configure the connection using the hpcvpn-win.conf file, uncheck the "save password" link, and add your UserID. Start
Shah, Dipali Yogesh; Wadekar, Swati Ishwara; Dadpe, Ashwini Manish; Jadhav, Ganesh Ranganath; Choudhary, Lalit Jayant; Kalra, Dheeraj Deepak
2017-01-01
The purpose of this study was to compare and evaluate the shaping ability of ProTaper (PT) and Self-Adjusting File (SAF) system using cone-beam computed tomography (CBCT) to assess their performance in oval-shaped root canals. Sixty-two mandibular premolars with single oval canals were divided into two experimental groups ( n = 31) according to the systems used: Group I - PT and Group II - SAF. Canals were evaluated before and after instrumentation using CBCT to assess centering ratio and canal transportation at three levels. Data were statistically analyzed using one-way analysis of variance, post hoc Tukey's test, and t -test. The SAF showed better centering ability and lesser canal transportation than the PT only in the buccolingual plane at 6 and 9 mm levels. The shaping ability of the PT was best in the apical third in both the planes. The SAF had statistically significant better centering and lesser canal transportation in the buccolingual as compared to the mesiodistal plane at the middle and coronal levels. The SAF produced significantly less transportation and remained centered than the PT at the middle and coronal levels in the buccolingual plane of oval canals. In the mesiodistal plane, the performance of both the systems was parallel.
Kepper, Nick; Ettig, Ramona; Dickmann, Frank; Stehr, Rene; Grosveld, Frank G; Wedemann, Gero; Knoch, Tobias A
2010-01-01
Especially in the life-science and the health-care sectors the huge IT requirements are imminent due to the large and complex systems to be analysed and simulated. Grid infrastructures play here a rapidly increasing role for research, diagnostics, and treatment, since they provide the necessary large-scale resources efficiently. Whereas grids were first used for huge number crunching of trivially parallelizable problems, increasingly parallel high-performance computing is required. Here, we show for the prime example of molecular dynamic simulations how the presence of large grid clusters including very fast network interconnects within grid infrastructures allows now parallel high-performance grid computing efficiently and thus combines the benefits of dedicated super-computing centres and grid infrastructures. The demands for this service class are the highest since the user group has very heterogeneous requirements: i) two to many thousands of CPUs, ii) different memory architectures, iii) huge storage capabilities, and iv) fast communication via network interconnects, are all needed in different combinations and must be considered in a highly dedicated manner to reach highest performance efficiency. Beyond, advanced and dedicated i) interaction with users, ii) the management of jobs, iii) accounting, and iv) billing, not only combines classic with parallel high-performance grid usage, but more importantly is also able to increase the efficiency of IT resource providers. Consequently, the mere "yes-we-can" becomes a huge opportunity like e.g. the life-science and health-care sectors as well as grid infrastructures by reaching higher level of resource efficiency.
Guo, Fei; Kubis, Peter; Li, Ning; Przybilla, Thomas; Matt, Gebhard; Stubhan, Tobias; Ameri, Tayebeh; Butz, Benjamin; Spiecker, Erdmann; Forberich, Karen; Brabec, Christoph J
2014-12-23
Tandem architecture is the most relevant concept to overcome the efficiency limit of single-junction photovoltaic solar cells. Series-connected tandem polymer solar cells (PSCs) have advanced rapidly during the past decade. In contrast, the development of parallel-connected tandem cells is lagging far behind due to the big challenge in establishing an efficient interlayer with high transparency and high in-plane conductivity. Here, we report all-solution fabrication of parallel tandem PSCs using silver nanowires as intermediate charge collecting electrode. Through a rational interface design, a robust interlayer is established, enabling the efficient extraction and transport of electrons from subcells. The resulting parallel tandem cells exhibit high fill factors of ∼60% and enhanced current densities which are identical to the sum of the current densities of the subcells. These results suggest that solution-processed parallel tandem configuration provides an alternative avenue toward high performance photovoltaic devices.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sadayappan, Ponnuswamy
Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. We propose a new approach to the data and work distribution model provided by system software based on the unifying formalism of an abstract file system. The proposed hierarchical data model providesmore » simple, familiar visibility and access to data structures through the file system hierarchy, while providing fault tolerance through selective redundancy. The hierarchical task model features work queues whose form and organization are represented as file system objects. Data and work are both first class entities. By exposing the relationships between data and work to the runtime system, information is available to optimize execution time and provide fault tolerance. The data distribution scheme provides replication (where desirable and possible) for fault tolerance and efficiency, and it is hierarchical to make it possible to take advantage of locality. The user, tools, and applications, including legacy applications, can interface with the data, work queues, and one another through the abstract file model. This runtime environment will provide multiple interfaces to support traditional Message Passing Interface applications, languages developed under DARPA's High Productivity Computing Systems program, as well as other, experimental programming models. We will validate our runtime system with pilot codes on existing platforms and will use simulation to validate for exascale-class platforms. In this final report, we summarize research results from the work done at the Ohio State University towards the larger goals of the project listed above.« less
NASA Astrophysics Data System (ADS)
Wei, Xiaohui; Li, Weishan; Tian, Hailong; Li, Hongliang; Xu, Haixiao; Xu, Tianfu
2015-07-01
The numerical simulation of multiphase flow and reactive transport in the porous media on complex subsurface problem is a computationally intensive application. To meet the increasingly computational requirements, this paper presents a parallel computing method and architecture. Derived from TOUGHREACT that is a well-established code for simulating subsurface multi-phase flow and reactive transport problems, we developed a high performance computing THC-MP based on massive parallel computer, which extends greatly on the computational capability for the original code. The domain decomposition method was applied to the coupled numerical computing procedure in the THC-MP. We designed the distributed data structure, implemented the data initialization and exchange between the computing nodes and the core solving module using the hybrid parallel iterative and direct solver. Numerical accuracy of the THC-MP was verified through a CO2 injection-induced reactive transport problem by comparing the results obtained from the parallel computing and sequential computing (original code). Execution efficiency and code scalability were examined through field scale carbon sequestration applications on the multicore cluster. The results demonstrate successfully the enhanced performance using the THC-MP on parallel computing facilities.
5 CFR 293.403 - Contents of employee performance files.
Code of Federal Regulations, 2011 CFR
2011-01-01
... REGULATIONS PERSONNEL RECORDS Employee Performance File System Records § 293.403 Contents of employee performance files. (a) A decision on what constitutes a performance-related document within the meaning of... 5 Administrative Personnel 1 2011-01-01 2011-01-01 false Contents of employee performance files...
5 CFR 293.403 - Contents of employee performance files.
Code of Federal Regulations, 2013 CFR
2013-01-01
... REGULATIONS PERSONNEL RECORDS Employee Performance File System Records § 293.403 Contents of employee performance files. (a) A decision on what constitutes a performance-related document within the meaning of... 5 Administrative Personnel 1 2013-01-01 2013-01-01 false Contents of employee performance files...
5 CFR 293.403 - Contents of employee performance files.
Code of Federal Regulations, 2014 CFR
2014-01-01
... REGULATIONS PERSONNEL RECORDS Employee Performance File System Records § 293.403 Contents of employee performance files. (a) A decision on what constitutes a performance-related document within the meaning of... 5 Administrative Personnel 1 2014-01-01 2014-01-01 false Contents of employee performance files...
5 CFR 293.403 - Contents of employee performance files.
Code of Federal Regulations, 2012 CFR
2012-01-01
... REGULATIONS PERSONNEL RECORDS Employee Performance File System Records § 293.403 Contents of employee performance files. (a) A decision on what constitutes a performance-related document within the meaning of... 5 Administrative Personnel 1 2012-01-01 2012-01-01 false Contents of employee performance files...
Approaches in highly parameterized inversion - GENIE, a general model-independent TCP/IP run manager
Muffels, Christopher T.; Schreuder, Willem A.; Doherty, John E.; Karanovic, Marinko; Tonkin, Matthew J.; Hunt, Randall J.; Welter, David E.
2012-01-01
GENIE is a model-independent suite of programs that can be used to generally distribute, manage, and execute multiple model runs via the TCP/IP infrastructure. The suite consists of a file distribution interface, a run manage, a run executer, and a routine that can be compiled as part of a program and used to exchange model runs with the run manager. Because communication is via a standard protocol (TCP/IP), any computer connected to the Internet can serve in any of the capacities offered by this suite. Model independence is consistent with the existing template and instruction file protocols of the widely used PEST parameter estimation program. This report describes (1) the problem addressed; (2) the approach used by GENIE to queue, distribute, and retrieve model runs; and (3) user instructions, classes, and functions developed. It also includes (4) an example to illustrate the linking of GENIE with Parallel PEST using the interface routine.
DOE Office of Scientific and Technical Information (OSTI.GOV)
2016-09-12
Mcqueuer is a simple tool that allows anyone from researchers to experienced developers to create multi-node/multi-core jobs by simply creating a file with a list of commands. Users simply combine tasks, which would otherwise each be their own job on the cluster, into a single file that is given to Mcqueuer. Mcqueuer then does the heavy lifting required to process the tasks in parallel in a single multi-node job. In addition, Mcqueuer provides load-balancing, which frees the user from having to worry about complex memory and CPU considerations, and instead focus on the processing itself.
Evaluation of Contact Friction in Fracture of Rotationally Bent Nitinol Endodontic Files
NASA Astrophysics Data System (ADS)
Haimed, Tariq Abu
2011-12-01
The high flexibility of rotary Nitinol (Ni-Ti) files has helped clinicians perform root canal treatments with fewer technical errors than seen with stainless steel files. However, intracanal file fracture can occur, compromising the outcome of the treatment. Ni-Ti file fracture incidence is roughly around 4% amongst specialists and higher amongst general practitioners. Therefore, eliminating or reducing this problem should improve patient care. The aim of this project was to isolate and examine the role of friction between files and the canal walls of the glass tube model, and bending-related maximum strain amplitudes, on Ni-Ti file lifetimes-tofracture in the presence of different irrigant solutions and file coatings. A specifically designed device was used to test over 300 electropolished EndoSequenceRTM Ni-Ti files for number of cycles to failure (NCF) in smooth, bent glass tube models at 45 and 60 degrees during dry, coated and liquid-lubricated rotation at 600rpm. Fractured files were examined under Scanning Electron Microscopy (SEM) afterwards. Four different file sizes 25.04, 25.06, 35.04, 35.06 (diameter in mm/taper %) and six surface modification conditions were used independently. These conditions included, three solutions; (1) a surfactant-based solution, Surface-Active-Displacement-Solution (SADS), (2) a mouth wash proven to remove biofilms, Delmopinol 1%(DEL), and (3) Bleach 6% (vol.%), the most common antibacterial endodontic irrigant solution. The conditions also included two low-friction silane-based coating groups, 3-Hepta-fluoroisopropyl-propoxymethyl-dichlorosilane (3-HEPT) and Octadecyltrichlorosilane (ODS), in addition to an as-received file control group (Dry). The coefficient of friction (CF) between the file and the canal walls for each condition was measured as well as the surface tension of the irrigant solutions and the critical surface tension of the coated and uncoated files by contact angle measurements. The radius of curvature and maximum strain amplitude (MSA) for each file size were determined based on images of the files inside the glass tubes. The force of insertion for each file type under each condition was also measured inside 45 and 60 degree glass tube paths, static and while dynamic. The results showed that NCF of Ni-Ti files is strongly inversely related to the CF which ranged from 0.15 for ODS and 3-HEPT coated files to 0.43 for irrigant bleach. High CF (in the presence of bleach) significantly reduced the NCF. Conversely, lower CF (in the presence of other solutions and file coatings) resulted in significantly higher NCF. CF was found to be directly related to the surface tension of the media used. Similarly, high MSA typical of low radius of curvature and high bending angle significantly diminished the fatigue life of Ni-Ti files. The integral of the force of insertion versus time curve was the highest for bleach irrigation which also showed the highest CF. Scanning electron microscope inspection of file fracture surfaces illustrated a 2-step progressive failure mode characterized by creation of a smooth initial fatigue area (striation marks) followed by catastrophic ductile fracture (dimple area) when the intact file shaft area was sufficiently reduced. The bleach-lubricated files failed earlier and with a smaller fatigue area (23%) than all other groups (31-35%) indicating premature fracture in the presence of higher frictional forces. The acquired data demonstrate that the combination of low MSA and low CF (by using coatings or solutions with low surface tension), related to the magnitude of the superficial drag force, can lead to statistically longer rotational bending lifetimes for Ni-Ti files. Based on the data of this study, lubricant solutions with low surface tension could significantly improve the fracture life of Ni-Ti files in root canal glass model. Laboratory testing using natural teeth should be performed to evaluate the effect of using such solutions on the fatigue life of Ni-Ti files.
Relation of Parallel Discrete Event Simulation algorithms with physical models
NASA Astrophysics Data System (ADS)
Shchur, L. N.; Shchur, L. V.
2015-09-01
We extend concept of local simulation times in parallel discrete event simulation (PDES) in order to take into account architecture of the current hardware and software in high-performance computing. We shortly review previous research on the mapping of PDES on physical problems, and emphasise how physical results may help to predict parallel algorithms behaviour.
A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing
Li, Zhixin; Su, Dandan; Zhu, Haijiang; Li, Wei; Zhang, Fan; Li, Ruirui
2017-01-01
Synthetic Aperture Radar (SAR) raw data simulation is a fundamental problem in radar system design and imaging algorithm research. The growth of surveying swath and resolution results in a significant increase in data volume and simulation period, which can be considered to be a comprehensive data intensive and computing intensive issue. Although several high performance computing (HPC) methods have demonstrated their potential for accelerating simulation, the input/output (I/O) bottleneck of huge raw data has not been eased. In this paper, we propose a cloud computing based SAR raw data simulation algorithm, which employs the MapReduce model to accelerate the raw data computing and the Hadoop distributed file system (HDFS) for fast I/O access. The MapReduce model is designed for the irregular parallel accumulation of raw data simulation, which greatly reduces the parallel efficiency of graphics processing unit (GPU) based simulation methods. In addition, three kinds of optimization strategies are put forward from the aspects of programming model, HDFS configuration and scheduling. The experimental results show that the cloud computing based algorithm achieves 4× speedup over the baseline serial approach in an 8-node cloud environment, and each optimization strategy can improve about 20%. This work proves that the proposed cloud algorithm is capable of solving the computing intensive and data intensive issues in SAR raw data simulation, and is easily extended to large scale computing to achieve higher acceleration. PMID:28075343
Experience in highly parallel processing using DAP
NASA Technical Reports Server (NTRS)
Parkinson, D.
1987-01-01
Distributed Array Processors (DAP) have been in day to day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.
Parallel processing optimization strategy based on MapReduce model in cloud storage environment
NASA Astrophysics Data System (ADS)
Cui, Jianming; Liu, Jiayi; Li, Qiuyan
2017-05-01
Currently, a large number of documents in the cloud storage process employed the way of packaging after receiving all the packets. From the local transmitter this stored procedure to the server, packing and unpacking will consume a lot of time, and the transmission efficiency is low as well. A new parallel processing algorithm is proposed to optimize the transmission mode. According to the operation machine graphs model work, using MPI technology parallel execution Mapper and Reducer mechanism. It is good to use MPI technology to implement Mapper and Reducer parallel mechanism. After the simulation experiment of Hadoop cloud computing platform, this algorithm can not only accelerate the file transfer rate, but also shorten the waiting time of the Reducer mechanism. It will break through traditional sequential transmission constraints and reduce the storage coupling to improve the transmission efficiency.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bonachea, D.; Dickens, P.; Thakur, R.
There is a growing interest in using Java as the language for developing high-performance computing applications. To be successful in the high-performance computing domain, however, Java must not only be able to provide high computational performance, but also high-performance I/O. In this paper, we first examine several approaches that attempt to provide high-performance I/O in Java - many of which are not obvious at first glance - and evaluate their performance on two parallel machines, the IBM SP and the SGI Origin2000. We then propose extensions to the Java I/O library that address the deficiencies in the Java I/O APImore » and improve performance dramatically. The extensions add bulk (array) I/O operations to Java, thereby removing much of the overhead currently associated with array I/O in Java. We have implemented the extensions in two ways: in a standard JVM using the Java Native Interface (JNI) and in a high-performance parallel dialect of Java called Titanium. We describe the two implementations and present performance results that demonstrate the benefits of the proposed extensions.« less
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.
2016-12-01
The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled Hg transport module to investigate the mercury pollution. In this study, we present our work of transplanting the GNAQPMS model on Intel Xeon Phi processor, Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting Many Integrated Core Architecture (MIC) architecture. Compared with the first generation Knight Corner (KNC), KNL has more new hardware features, that it can be used as unique processor as well as coprocessor with other CPU. According to the Vtune tool, the high overhead modules in GNAQPMS model have been addressed, including CBMZ gas chemistry, advection and convection module, and wet deposition module. These high overhead modules were accelerated by optimizing code and using new techniques of KNL. The following optimized measures was done: 1) Changing the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP; 2.Vectorizing the code to using the 512-bit wide vector computation unit. 3. Reducing unnecessary memory access and calculation. 4. Reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ. 5. Changing the way of global communication from files writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased both on CPU and KNL platform, the single-node test showed that optimized version has 2.6x speedup on two sockets CPU platform and 3.3x speedup on one socket KNL platform compared with the baseline version code, which means the KNL has 1.29x speedup when compared with 2 sockets CPU platform.
Conflict Detection Algorithm to Minimize Locking for MPI-IO Atomicity
NASA Astrophysics Data System (ADS)
Sehrish, Saba; Wang, Jun; Thakur, Rajeev
Many scientific applications require high-performance concurrent I/O accesses to a file by multiple processes. Those applications rely indirectly on atomic I/O capabilities in order to perform updates to structured datasets, such as those stored in HDF5 format files. Current support for atomicity in MPI-IO is provided by locking around the operations, imposing lock overhead in all situations, even though in many cases these operations are non-overlapping in the file. We propose to isolate non-overlapping accesses from overlapping ones in independent I/O cases, allowing the non-overlapping ones to proceed without imposing lock overhead. To enable this, we have implemented an efficient conflict detection algorithm in MPI-IO using MPI file views and datatypes. We show that our conflict detection scheme incurs minimal overhead on I/O operations, making it an effective mechanism for avoiding locks when they are not needed.
By Hand or Not By-Hand: A Case Study of Alternative Approaches to Parallelize CFD Applications
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Bailey, David (Technical Monitor)
1997-01-01
While parallel processing promises to speed up applications by several orders of magnitude, the performance achieved still depends upon several factors, including the multiprocessor architecture, system software, data distribution and alignment, as well as the methods used for partitioning the application and mapping its components onto the architecture. The existence of the Gorden Bell Prize given out at Supercomputing every year suggests that while good performance can be attained for real applications on general purpose multiprocessors, the large investment in man-power and time still has to be repeated for each application-machine combination. As applications and machine architectures become more complex, the cost and time-delays for obtaining performance by hand will become prohibitive. Computer users today can turn to three possible avenues for help: parallel libraries, parallel languages and compilers, interactive parallelization tools. The success of these methodologies, in turn, depends on proper application of data dependency analysis, program structure recognition and transformation, performance prediction as well as exploitation of user supplied knowledge. NASA has been developing multidisciplinary applications on highly parallel architectures under the High Performance Computing and Communications Program. Over the past six years, the transition of underlying hardware and system software have forced the scientists to spend a large effort to migrate and recede their applications. Various attempts to exploit software tools to automate the parallelization process have not produced favorable results. In this paper, we report our most recent experience with CAPTOOL, a package developed at Greenwich University. We have chosen CAPTOOL for three reasons: 1. CAPTOOL accepts a FORTRAN 77 program as input. This suggests its potential applicability to a large collection of legacy codes currently in use. 2. CAPTOOL employs domain decomposition to obtain parallelism. Although the fact that not all kinds of parallelism are handled may seem unappealing, many NASA applications in computational aerosciences as well as earth and space sciences are amenable to domain decomposition. 3. CAPTOOL generates code for a large variety of environments employed across NASA centers: MPI/PVM on network of workstations to the IBS/SP2 and CRAY/T3D.
Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.
The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it with respect to a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.
Performance Modeling of Network-Attached Storage Device Based Hierarchical Mass Storage Systems
NASA Technical Reports Server (NTRS)
Menasce, Daniel A.; Pentakalos, Odysseas I.
1995-01-01
Network attached storage devices improve I/O performance by separating control and data paths and eliminating host intervention during the data transfer phase. Devices are attached to both a high speed network for data transfer and to a slower network for control messages. Hierarchical mass storage systems use disks to cache the most recently used files and a combination of robotic and manually mounted tapes to store the bulk of the files in the file system. This paper shows how queuing network models can be used to assess the performance of hierarchical mass storage systems that use network attached storage devices as opposed to host attached storage devices. Simulation was used to validate the model. The analytic model presented here can be used, among other things, to evaluate the protocols involved in 1/0 over network attached devices.
Performance regression manager for large scale systems
Faraj, Daniel A.
2017-10-17
System and computer program product to perform an operation comprising generating, based on a first output generated by a first execution instance of a command, a first output file specifying a value of at least one performance metric, wherein the first output file is formatted according to a predefined format, comparing the value of the at least one performance metric in the first output file to a value of the performance metric in a second output file, the second output file having been generated based on a second output generated by a second execution instance of the command, and outputting for display an indication of a result of the comparison of the value of the at least one performance metric of the first output file to the value of the at least one performance metric of the second output file.
Distributed Finite Element Analysis Using a Transputer Network
NASA Technical Reports Server (NTRS)
Watson, James; Favenesi, James; Danial, Albert; Tombrello, Joseph; Yang, Dabby; Reynolds, Brian; Turrentine, Ronald; Shephard, Mark; Baehmann, Peggy
1989-01-01
The principal objective of this research effort was to demonstrate the extraordinarily cost effective acceleration of finite element structural analysis problems using a transputer-based parallel processing network. This objective was accomplished in the form of a commercially viable parallel processing workstation. The workstation is a desktop size, low-maintenance computing unit capable of supercomputer performance yet costs two orders of magnitude less. To achieve the principal research objective, a transputer based structural analysis workstation termed XPFEM was implemented with linear static structural analysis capabilities resembling commercially available NASTRAN. Finite element model files, generated using the on-line preprocessing module or external preprocessing packages, are downloaded to a network of 32 transputers for accelerated solution. The system currently executes at about one third Cray X-MP24 speed but additional acceleration appears likely. For the NASA selected demonstration problem of a Space Shuttle main engine turbine blade model with about 1500 nodes and 4500 independent degrees of freedom, the Cray X-MP24 required 23.9 seconds to obtain a solution while the transputer network, operated from an IBM PC-AT compatible host computer, required 71.7 seconds. Consequently, the $80,000 transputer network demonstrated a cost-performance ratio about 60 times better than the $15,000,000 Cray X-MP24 system.
Kindlmann, Gordon; Chiw, Charisee; Seltzer, Nicholas; Samuels, Lamont; Reppy, John
2016-01-01
Many algorithms for scientific visualization and image analysis are rooted in the world of continuous scalar, vector, and tensor fields, but are programmed in low-level languages and libraries that obscure their mathematical foundations. Diderot is a parallel domain-specific language that is designed to bridge this semantic gap by providing the programmer with a high-level, mathematical programming notation that allows direct expression of mathematical concepts in code. Furthermore, Diderot provides parallel performance that takes advantage of modern multicore processors and GPUs. The high-level notation allows a concise and natural expression of the algorithms and the parallelism allows efficient execution on real-world datasets.
Study of Solid State Drives performance in PROOF distributed analysis system
NASA Astrophysics Data System (ADS)
Panitkin, S. Y.; Ernst, M.; Petkus, R.; Rind, O.; Wenaus, T.
2010-04-01
Solid State Drives (SSD) is a promising storage technology for High Energy Physics parallel analysis farms. Its combination of low random access time and relatively high read speed is very well suited for situations where multiple jobs concurrently access data located on the same drive. It also has lower energy consumption and higher vibration tolerance than Hard Disk Drive (HDD) which makes it an attractive choice in many applications raging from personal laptops to large analysis farms. The Parallel ROOT Facility - PROOF is a distributed analysis system which allows to exploit inherent event level parallelism of high energy physics data. PROOF is especially efficient together with distributed local storage systems like Xrootd, when data are distributed over computing nodes. In such an architecture the local disk subsystem I/O performance becomes a critical factor, especially when computing nodes use multi-core CPUs. We will discuss our experience with SSDs in PROOF environment. We will compare performance of HDD with SSD in I/O intensive analysis scenarios. In particular we will discuss PROOF system performance scaling with a number of simultaneously running analysis jobs.
Message Passing and Shared Address Space Parallelism on an SMP Cluster
NASA Technical Reports Server (NTRS)
Shan, Hongzhang; Singh, Jaswinder P.; Oliker, Leonid; Biswas, Rupak; Biegel, Bryan (Technical Monitor)
2002-01-01
Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI+SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.
High-performance computational fluid dynamics: a custom-code approach
NASA Astrophysics Data System (ADS)
Fannon, James; Loiseau, Jean-Christophe; Valluri, Prashant; Bethune, Iain; Náraigh, Lennon Ó.
2016-07-01
We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier-Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computational fluid dynamics (CFDs) and parallel computing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallel computing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFDs, while also providing insight for those interested in more general aspects of high-performance computing.
Cedar Project---Original goals and progress to date
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cybenko, G.; Kuck, D.; Padua, D.
1990-11-28
This work encompasses a broad attack on high speed parallel processing. Hardware, software, applications development, and performance evaluation and visualization as well as research topics are proposed. Our goal is to develop practical parallel processing for the 1990's.
Parallel-In-Time For Moving Meshes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Falgout, R. D.; Manteuffel, T. A.; Southworth, B.
2016-02-04
With steadily growing computational resources available, scientists must develop e ective ways to utilize the increased resources. High performance, highly parallel software has be- come a standard. However until recent years parallelism has focused primarily on the spatial domain. When solving a space-time partial di erential equation (PDE), this leads to a sequential bottleneck in the temporal dimension, particularly when taking a large number of time steps. The XBraid parallel-in-time library was developed as a practical way to add temporal parallelism to existing se- quential codes with only minor modi cations. In this work, a rezoning-type moving mesh is appliedmore » to a di usion problem and formulated in a parallel-in-time framework. Tests and scaling studies are run using XBraid and demonstrate excellent results for the simple model problem considered herein.« less
Shrimankar, D D; Sathe, S R
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.
Shrimankar, D. D.; Sathe, S. R.
2016-01-01
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868
LMC: Logarithmantic Monte Carlo
NASA Astrophysics Data System (ADS)
Mantz, Adam B.
2017-06-01
LMC is a Markov Chain Monte Carlo engine in Python that implements adaptive Metropolis-Hastings and slice sampling, as well as the affine-invariant method of Goodman & Weare, in a flexible framework. It can be used for simple problems, but the main use case is problems where expensive likelihood evaluations are provided by less flexible third-party software, which benefit from parallelization across many nodes at the sampling level. The parallel/adaptive methods use communication through MPI, or alternatively by writing/reading files, and mostly follow the approaches pioneered by CosmoMC (ascl:1106.025).
Scalable Unix tools on parallel processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gropp, W.; Lusk, E.
1994-12-31
The introduction of parallel processors that run a separate copy of Unix on each process has introduced new problems in managing the user`s environment. This paper discusses some generalizations of common Unix commands for managing files (e.g. 1s) and processes (e.g. ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gurin, Péter; Varga, Szabolcs
2015-06-14
We extend the transfer matrix method of one-dimensional hard core fluids placed between confining walls for that case where the particles can pass each other and at most two layers can form. We derive an eigenvalue equation for a quasi-one-dimensional system of hard squares confined between two parallel walls, where the pore width is between σ and 3σ (σ is the side length of the square). The exact equation of state and the nearest neighbor distribution functions show three different structures: a fluid phase with one layer, a fluid phase with two layers, and a solid-like structure where the fluidmore » layers are strongly correlated. The structural transition between differently ordered fluids develops continuously with increasing density, i.e., no thermodynamic phase transition occurs. The high density structure of the system consists of clusters with two layers which are broken with particles staying in the middle of the pore.« less
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
A visual parallel-BCI speller based on the time-frequency coding strategy.
Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong
2014-04-01
Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper is to develop a visual parallel-BCI speller system based on the time-frequency coding strategy in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, thereby all characters had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character through jointly using the canonical correlation analysis and stepwise linear discriminant analysis. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min(-1), with an average of 54.0 bit min(-1) and 43.0 bit min(-1) in the three rounds and five rounds, respectively. The results indicated that the proposed parallel-BCI could be effectively controlled by users with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.
High-performance parallel analysis of coupled problems for aircraft propulsion
NASA Technical Reports Server (NTRS)
Felippa, C. A.; Farhat, C.; Lanteri, S.; Gumaste, U.; Ronaghi, M.
1994-01-01
Applications are described of high-performance parallel, computation for the analysis of complete jet engines, considering its multi-discipline coupled problem. The coupled problem involves interaction of structures with gas dynamics, heat conduction and heat transfer in aircraft engines. The methodology issues addressed include: consistent discrete formulation of coupled problems with emphasis on coupling phenomena; effect of partitioning strategies, augmentation and temporal solution procedures; sensitivity of response to problem parameters; and methods for interfacing multiscale discretizations in different single fields. The computer implementation issues addressed include: parallel treatment of coupled systems; domain decomposition and mesh partitioning strategies; data representation in object-oriented form and mapping to hardware driven representation, and tradeoff studies between partitioning schemes and fully coupled treatment.
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
DOE Office of Scientific and Technical Information (OSTI.GOV)
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patternsmore » (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.« less
Variable Delay Testing Using ONE
NASA Technical Reports Server (NTRS)
Ishac, Joseph
2002-01-01
This paper investigates the effect of long and changing propagation delays on the performance of TCP file transfers. Tests are performed with machines that emulate communication from a low/medium-earth satellite to Earth by way of a geosynchronous satellite. As a result of these tests, we find that TCP is fairly robust to varying delays given a high enough TCP timer granularity. However, performance degrades noticeably for larger file transfers when a finer timer granularity is used. Such results have also been observed in previous simulations by other researchers, and thus, this work serves as an extension of those results.
Integrated electronics for time-resolved array of single-photon avalanche diodes
NASA Astrophysics Data System (ADS)
Acconcia, G.; Crotti, M.; Rech, I.; Ghioni, M.
2013-12-01
The Time Correlated Single Photon Counting (TCSPC) technique has reached a prominent position among analytical methods employed in a great variety of fields, from medicine and biology (fluorescence spectroscopy) to telemetry (laser ranging) and communication (quantum cryptography). Nevertheless the development of TCSPC acquisition systems featuring both a high number of parallel channels and very high performance is still an open challenge: to satisfy the tight requirements set by the applications, a fully parallel acquisition system requires not only high efficiency single photon detectors but also a read-out electronics specifically designed to obtain the highest performance in conjunction with these sensors. To this aim three main blocks have been designed: a gigahertz bandwidth front-end stage to directly read the custom technology SPAD array avalanche current, a reconfigurable logic to route the detectors output signals to the acquisition chain and an array of time measurement circuits capable of recording the photon arrival times with picoseconds time resolution and a very high linearity. An innovative architecture based on these three circuits will feature a very high number of detectors to perform a truly parallel spatial or spectral analysis and a smaller number of high performance time-to-amplitude converter offering very high performance and a very high conversion frequency while limiting the area occupation and power dissipation. The routing logic will make the dynamic connection between the two arrays possible in order to guarantee that no information gets lost.
Seeing the forest for the trees: Networked workstations as a parallel processing computer
NASA Technical Reports Server (NTRS)
Breen, J. O.; Meleedy, D. M.
1992-01-01
Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby, performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system can not rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Optimizing the Performance of Reactive Molecular Dynamics Simulations for Multi-core Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aktulga, Hasan Metin; Coffman, Paul; Shan, Tzu-Ray
2015-12-01
Hybrid parallelism allows high performance computing applications to better leverage the increasing on-node parallelism of modern supercomputers. In this paper, we present a hybrid parallel implementation of the widely used LAMMPS/ReaxC package, where the construction of bonded and nonbonded lists and evaluation of complex ReaxFF interactions are implemented efficiently using OpenMP parallelism. Additionally, the performance of the QEq charge equilibration scheme is examined and a dual-solver is implemented. We present the performance of the resulting ReaxC-OMP package on a state-of-the-art multi-core architecture Mira, an IBM BlueGene/Q supercomputer. For system sizes ranging from 32 thousand to 16.6 million particles, speedups inmore » the range of 1.5-4.5x are observed using the new ReaxC-OMP software. Sustained performance improvements have been observed for up to 262,144 cores (1,048,576 processes) of Mira with a weak scaling efficiency of 91.5% in larger simulations containing 16.6 million particles.« less
Pawar, Ajinkya M; Thakur, Bhagyashree; Metzger, Zvi; Kfir, Anda; Pawar, Mansing
2016-01-01
Aim: The current ex vivo study compared the efficacy of removing root fillings using ProTaper retreatment files followed by either WaveOne reciprocating file or the Self-Adjusting File (SAF). Materials and Methods: Forty maxillary canines with single oval root canal were selected and sectioned to obtain 18-mm root segments. The root canals were instrumented with WaveOne primary files, followed by obturation using warm lateral compaction, and the sealer was allowed to fully set. The teeth were then divided into two equal groups (N = 20). Initial removal of the bulk of root filling material was performed with ProTaper retreatment files, followed by either WaveOne files (Group 1) or SAF (Group 2). Endosolv R was used as a gutta-percha softener. Preoperative and postoperative high-resolution cone-beam computed tomography (CBCT) was used to measure the volume of the root filling residue that was left after the procedure. Statistical analysis was performed using t-test. Results: The mean volume of root filling residue in Group 1 was 9.4 (±0.5) mm3, whereas in Group 2 the residue volume was 2.6 (±0.4) mm3, (P < 0.001; t-test). Conclusions: When SAF was used after ProTaper retreatment files, significantly less root filling residue was left in the canals compared to when WaveOne was used. PMID:26957798
The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit
NASA Technical Reports Server (NTRS)
Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete;
1998-01-01
Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.
Performance regression manager for large scale systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faraj, Daniel A.
System and computer program product to perform an operation comprising generating, based on a first output generated by a first execution instance of a command, a first output file specifying a value of at least one performance metric, wherein the first output file is formatted according to a predefined format, comparing the value of the at least one performance metric in the first output file to a value of the performance metric in a second output file, the second output file having been generated based on a second output generated by a second execution instance of the command, and outputtingmore » for display an indication of a result of the comparison of the value of the at least one performance metric of the first output file to the value of the at least one performance metric of the second output file.« less
Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM
NASA Astrophysics Data System (ADS)
de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl
2002-03-01
We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 105-106) in reasonable CPU times. The performances of our implementation of the pa! rallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations
NASA Astrophysics Data System (ADS)
Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.
2016-07-01
Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.
Communication library for run-time visualization of distributed, asynchronous data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rowlan, J.; Wightman, B.T.
1994-04-01
In this paper we present a method for collecting and visualizing data generated by a parallel computational simulation during run time. Data distributed across multiple processes is sent across parallel communication lines to a remote workstation, which sorts and queues the data for visualization. We have implemented our method in a set of tools called PORTAL (for Parallel aRchitecture data-TrAnsfer Library). The tools comprise generic routines for sending data from a parallel program (callable from either C or FORTRAN), a semi-parallel communication scheme currently built upon Unix Sockets, and a real-time connection to the scientific visualization program AVS. Our methodmore » is most valuable when used to examine large datasets that can be efficiently generated and do not need to be stored on disk. The PORTAL source libraries, detailed documentation, and a working example can be obtained by anonymous ftp from info.mcs.anl.gov from the file portal.tar.Z from the directory pub/portal.« less
5 CFR 293.403 - Contents of employee performance files.
Code of Federal Regulations, 2010 CFR
2010-01-01
... 5 Administrative Personnel 1 2010-01-01 2010-01-01 false Contents of employee performance files. 293.403 Section 293.403 Administrative Personnel OFFICE OF PERSONNEL MANAGEMENT CIVIL SERVICE REGULATIONS PERSONNEL RECORDS Employee Performance File System Records § 293.403 Contents of employee performance files. (a) A decision on what constitutes ...
Toward Transparent Data Management in Multi-layer Storage Hierarchy for HPC Systems
Wadhwa, Bharti; Byna, Suren; Butt, Ali R.
2018-04-17
Upcoming exascale high performance computing (HPC) systems are expected to comprise multi-tier storage hierarchy, and thus will necessitate innovative storage and I/O mechanisms. Traditional disk and block-based interfaces and file systems face severe challenges in utilizing capabilities of storage hierarchies due to the lack of hierarchy support and semantic interfaces. Object-based and semantically-rich data abstractions for scientific data management on large scale systems offer a sustainable solution to these challenges. Such data abstractions can also simplify users involvement in data movement. Here, we take the first steps of realizing such an object abstraction and explore storage mechanisms for these objectsmore » to enhance I/O performance, especially for scientific applications. We explore how an object-based interface can facilitate next generation scalable computing systems by presenting the mapping of data I/O from two real world HPC scientific use cases: a plasma physics simulation code (VPIC) and a cosmology simulation code (HACC). Our storage model stores data objects in different physical organizations to support data movement across layers of memory/storage hierarchy. Our implementation sclaes well to 16K parallel processes, and compared to the state of the art, such as MPI-IO and HDF5, our object-based data abstractions and data placement strategy in multi-level storage hierarchy achieves up to 7 X I/O performance improvement for scientific data.« less
I/O load balancing for big data HPC applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paul, Arnab K.; Goyal, Arpit; Wang, Feiyi
High Performance Computing (HPC) big data problems require efficient distributed storage systems. However, at scale, such storage systems often experience load imbalance and resource contention due to two factors: the bursty nature of scientific application I/O; and the complex I/O path that is without centralized arbitration and control. For example, the extant Lustre parallel file system-that supports many HPC centers-comprises numerous components connected via custom network topologies, and serves varying demands of a large number of users and applications. Consequently, some storage servers can be more loaded than others, which creates bottlenecks and reduces overall application I/O performance. Existing solutionsmore » typically focus on per application load balancing, and thus are not as effective given their lack of a global view of the system. In this paper, we propose a data-driven approach to load balance the I/O servers at scale, targeted at Lustre deployments. To this end, we design a global mapper on Lustre Metadata Server, which gathers runtime statistics from key storage components on the I/O path, and applies Markov chain modeling and a minimum-cost maximum-flow algorithm to decide where data should be placed. Evaluation using a realistic system simulator and a real setup shows that our approach yields better load balancing, which in turn can improve end-to-end performance.« less
Toward Transparent Data Management in Multi-layer Storage Hierarchy for HPC Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wadhwa, Bharti; Byna, Suren; Butt, Ali R.
Upcoming exascale high performance computing (HPC) systems are expected to comprise multi-tier storage hierarchy, and thus will necessitate innovative storage and I/O mechanisms. Traditional disk and block-based interfaces and file systems face severe challenges in utilizing capabilities of storage hierarchies due to the lack of hierarchy support and semantic interfaces. Object-based and semantically-rich data abstractions for scientific data management on large scale systems offer a sustainable solution to these challenges. Such data abstractions can also simplify users involvement in data movement. Here, we take the first steps of realizing such an object abstraction and explore storage mechanisms for these objectsmore » to enhance I/O performance, especially for scientific applications. We explore how an object-based interface can facilitate next generation scalable computing systems by presenting the mapping of data I/O from two real world HPC scientific use cases: a plasma physics simulation code (VPIC) and a cosmology simulation code (HACC). Our storage model stores data objects in different physical organizations to support data movement across layers of memory/storage hierarchy. Our implementation sclaes well to 16K parallel processes, and compared to the state of the art, such as MPI-IO and HDF5, our object-based data abstractions and data placement strategy in multi-level storage hierarchy achieves up to 7 X I/O performance improvement for scientific data.« less
Comparing an FPGA to a Cell for an Image Processing Application
NASA Astrophysics Data System (ADS)
Rakvic, Ryan N.; Ngo, Hau; Broussard, Randy P.; Ives, Robert W.
2010-12-01
Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to discover the parallel nature of modern image processing algorithms. On the other hand, PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, Iris Recognition Systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.
CRITIC2: A program for real-space analysis of quantum chemical interactions in solids
NASA Astrophysics Data System (ADS)
Otero-de-la-Roza, A.; Johnson, Erin R.; Luaña, Víctor
2014-03-01
We present CRITIC2, a program for the analysis of quantum-mechanical atomic and molecular interactions in periodic solids. This code, a greatly improved version of the previous CRITIC program (Otero-de-la Roza et al., 2009), can: (i) find critical points of the electron density and related scalar fields such as the electron localization function (ELF), Laplacian, … (ii) integrate atomic properties in the framework of Bader’s Atoms-in-Molecules theory (QTAIM), (iii) visualize non-covalent interactions in crystals using the non-covalent interactions (NCI) index, (iv) generate relevant graphical representations including lines, planes, gradient paths, contour plots, atomic basins, … and (v) perform transformations between file formats describing scalar fields and crystal structures. CRITIC2 can interface with the output produced by a variety of electronic structure programs including WIEN2k, elk, PI, abinit, Quantum ESPRESSO, VASP, Gaussian, and, in general, any other code capable of writing the scalar field under study to a three-dimensional grid. CRITIC2 is parallelized, completely documented (including illustrative test cases) and publicly available under the GNU General Public License. Catalogue identifier: AECB_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECB_v2_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: yes No. of lines in distributed program, including test data, etc.: 11686949 No. of bytes in distributed program, including test data, etc.: 337020731 Distribution format: tar.gz Programming language: Fortran 77 and 90. Computer: Workstations. Operating system: Unix, GNU/Linux. Has the code been vectorized or parallelized?: Shared-memory parallelization can be used for most tasks. Classification: 7.3. Catalogue identifier of previous version: AECB_v1_0 Journal reference of previous version: Comput. Phys. Comm. 180 (2009) 157 Nature of problem: Analysis of quantum-chemical interactions in periodic solids by means of atoms-in-molecules and related formalisms. Solution method: Critical point search using Newton’s algorithm, atomic basin integration using bisection, qtree and grid-based algorithms, diverse graphical representations and computation of the non-covalent interactions index on a three-dimensional grid. Additional comments: !!!!! The distribution file for this program is over 330 Mbytes and therefore is not delivered directly when download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. !!!!! Running time: Variable, depending on the crystal and the source of the underlying scalar field.
NASA Astrophysics Data System (ADS)
Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
2018-02-01
De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amadio, G.; et al.
An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physicsmore » models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.« less
Performance regression manager for large scale systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Faraj, Daniel A.
Methods comprising generating, based on a first output generated by a first execution instance of a command, a first output file specifying a value of at least one performance metric, wherein the first output file is formatted according to a predefined format, comparing the value of the at least one performance metric in the first output file to a value of the performance metric in a second output file, the second output file having been generated based on a second output generated by a second execution instance of the command, and outputting for display an indication of a result ofmore » the comparison of the value of the at least one performance metric of the first output file to the value of the at least one performance metric of the second output file.« less
ModelMate - A graphical user interface for model analysis
Banta, Edward R.
2011-01-01
ModelMate is a graphical user interface designed to facilitate use of model-analysis programs with models. This initial version of ModelMate supports one model-analysis program, UCODE_2005, and one model software program, MODFLOW-2005. ModelMate can be used to prepare input files for UCODE_2005, run UCODE_2005, and display analysis results. A link to the GW_Chart graphing program facilitates visual interpretation of results. ModelMate includes capabilities for organizing directories used with the parallel-processing capabilities of UCODE_2005 and for maintaining files in those directories to be identical to a set of files in a master directory. ModelMate can be used on its own or in conjunction with ModelMuse, a graphical user interface for MODFLOW-2005 and PHAST.
The Spider Center Wide File System; From Concept to Reality
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shipman, Galen M; Dillow, David A; Oral, H Sarp
2009-01-01
The Leadership Computing Facility (LCF) at Oak Ridge National Laboratory (ORNL) has a diverse portfolio of computational resources ranging from a petascale XT4/XT5 simulation system (Jaguar) to numerous other systems supporting development, visualization, and data analytics. In order to support vastly different I/O needs of these systems Spider, a Lustre-based center wide file system was designed and deployed to provide over 240 GB/s of aggregate throughput with over 10 Petabytes of formatted capacity. A multi-stage InfiniBand network, dubbed as Scalable I/O Network (SION), with over 889 GB/s of bisectional bandwidth was deployed as part of Spider to provide connectivity tomore » our simulation, development, visualization, and other platforms. To our knowledge, while writing this paper, Spider is the largest and fastest POSIX-compliant parallel file system in production. This paper will detail the overall architecture of the Spider system, challenges in deploying and initial testings of a file system of this scale, and novel solutions to these challenges which offer key insights into file system design in the future.« less
Virtual file system on NoSQL for processing high volumes of HL7 messages.
Kimura, Eizen; Ishihara, Ken
2015-01-01
The Standardized Structured Medical Information Exchange (SS-MIX) is intended to be the standard repository for HL7 messages that depend on a local file system. However, its scalability is limited. We implemented a virtual file system using NoSQL to incorporate modern computing technology into SS-MIX and allow the system to integrate local patient IDs from different healthcare systems into a universal system. We discuss its implementation using the database MongoDB and describe its performance in a case study.
The paradigm compiler: Mapping a functional language for the connection machine
NASA Technical Reports Server (NTRS)
Dennis, Jack B.
1989-01-01
The Paradigm Compiler implements a new approach to compiling programs written in high level languages for execution on highly parallel computers. The general approach is to identify the principal data structures constructed by the program and to map these structures onto the processing elements of the target machine. The mapping is chosen to maximize performance as determined through compile time global analysis of the source program. The source language is Sisal, a functional language designed for scientific computations, and the target language is Paris, the published low level interface to the Connection Machine. The data structures considered are multidimensional arrays whose dimensions are known at compile time. Computations that build such arrays usually offer opportunities for highly parallel execution; they are data parallel. The Connection Machine is an attractive target for these computations, and the parallel for construct of the Sisal language is a convenient high level notation for data parallel algorithms. The principles and organization of the Paradigm Compiler are discussed.
Using the General Algebraic Modeling System on Peregrine | High-Performance
directory, type the following: module load gams cp /nopt/nrel/apps/gams/example/trnsport.gms . gams trnsport file. For example, if your model input uses LP procedure and you want to use Gurobi solver to solve it directory that you run GAMS. For example, for the Gurobi solver, its option file is "gurobi.opt"
SAM-FS: LSC's New Solaris-Based Storage Management Product
NASA Technical Reports Server (NTRS)
Angell, Kent
1996-01-01
SAM-FS is a full featured hierarchical storage management (HSM) device that operates as a file system on Solaris-based machines. The SAM-FS file system provides the user with all of the standard UNIX system utilities and calls, and adds some new commands, i.e. archive, release, stage, sls, sfind, and a family of maintenance commands. The system also offers enhancements such as high performance virtual disk read and write, control of the disk through an extent array, and the ability to dynamically allocate block size. SAM-FS provides 'archive sets' which are groupings of data to be copied to secondary storage. In practice, as soon as a file is written to disk, SAM-FS will make copies onto secondary media. SAM-FS is a scalable storage management system. The system can manage millions of files per system, though this is limited today by the speed of UNIX and its utilities. In the future, a new search algorithm will be implemented that will remove logical and performance restrictions on the number of files managed.
Early experiences in developing and managing the neuroscience gateway.
Sivagnanam, Subhashini; Majumdar, Amit; Yoshimoto, Kenneth; Astakhov, Vadim; Bandrowski, Anita; Martone, MaryAnn; Carnevale, Nicholas T
2015-02-01
The last few decades have seen the emergence of computational neuroscience as a mature field where researchers are interested in modeling complex and large neuronal systems and require access to high performance computing machines and associated cyber infrastructure to manage computational workflow and data. The neuronal simulation tools, used in this research field, are also implemented for parallel computers and suitable for high performance computing machines. But using these tools on complex high performance computing machines remains a challenge because of issues with acquiring computer time on these machines located at national supercomputer centers, dealing with complex user interface of these machines, dealing with data management and retrieval. The Neuroscience Gateway is being developed to alleviate and/or hide these barriers to entry for computational neuroscientists. It hides or eliminates, from the point of view of the users, all the administrative and technical barriers and makes parallel neuronal simulation tools easily available and accessible on complex high performance computing machines. It handles the running of jobs and data management and retrieval. This paper shares the early experiences in bringing up this gateway and describes the software architecture it is based on, how it is implemented, and how users can use this for computational neuroscience research using high performance computing at the back end. We also look at parallel scaling of some publicly available neuronal models and analyze the recent usage data of the neuroscience gateway.
Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan
2017-01-01
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
Early experiences in developing and managing the neuroscience gateway
Sivagnanam, Subhashini; Majumdar, Amit; Yoshimoto, Kenneth; Astakhov, Vadim; Bandrowski, Anita; Martone, MaryAnn; Carnevale, Nicholas. T.
2015-01-01
SUMMARY The last few decades have seen the emergence of computational neuroscience as a mature field where researchers are interested in modeling complex and large neuronal systems and require access to high performance computing machines and associated cyber infrastructure to manage computational workflow and data. The neuronal simulation tools, used in this research field, are also implemented for parallel computers and suitable for high performance computing machines. But using these tools on complex high performance computing machines remains a challenge because of issues with acquiring computer time on these machines located at national supercomputer centers, dealing with complex user interface of these machines, dealing with data management and retrieval. The Neuroscience Gateway is being developed to alleviate and/or hide these barriers to entry for computational neuroscientists. It hides or eliminates, from the point of view of the users, all the administrative and technical barriers and makes parallel neuronal simulation tools easily available and accessible on complex high performance computing machines. It handles the running of jobs and data management and retrieval. This paper shares the early experiences in bringing up this gateway and describes the software architecture it is based on, how it is implemented, and how users can use this for computational neuroscience research using high performance computing at the back end. We also look at parallel scaling of some publicly available neuronal models and analyze the recent usage data of the neuroscience gateway. PMID:26523124
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Twin-tailed fail-over for fileservers maintaining full performance in the presence of a failure
Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard D.
2008-02-12
A method for maintaining full performance of a file system in the presence of a failure is provided. The file system having N storage devices, where N is an integer greater than zero and N primary file servers where each file server is operatively connected to a corresponding storage device for accessing files therein. The file system further having a secondary file server operatively connected to at least one of the N storage devices. The method including: switching the connection of one of the N storage devices to the secondary file server upon a failure of one of the N primary file servers; and switching the connections of one or more of the remaining storage devices to a primary file server other than the failed file server as necessary so as to prevent a loss in performance and to provide each storage device with an operating file server.
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.
A Review of Lightweight Thread Approaches for High Performance Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castello, Adrian; Pena, Antonio J.; Seo, Sangmin
High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads perfectly work with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massive parallel architectures and thus conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of commonlyfound patterns in current parallel codes. Moreover, wemore » study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns and that those LWT libraries perform better than OS threads-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.« less
NASA Astrophysics Data System (ADS)
Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.
2017-12-01
As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
Execution of parallel algorithms on a heterogeneous multicomputer
NASA Astrophysics Data System (ADS)
Isenstein, Barry S.; Greene, Jonathon
1995-04-01
Many aerospace/defense sensing and dual-use applications require high-performance computing, extensive high-bandwidth interconnect and realtime deterministic operation. This paper will describe the architecture of a scalable multicomputer that includes DSP and RISC processors. A single chassis implementation is capable of delivering in excess of 10 GFLOPS of DSP processing power with 2 Gbytes/s of realtime sensor I/O. A software approach to implementing parallel algorithms called the Parallel Application System (PAS) is also presented. An example of applying PAS to a DSP application is shown.
Parallel k-means++ for Multiple Shared-Memory Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mackey, Patrick S.; Lewis, Robert R.
2016-09-22
In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varyingmore » data sizes.« less
Portable parallel stochastic optimization for the design of aeropropulsion components
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Rhodes, G. S.
1994-01-01
This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Parallel Climate Data Assimilation PSAS Package
NASA Technical Reports Server (NTRS)
Ding, Hong Q.; Chan, Clara; Gennery, Donald B.; Ferraro, Robert D.
1996-01-01
We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to 512node Intel Paragon. The equation solver achieves a sustained 18 Gflops performance. As the results, we achieved an unprecedented 100-fold solution time reduction on the Intel Paragon parallel platform over the Cray C90. This not only meets and exceeds the DAO time requirements, but also significantly enlarges the window of exploration in climate data assimilations.
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava
Faced with physical and energy density limitations on clock speed, contemporary microprocessor designers have increasingly turned to on-chip parallelism for performance gains. Examples include the Intel Xeon Phi, GPGPUs, and similar technologies. Algorithms should accordingly be designed with ample amounts of fine-grained parallelism if they are to realize the full performance of the hardware. This requirement can be challenging for algorithms that are naturally expressed as a sequence of small-matrix operations, such as the Kalman filter methods widely in use in high-energy physics experiments. In the High-Luminosity Large Hadron Collider (HL-LHC), for example, one of the dominant computational problems ismore » expected to be finding and fitting charged-particle tracks during event reconstruction; today, the most common track-finding methods are those based on the Kalman filter. Experience at the LHC, both in the trigger and offline, has shown that these methods are robust and provide high physics performance. Previously we reported the significant parallel speedups that resulted from our efforts to adapt Kalman-filter-based tracking to many-core architectures such as Intel Xeon Phi. Here we report on how effectively those techniques can be applied to more realistic detector configurations and event complexity.« less
User-Defined Data Distributions in High-Level Programming Languages
NASA Technical Reports Server (NTRS)
Diaconescu, Roxana E.; Zima, Hans P.
2006-01-01
One of the characteristic features of today s high performance computing systems is a physically distributed memory. Efficient management of locality is essential for meeting key performance requirements for these architectures. The standard technique for dealing with this issue has involved the extension of traditional sequential programming languages with explicit message passing, in the context of a processor-centric view of parallel computation. This has resulted in complex and error-prone assembly-style codes in which algorithms and communication are inextricably interwoven. This paper presents a high-level approach to the design and implementation of data distributions. Our work is motivated by the need to improve the current parallel programming methodology by introducing a paradigm supporting the development of efficient and reusable parallel code. This approach is currently being implemented in the context of a new programming language called Chapel, which is designed in the HPCS project Cascade.
Integrating Cache Performance Modeling and Tuning Support in Parallelization Tools
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
With the resurgence of distributed shared memory (DSM) systems based on cache-coherent Non Uniform Memory Access (ccNUMA) architectures and increasing disparity between memory and processors speeds, data locality overheads are becoming the greatest bottlenecks in the way of realizing potential high performance of these systems. While parallelization tools and compilers facilitate the users in porting their sequential applications to a DSM system, a lot of time and effort is needed to tune the memory performance of these applications to achieve reasonable speedup. In this paper, we show that integrating cache performance modeling and tuning support within a parallelization environment can alleviate this problem. The Cache Performance Modeling and Prediction Tool (CPMP), employs trace-driven simulation techniques without the overhead of generating and managing detailed address traces. CPMP predicts the cache performance impact of source code level "what-if" modifications in a program to assist a user in the tuning process. CPMP is built on top of a customized version of the Computer Aided Parallelization Tools (CAPTools) environment. Finally, we demonstrate how CPMP can be applied to tune a real Computational Fluid Dynamics (CFD) application.
A practical approach to portability and performance problems on massively parallel supercomputers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Beazley, D.M.; Lomdahl, P.S.
1994-12-08
We present an overview of the tactics we have used to achieve a high-level of performance while improving portability for a large-scale molecular dynamics code SPaSM. SPaSM was originally implemented in ANSI C with message passing for the Connection Machine 5 (CM-5). In 1993, SPaSM was selected as one of the winners in the IEEE Gordon Bell Prize competition for sustaining 50 Gflops on the 1024 node CM-5 at Los Alamos National Laboratory. Achieving this performance on the CM-5 required rewriting critical sections of code in CDPEAC assembler language. In addition, the code made extensive use of CM-5 parallel I/Omore » and the CMMD message passing library. Given this highly specialized implementation, we describe how we have ported the code to the Cray T3D and high performance workstations. In addition we will describe how it has been possible to do this using a single version of source code that runs on all three platforms without sacrificing any performance. Sound too good to be true? We hope to demonstrate that one can realize both code performance and portability without relying on the latest and greatest prepackaged tool or parallelizing compiler.« less
78 FR 26066 - Filing of Plats of Survey; NV
Federal Register 2010, 2011, 2012, 2013, 2014
2013-05-03
... plat, in 1 sheet, representing the dependent resurvey of the Fifth Standard Parallel North through... interest lands for Barrick Gold Exploration, Inc. 2. The Supplemental Plat of the following described lands... DEPARTMENT OF THE INTERIOR Bureau of Land Management [LLNV952000 L14200000.BJ0000 241A; 13-08807...
Processing MPI Datatypes Outside MPI
NASA Astrophysics Data System (ADS)
Ross, Robert; Latham, Robert; Gropp, William; Lusk, Ewing; Thakur, Rajeev
The MPI datatype functionality provides a powerful tool for describing structured memory and file regions in parallel applications, enabling noncontiguous data to be operated on by MPI communication and I/O routines. However, no facilities are provided by the MPI standard to allow users to efficiently manipulate MPI datatypes in their own codes.
Event parallelism: Distributed memory parallel computing for high energy physics experiments
NASA Astrophysics Data System (ADS)
Nash, Thomas
1989-12-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.
Proceedings of the second SISAL users` conference
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, J T; Frerking, C; Miller, P J
1992-12-01
This report contains papers on the following topics: A sisal code for computing the fourier transform on S{sub N}; five ways to fill your knapsack; simulating material dislocation motion in sisal; candis as an interface for sisal; parallelisation and performance of the burg algorithm on a shared-memory multiprocessor; use of genetic algorithm in sisal to solve the file design problem; implementing FFT`s in sisal; programming and evaluating the performance of signal processing applications in the sisal programming environment; sisal and Von Neumann-based languages: translation and intercommunication; an IF2 code generator for ADAM architecture; program partitioning for NUMA multiprocessor computer systems;more » mapping functional parallelism on distributed memory machines; implicit array copying: prevention is better than cure ; mathematical syntax for sisal; an approach for optimizing recursive functions; implementing arrays in sisal 2.0; Fol: an object oriented extension to the sisal language; twine: a portable, extensible sisal execution kernel; and investigating the memory performance of the optimizing sisal compiler.« less
On the impact of communication complexity in the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
On the impact of communication complexity on the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D. B.; Van Rosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical alorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Calibration of hydrological model with programme PEST
NASA Astrophysics Data System (ADS)
Brilly, Mitja; Vidmar, Andrej; Kryžanowski, Andrej; Bezak, Nejc; Šraj, Mojca
2016-04-01
PEST is tool based on minimization of an objective function related to the root mean square error between the model output and the measurement. We use "singular value decomposition", section of the PEST control file, and Tikhonov regularization method for successfully estimation of model parameters. The PEST sometimes failed if inverse problems were ill-posed, but (SVD) ensures that PEST maintains numerical stability. The choice of the initial guess for the initial parameter values is an important issue in the PEST and need expert knowledge. The flexible nature of the PEST software and its ability to be applied to whole catchments at once give results of calibration performed extremely well across high number of sub catchments. Use of parallel computing version of PEST called BeoPEST was successfully useful to speed up calibration process. BeoPEST employs smart slaves and point-to-point communications to transfer data between the master and slaves computers. The HBV-light model is a simple multi-tank-type model for simulating precipitation-runoff. It is conceptual balance model of catchment hydrology which simulates discharge using rainfall, temperature and estimates of potential evaporation. Version of HBV-light-CLI allows the user to run HBV-light from the command line. Input and results files are in XML form. This allows to easily connecting it with other applications such as pre and post-processing utilities and PEST itself. The procedure was applied on hydrological model of Savinja catchment (1852 km2) and consists of twenty one sub-catchments. Data are temporary processed on hourly basis.
A unified framework for building high performance DVEs
NASA Astrophysics Data System (ADS)
Lei, Kaibin; Ma, Zhixia; Xiong, Hua
2011-10-01
A unified framework for integrating PC cluster based parallel rendering with distributed virtual environments (DVEs) is presented in this paper. While various scene graphs have been proposed in DVEs, it is difficult to enable collaboration of different scene graphs. This paper proposes a technique for non-distributed scene graphs with the capability of object and event distribution. With the increase of graphics data, DVEs require more powerful rendering ability. But general scene graphs are inefficient in parallel rendering. The paper also proposes a technique to connect a DVE and a PC cluster based parallel rendering environment. A distributed multi-player video game is developed to show the interaction of different scene graphs and the parallel rendering performance on a large tiled display wall.
Performance of the SERI parallel-passage dehumidifer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schlepp, D.; Barlow, R.
1984-09-01
The key component in improving the performance of solar desiccant cooling systems is the dehumidifier. A parallel-passage geometry for the desiccant dehumidifier has been identified as meeting key criteria of low pressure drop, high mass transfer efficiency, and compact size. An experimental program to build and test a small-scale prototype of this design was undertaken in FY 1982, and the results are presented in this report. Computer models to predict the adsorption/desorption behavior of desiccant dehumidifiers were updated to take into account the geometry of the bed and predict potential system performance using the new component design. The parallel-passage designmore » proved to have high mass transfer effectiveness and low pressure drop over a wide range of test conditions typical of desiccant cooling system operation. The prototype dehumidifier averaged 93% effectiveness at pressure drops of less than 50 Pa at design point conditions. Predictions of system performance using models validated with the experimental data indicate that system thermal coefficients of performance (COPs) of 1.0 to 1.2 and electrical COPs above 8.5 are possible using this design.« less
VizieR Online Data Catalog: Milky Way L/T/M-dwarfs identified in BoRG survey (Holwerda+, 2014)
NASA Astrophysics Data System (ADS)
Holwerda, B. W.; Trenti, M.; Clarkson, W.; Sahu, K.; Bradley, L.; Stiavelli, M.; Pirzkal, N.; de Marchi, G.; Andersen, M.; Bouwens, R.; Ryan, R.
2017-07-01
Our principal data set is the WFC3 data from the BoRG (HST GO/PAR-11700; Trenti et al. 2011ApJ...727L..39T; Bradley et al. 2012ApJ...760..108B) survey to identify Milky Way dwarf stars from their morphology and color. The BoRG observations are undithered HST/WFC3 conducted in pure-parallel with the telescope pointing to a primary spectroscopic target with the Cosmic Origin Spectrograph (COS; typically a high-z QSO at high Galactic latitude). The limitations for such observations are primarily that no dithering strategy can be used (final images are at WFC3 native pixel scale) and total exposure times are dictated by the primary program. (5 data files).
4P: fast computing of population genetics statistics from large DNA polymorphism panels
Benazzo, Andrea; Panziera, Alex; Bertorelle, Giorgio
2015-01-01
Massive DNA sequencing has significantly increased the amount of data available for population genetics and molecular ecology studies. However, the parallel computation of simple statistics within and between populations from large panels of polymorphic sites is not yet available, making the exploratory analyses of a set or subset of data a very laborious task. Here, we present 4P (parallel processing of polymorphism panels), a stand-alone software program for the rapid computation of genetic variation statistics (including the joint frequency spectrum) from millions of DNA variants in multiple individuals and multiple populations. It handles a standard input file format commonly used to store DNA variation from empirical or simulation experiments. The computational performance of 4P was evaluated using large SNP (single nucleotide polymorphism) datasets from human genomes or obtained by simulations. 4P was faster or much faster than other comparable programs, and the impact of parallel computing using multicore computers or servers was evident. 4P is a useful tool for biologists who need a simple and rapid computer program to run exploratory population genetics analyses in large panels of genomic data. It is also particularly suitable to analyze multiple data sets produced in simulation studies. Unix, Windows, and MacOs versions are provided, as well as the source code for easier pipeline implementations. PMID:25628874